Date: Thu, 29 Oct 2020 08:34:08 +0100 (CET)
From: Christian Kratzer <ck-lists@cksoft.de>
To: freebsd-fs@freebsd.org
Subject: Re: 12.1-RELEASE-p7 panic in zio_free_issue_4_6 (fwd)
Message-ID: <37c67834-204f-96b6-d37-70bfd832acee@cksoft.de>

Hi,

On Thu, 29 Oct 2020, Andriy Gapon wrote:
> On 28/10/2020 15:41, Christian Kratzer wrote:
>> Hi,
>>
>> one of my servers with 12.1-RELEASE-p7 started crashing with the following:
>>
>> Fatal trap 12: page fault while in kernel mode
>> cpuid = 19; apic id = 31
>> fault virtual address   = 0x30
>> fault code              = supervisor write data, page not present
>> instruction pointer     = 0x20:0xffffffff826877f4
>> stack pointer           = 0x28:0xfffffe011cefeaa0
>> frame pointer           = 0x28:0xfffffe011cefeaa0
>> code segment            = base 0x0, limit 0xfffff, type 0x1b
>>                         = DPL 0, pres 1, long 1, def32 0, gran 1
>> processor eflags        = interrupt enabled, resume, IOPL = 0
>> current process         = 0 (zio_free_issue_2_3)
>> trap number             = 12
>> panic: page fault
>> cpuid = 19
>> time = 1603797129
>> KDB: stack backtrace:
>> #0 0xffffffff80c1d2f7 at kdb_backtrace+0x67
>> #1 0xffffffff80bd062d at vpanic+0x19d
>> #2 0xffffffff80bd0483 at panic+0x43
>> #3 0xffffffff810a8dcc at trap_fatal+0x39c
>> #4 0xffffffff810a8e19 at trap_pfault+0x49
>> #5 0xffffffff810a840f at trap+0x29f
>> #6 0xffffffff81081c9c at calltrap+0x8
>> #7 0xffffffff8272a903 at zio_ddt_free+0x53
>> #8 0xffffffff82727b7c at zio_execute+0xac
>> #9 0xffffffff80c2fad4 at taskqueue_run_locked+0x154
>> #10 0xffffffff80c30e08 at taskqueue_thread_loop+0x98
>> #11 0xffffffff80b90c43 at fork_exit+0x83
>> #12 0xffffffff81082cde at fork_trampoline+0xe
>> Uptime: 1m12s
>> Automatic reboot in 15 seconds - press a key on the console to abort
>>
>>
>> I traced things down to importing one of the zpools.
>
> I suspect that you have a silent corruption on that pool (perhaps because of
> non-ECC RAM?).

This is on a DL380 G7 with 128GB of ECC RAM. I have run memtest on this server
before without any defects being found.

The SAS disks are on an LSI HBA. They also do not have defects according to
smartctl.

This of course does not rule out an issue with the RAM, and I will need to
recheck.

Also I suspect the server might not have enough RAM for doing dedup on this
2 x 7 disk raid-z2 of 1.2GB drives.

The pool was mostly in use for storing backups rsynced overnight from two
other servers.

> What you see can happen if a block pointer has a deduplication bit set, but the
> block is not actually deduplicated or deduplication has never been enabled at
> all.

Could I have run into an issue or a bug by trying to do too much dedup on this
pool?

> It would help -- with analysis -- to get a vmcore (kernel crash dump) and to
> install the corresponding kernel debug symbols (if not already).

I need to see why this server is not producing kernel crash dumps. My other
setup does, so I should be able to get this done.

> As to recovery, I think that the best solution is to import the pool read-only
> and to copy important data elsewhere. Then re-create the pool.

I was about to do that, but the crash also happens when trying to import
read-only.

I will investigate whether I can import based on an older snapshot or
checkpoint, but I am not sure if that will do what I want.

I will keep this pool around for a couple of days and will try to get a crash
dump from the system. After that I will have to delete and recreate the pool
and just wait for backups to roll back in.

Greetings
Christian

--
Christian Kratzer                   CK Software GmbH
Email:   ck@cksoft.de               Wildberger Weg 24/2
Phone:   +49 7032 893 997 - 0       D-71126 Gaeufelden
Fax:     +49 7032 893 997 - 9       HRB 245288, Amtsgericht Stuttgart
Mobile:  +49 171 1947 843           Geschaeftsfuehrer: Christian Kratzer
Web:     http://www.cksoft.de/


From owner-freebsd-fs@freebsd.org Thu Oct 29 07:46:38 2020
Date: Thu, 29 Oct 2020 09:46:33 +0200
From: Andriy Gapon <avg@FreeBSD.org>
To: Christian Kratzer <ck@cksoft.de>
Cc: freebsd-fs@freebsd.org
Subject: Re: 12.1-RELEASE-p7 panic in zio_free_issue_4_6
Message-ID: <24b9cc11-0681-2f17-b634-d68878bc67ac@FreeBSD.org>
In-Reply-To: <d66296be-a1b9-b2a1-c2ec-164a7ba178@cksoft.de>

On 29/10/2020 09:33, Christian Kratzer wrote:
> Hi,
>
> On Thu, 29 Oct 2020, Andriy Gapon wrote:
>> On 28/10/2020 15:41, Christian Kratzer wrote:
>>> I traced things down to importing one of the zpools.
>>
>> I suspect that you have a silent corruption on that pool (perhaps because of
>> non-ECC RAM?).
>
> This is on a DL380 G7 with 128GB of ECC RAM. I have run memtest on this server
> before without any defects being found.
>
> The SAS disks are on an LSI HBA. They also do not have defects according to
> smartctl.
>
> This of course does not rule out an issue with the RAM, and I will need to
> recheck.
>
> Also I suspect the server might not have enough RAM for doing dedup on this
> 2 x 7 disk raid-z2 of 1.2GB drives.
>
> The pool was mostly in use for storing backups rsynced overnight from two
> other servers.
>
>> What you see can happen if a block pointer has a deduplication bit set, but the
>> block is not actually deduplicated or deduplication has never been enabled at
>> all.
>
> Could I have run into an issue or a bug by trying to do too much dedup on this
> pool?
>
>> It would help -- with analysis -- to get a vmcore (kernel crash dump) and to
>> install the corresponding kernel debug symbols (if not already).
>
> I need to see why this server is not producing kernel crash dumps. My other
> setup does, so I should be able to get this done.
>
>> As to recovery, I think that the best solution is to import the pool read-only
>> and to copy important data elsewhere. Then re-create the pool.
>
> I was about to do that, but the crash also happens when trying to import
> read-only.
>
> I will investigate whether I can import based on an older snapshot or
> checkpoint, but I am not sure if that will do what I want.
>
> I will keep this pool around for a couple of days and will try to get a crash
> dump from the system. After that I will have to delete and recreate the pool
> and just wait for backups to roll back in.

Okay, let's see if we can get a vmcore. Otherwise, this is just guesswork on
my part. The problem could be very different from my initial impression.

--
Andriy
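
The remark in the thread about a block pointer having its deduplication bit set can be illustrated with a small sketch. It is not the kernel's code: `struct demo_blkptr`, `DEMO_BP_DEDUP_MASK` and the `demo_*` functions are hypothetical names made up for this example, and the bit position (bit 62 of the 64-bit property word) is an assumption based on how ZFS's `BP_GET_DEDUP` macro is commonly defined. The point is only that a single flag decides whether a free is routed through the dedup table path (such as `zio_ddt_free` in the backtrace), so one corrupted bit could plausibly send a non-deduplicated block down that path.

```c
#include <assert.h>
#include <stdint.h>

/*
 * Assumption: the dedup flag lives in bit 62 of the block pointer's
 * property word, mirroring BP_GET_DEDUP in ZFS. Names are hypothetical.
 */
#define DEMO_BP_DEDUP_MASK (UINT64_C(1) << 62)

/* Hypothetical stand-in for blkptr_t; only the property word is modelled. */
struct demo_blkptr {
	uint64_t blk_prop;
};

/* Return 1 if the dedup flag is set in the property word, 0 otherwise. */
int
demo_bp_get_dedup(const struct demo_blkptr *bp)
{
	return ((bp->blk_prop & DEMO_BP_DEDUP_MASK) != 0);
}

/* Set or clear the dedup flag without touching any other bit. */
void
demo_bp_set_dedup(struct demo_blkptr *bp, int on)
{
	if (on)
		bp->blk_prop |= DEMO_BP_DEDUP_MASK;
	else
		bp->blk_prop &= ~DEMO_BP_DEDUP_MASK;
}

/* Self-check: toggling the flag must leave the remaining bits unchanged. */
void
demo_selfcheck(void)
{
	struct demo_blkptr bp = { UINT64_C(0x1234) };

	demo_bp_set_dedup(&bp, 1);
	assert(demo_bp_get_dedup(&bp) == 1);
	demo_bp_set_dedup(&bp, 0);
	assert(bp.blk_prop == UINT64_C(0x1234));
}
```

A block pointer whose property word has this flag set but which has no matching dedup table entry is the kind of inconsistency described above. Whether that is what happened on this pool can only be confirmed from a vmcore.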