Date:      Thu, 29 Oct 2020 08:34:08 +0100 (CET)
From:      Christian Kratzer <ck-lists@cksoft.de>
To:        freebsd-fs@freebsd.org
Subject:   Re: 12.1-RELEASE-p7 panic in zio_free_issue_4_6 (fwd)
Message-ID:  <37c67834-204f-96b6-d37-70bfd832acee@cksoft.de>

Hi,

On Thu, 29 Oct 2020, Andriy Gapon wrote:
> On 28/10/2020 15:41, Christian Kratzer wrote:
>> Hi,
>> 
>> one of my servers with 12.1-RELEASE-p7 started crashing with following
>> 
>> Fatal trap 12: page fault while in kernel mode
>> cpuid = 19; apic id = 31
>> fault virtual address   = 0x30
>> fault code              = supervisor write data, page not present
>> instruction pointer     = 0x20:0xffffffff826877f4
>> stack pointer           = 0x28:0xfffffe011cefeaa0
>> frame pointer           = 0x28:0xfffffe011cefeaa0
>> code segment            = base 0x0, limit 0xfffff, type 0x1b
>>                         = DPL 0, pres 1, long 1, def32 0, gran 1
>> processor eflags        = interrupt enabled, resume, IOPL = 0
>> current process         = 0 (zio_free_issue_2_3)
>> trap number             = 12
>> panic: page fault
>> cpuid = 19
>> time = 1603797129
>> KDB: stack backtrace:
>> #0 0xffffffff80c1d2f7 at kdb_backtrace+0x67
>> #1 0xffffffff80bd062d at vpanic+0x19d
>> #2 0xffffffff80bd0483 at panic+0x43
>> #3 0xffffffff810a8dcc at trap_fatal+0x39c
>> #4 0xffffffff810a8e19 at trap_pfault+0x49
>> #5 0xffffffff810a840f at trap+0x29f
>> #6 0xffffffff81081c9c at calltrap+0x8
>> #7 0xffffffff8272a903 at zio_ddt_free+0x53
>> #8 0xffffffff82727b7c at zio_execute+0xac
>> #9 0xffffffff80c2fad4 at taskqueue_run_locked+0x154
>> #10 0xffffffff80c30e08 at taskqueue_thread_loop+0x98
>> #11 0xffffffff80b90c43 at fork_exit+0x83
>> #12 0xffffffff81082cde at fork_trampoline+0xe
>> Uptime: 1m12s
>> Automatic reboot in 15 seconds - press a key on the console to abort
>> 
>> 
>> I traced things down to importing one of the zpools.
> 
> I suspect that you have a silent corruption on that pool (perhaps because of
> non-ECC RAM?).

This is on a DL380 G7 with 128GB of ECC RAM.  I have run memtest on this server
before without any defects being found.

The SAS disks are on an LSI HBA. They also do not have defects according to
smartctl.

This of course does not rule out that there might be an issue with RAM and
I will need to recheck.

Also I suspect the server might not have enough RAM for doing dedup on this
2 x 7 disk raid-z2 of 1.2GB drives.
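
As a rough sanity check of the RAM concern, the in-core size of the dedup
table (DDT) can be estimated from the amount of unique data and the average
block size. The sketch below assumes roughly 320 bytes of RAM per DDT entry,
which is a commonly quoted rule of thumb and not a number measured on this
pool. The function name and the example figures are hypothetical.

```python
def estimate_ddt_ram_bytes(unique_data_bytes: int,
                           avg_block_bytes: int = 128 * 1024,
                           bytes_per_entry: int = 320) -> int:
    """Rough in-core DDT size, assuming one DDT entry per unique block.

    bytes_per_entry=320 is an assumed rule-of-thumb value, not a measurement.
    """
    if unique_data_bytes < 0:
        raise ValueError("unique_data_bytes must not be negative")
    if avg_block_bytes <= 0 or bytes_per_entry <= 0:
        raise ValueError("block size and entry size must be positive")
    # Ceiling division: a partially filled block still needs an entry.
    entries = -(-unique_data_bytes // avg_block_bytes)
    return entries * bytes_per_entry


if __name__ == "__main__":
    TIB = 1024 ** 4
    GIB = 1024 ** 3
    unique_data = 10 * TIB  # hypothetical amount of unique data
    for block_size in (128 * 1024, 64 * 1024, 16 * 1024):
        ram = estimate_ddt_ram_bytes(unique_data, block_size)
        print(f"avg block {block_size // 1024:>3} KiB -> "
              f"about {ram / GIB:.1f} GiB of DDT")
```

The estimate grows quickly as the average block size shrinks, which is
typical for backup trees with many small files. Where the pool can still be
examined, zdb -DD reports the actual DDT entry counts and sizes.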

The pool was mostly in use for storing backups rsynced over night from two
other servers.

> What you see can happen if a block pointer has a deduplication bit set, but
> the block is not actually deduplicated or deduplication has never been
> enabled at all.
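
To illustrate what "a deduplication bit set" means: each block pointer
carries a 64-bit property word, and a single bit in it marks the block as
deduplicated. The sketch below assumes the layout used by the BP_GET_DEDUP
and BP_GET_CHECKSUM macros in the ZFS sources (dedup flag at bit 62, checksum
type in the 8 bits starting at bit 40); these offsets should be checked
against spa.h in the matching source tree. The Python function names and the
sample value are hypothetical.

```python
# Assumed bit positions, modelled on the BP_GET_* macros in ZFS's spa.h.
DEDUP_BIT = 62
CHECKSUM_SHIFT = 40
CHECKSUM_MASK = 0xFF


def bp_get_dedup(blk_prop: int) -> bool:
    """Return True if the dedup flag is set in the property word."""
    return bool((blk_prop >> DEDUP_BIT) & 1)


def bp_get_checksum(blk_prop: int) -> int:
    """Return the checksum type field of the property word."""
    return (blk_prop >> CHECKSUM_SHIFT) & CHECKSUM_MASK


def flip_bit(word: int, bit: int) -> int:
    """Simulate a single-bit corruption of a 64-bit word."""
    return (word ^ (1 << bit)) & 0xFFFFFFFFFFFFFFFF


if __name__ == "__main__":
    # Hypothetical property word: checksum type 7, dedup flag clear.
    blk_prop = 7 << CHECKSUM_SHIFT
    corrupted = flip_bit(blk_prop, DEDUP_BIT)
    print("original  dedup:", bp_get_dedup(blk_prop))
    print("corrupted dedup:", bp_get_dedup(corrupted))
    print("checksum type unchanged:",
          bp_get_checksum(blk_prop) == bp_get_checksum(corrupted))
```

One flipped bit is enough to make an ordinary block look deduplicated, so the
free path would consult a dedup table that may not exist for that block. This
only illustrates the scenario described above and is not a confirmed diagnosis
of this pool.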

Could I have run into an issue or bug by trying to do too much dedup on this
pool?

> It would help -- with analysis -- to get a vmcore (kernel crash dump) and to
> install the corresponding kernel debug symbols (if not already).

I need to see why this server is not producing kernel crash dumps. My other
setup does, so I should be able to get this done.

> As to recovery, I think that the best solution is to import the pool
> read-only and to copy important data elsewhere.  Then re-create the pool.

I was about to do that but the crash also happens when trying to import
read-only.

I will investigate if I can import based on an older snapshot or checkpoint
but I am not sure if that will do what I want.

I will keep this pool around for a couple of days and will try to get a crash
dump from the system.  After that I will have to delete and recreate the pool
and just wait for backups to roll back in.

Greetings
Christian

-- 
Christian Kratzer                   CK Software GmbH
Email:   ck@cksoft.de               Wildberger Weg 24/2
Phone:   +49 7032 893 997 - 0       D-71126 Gaeufelden
Fax:     +49 7032 893 997 - 9       HRB 245288, Amtsgericht Stuttgart
Mobile:  +49 171 1947 843           Geschaeftsfuehrer: Christian Kratzer
Web:     http://www.cksoft.de/

Date:      Thu, 29 Oct 2020 09:46:33 +0200
From:      Andriy Gapon <avg@FreeBSD.org>
To:        Christian Kratzer <ck@cksoft.de>
Cc:        freebsd-fs@freebsd.org
Subject:   Re: 12.1-RELEASE-p7 panic in zio_free_issue_4_6
Message-ID:  <24b9cc11-0681-2f17-b634-d68878bc67ac@FreeBSD.org>

On 29/10/2020 09:33, Christian Kratzer wrote:
> Hi,
> 
> On Thu, 29 Oct 2020, Andriy Gapon wrote:
>> On 28/10/2020 15:41, Christian Kratzer wrote:
>>> I traced things down to importing one of the zpools.
>>
>> I suspect that you have a silent corruption on that pool (perhaps because of
>> non-ECC RAM?).
> 
> This is on a DL380 G7 with 128GB of ECC RAM.  I have run memtest on this server
> before without any defects being found.
> 
> The SAS disks are on an LSI HBA. They also do not have defects according to
> smartctl.
> 
> This of course does not rule out that there might be an issue with RAM and
> I will need to recheck.
> 
> Also I suspect the server might not have enough RAM for doing dedup on this
> 2 x 7 disk raid-z2 of 1.2GB drives.
> 
> The pool was mostly in use for storing backups rsynced over night from two
> other servers.
> 
>> What you see can happen if a block pointer has a deduplication bit set, but the
>> block is not actually deduplicated or deduplication has never been enabled at
>> all.
> 
> Could I have run into an issue or bug by trying to do too much dedup on this
> pool?
> 
>> It would help -- with analysis -- to get a vmcore (kernel crash dump) and to
>> install the corresponding kernel debug symbols (if not already).
> 
> I need to see why this server is not producing kernel crash dumps. My other
> setup does, so I should be able to get this done.
> 
>> As to recovery, I think that the best solution is to import the pool read-only
>> and to copy important data elsewhere.  Then re-create the pool.
> 
> I was about to do that but the crash also happens when trying to import read-only.
> 
> I will investigate if I can import based on an older snapshot or checkpoint
> but I am not sure if that will do what I want.
> 
> I will keep this pool around for a couple of days and will try to get a crash
> dump from the system.  After that I will have to delete and recreate the pool
> and just wait for backups to roll back in.


Okay, let's see if we can get a vmcore.
Otherwise, this is just guesswork on my part.
The problem could be very different from my initial impression.

-- 
Andriy
