Date:      Sun, 01 Oct 2017 08:20:36 -0500
From:      Scott Bennett <bennett@sdf.org>
To:        freebsd-stable@freebsd.org
Cc:        Harry Schmalzbauer <freebsd@omnilan.de>
Subject:   Re: panic: Solaris(panic): blkptr invalid CHECKSUM1
Message-ID:  <201710011320.v91DKa1b029498@sdf.org>
In-Reply-To: <mailman.17.1506859200.76935.freebsd-stable@freebsd.org>
References:  <mailman.17.1506859200.76935.freebsd-stable@freebsd.org>

     On Sat, 30 Sep 2017 23:38:45 +0200 Harry Schmalzbauer <freebsd@omnilan.de>
wrote:
> Regarding Harry Schmalzbauer's message of 30.09.2017 19:25 (localtime):
>>  Regarding Harry Schmalzbauer's message of 30.09.2017 18:30 (localtime):
>>>  Bad surprise.
>>> Most likely I forgot to stop a PCIe-passthrough NIC before shutting down
>>> that (bhyve(8)) guest - jhb@ helped me identify this as the root
>>> cause of the severe memory corruption I regularly had (on stable-11).
>>>
>>> Now this time, corruption affected ZFS's RAM area, obviously.
>>>
>>> What I didn't expect is the panic.
>>> The machine has a memory disk as root, so luckily I can still boot (from
>>> ZFS -> mdpreload rootfs) into single user mode, but the early rc stage
>>> (most likely mounting ZFS datasets) leads to the following panic:
>>>
>>> Trying to mount root from ufs:/dev/ufs/cetusROOT []...
>>> panic: Solaris(panic): blkptr at 0xfffffe0005b6b000 has invalid CHECKSUM 1
>>> cpuid = 1
>>> KDB: stack backtrace:
>>>   [backtrace deleted  --SB]
>>> Haven't done any scrub attempts yet; my expectation is to get all datasets
>>> of the striped mirror pool back...
>>>
>>> Any hints highly appreciated.
>> Now it seems I'm in really big trouble.
>> Regular import doesn't work (not even when booted from cd9660).
>> I get all pools listed, but trying to import (unmounted) leads to the
>> same panic as initially reported, because rc is just doing the same.
>>
>> I booted into single user mode (which works since the bootpool isn't
>> affected and root is a memory disk from the bootpool)
>> and set vfs.zfs.recover=1.
>> But this time I don't even get the list of pools to import; 'zpool
>> import' instantaneously leads to this panic:
>>
>> Solaris: WARNING: blkptr at 0xfffffe0005a8e000 has invalid CHECKSUM 1
>> Solaris: WARNING: blkptr at 0xfffffe0005a8e000 has invalid COMPRESS 0
>> Solaris: WARNING: blkptr at 0xfffffe0005a8e000 DVA 0 has invalid VDEV 2337865727
>> Solaris: WARNING: blkptr at 0xfffffe0005a8e000 DVA 1 has invalid VDEV 289407040
>> Solaris: WARNING: blkptr at 0xfffffe0005a8e000 DVA 2 has invalid VDEV 3959586324
>>
>>
>> Fatal trap 12: page fault while in kernel mode
>> cpuid = 0; apic id = 00
>> fault virtual address   = 0x50
>> fault code              = supervisor read data, page not present
>> instruction pointer     = 0x20:0xffffffff812de904
>> stack pointer           = 0x28:0xfffffe043f6bcbc0
>> frame pointer           = 0x28:0xfffffe043f6bcbc0
>> code segment            = base 0x0, limit 0xfffff, type 0x1b
>>                         = DPL 0, pres 1, long 1, def32 0, gran 1
>> processor eflags        = interrupt enabled, resume, IOPL = 0
>> current process         = 44 (zpool)
>> trap number             = 12
>> panic: page fault
>> cpuid = 0
>
>[...]
>
>OpenIndiana also panics at regular import.
>Unfortunately I don't know the equivalent of vfs.zfs.recover in OI.
>
>panic[cpu1]/thread=ffffff06dafe8be0: blkptr at ffffff06dbe63000 has invalid CHECKSUM 1
>
>Warning - stack not written to the dump buffer
>ffffff001f67f070 genunix:vcmn_err+42 ()
>ffffff001f67f0e0 zfs:zfs_panic_recover+51 ()
>ffffff001f67f140 zfs:zfs_blkptr_verify+8d ()
>ffffff001f67f220 zfs:zio_read+55 ()
>ffffff001f67f310 zfs:arc_read+662 ()
>ffffff001f67f370 zfs:traverse_prefetch_metadata+b5 ()
>ffffff001f67f450 zfs:traverse_visitbp+1c3 ()
>ffffff001f67f4e0 zfs:traverse_dnode+af ()
>ffffff001f67f5c0 zfs:traverse_visitbp+6dd ()
>ffffff001f67f720 zfs:traverse_impl+1a6 ()
>ffffff001f67f830 zfs:traverse_pool+9f ()
>ffffff001f67f8a0 zfs:spa_load_verify+1e6 ()
>ffffff001f67f990 zfs:spa_load_impl+e1c ()
>ffffff001f67fa30 zfs:spa_load+14e ()
>ffffff001f67fad0 zfs:spa_load_best+7a ()
>ffffff001f67fb90 zfs:spa_import+1b0 ()
>ffffff001f67fbe0 zfs:zfs_ioc_pool_import+10f ()
>ffffff001f67fc80 zfs:zfsdev_ioctl+4b7 ()
>ffffff001f67fcc0 genunix:cdev_ioctl+39 ()
>ffffff001f67fd10 specfs:spec_ioctl+60 ()
>ffffff001f67fda0 genunix:fop_ioctl+55 ()
>ffffff001f67fec0 genunix:ioctl+9b ()
>ffffff001f67ff10 unix:brand_sys_sysenter+1c9 ()
>
>This is an important lesson.
>My impression was that it's not possible to corrupt a complete pool, and
>that there's always a way to recover healthy/redundant data.
>Now my striped mirror has all 4 devices healthy and available, but all
>datasets seem to be lost.
>No problem for 450G (99.9%), but there's an 80M dataset which I'm really
>missing :-(
>
>Unfortunately I don't know the DVA and blkptr internals, so I won't
>write a zfs_fsck(8) soon ;-)
>
>Does it make sense to dump the disks for further analysis?
>I need to recreate the pool because I need the machine's resources... :-(
>Any help highly appreciated!
>
     First, if it's not too late already, make a copy of the pool's cache file,
and save it somewhere in case you need it unchanged again.
     Can zdb(8) see it without causing a panic, i.e., without importing the
pool?  You might be able to track down more information if zdb can get you in.
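     A rough sketch of those first two steps, written as a Python dry run that
only prints the commands instead of running them.  The pool name "tank", the
device names and the backup directory are placeholders, and /boot/zfs/zpool.cache
is assumed to be where the cache file lives; adjust all of them to the real
system.  zdb runs in userland, so a failure there should abort the process
rather than panic the kernel.

```python
import shlex

# Assumptions (not from the thread): pool name, device names, backup
# directory and cache file path are placeholders.
CACHE_FILE = "/boot/zfs/zpool.cache"
BACKUP_DIR = "/root/zfs-rescue"
POOL = "tank"
DEVICES = ["/dev/ada0", "/dev/ada1", "/dev/ada2", "/dev/ada3"]


def inspection_commands(pool, devices, cache_file, backup_dir):
    """Return argv lists: back up the cache file, then inspect with zdb."""
    cmds = [
        ["mkdir", "-p", backup_dir],
        # cp -p preserves timestamps and modes on the copy.
        ["cp", "-p", cache_file, backup_dir + "/zpool.cache.orig"],
    ]
    # zdb -l dumps the vdev labels of one device; no import is needed.
    cmds += [["zdb", "-l", dev] for dev in devices]
    # zdb -e examines a pool that is not imported; -p names the
    # directory to search for its devices.
    cmds.append(["zdb", "-e", "-p", "/dev", pool])
    return cmds


commands = inspection_commands(POOL, DEVICES, CACHE_FILE, BACKUP_DIR)
for argv in commands:
    print(" ".join(shlex.quote(arg) for arg in argv))
```

     Printing first and pasting by hand keeps each step deliberate, which
seems sensible on a machine where one wrong command means another reboot.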
     Another thing you could try, with an admittedly very low probability of
working, would be to import the pool with one drive of one mirror missing,
then with a different drive missing, and so on, on the slim chance that the
critical error is limited to one drive.  If you find a case where that works,
you could then try to rebuild the missing drive and run a scrub.  Or vice
versa.  This one is time-consuming, I would imagine, given that each failure
means a reboot. :-(
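
     To keep track of which combinations have been tried across reboots, the
one-drive-missing attempts can be listed up front.  This is only a sketch: it
assumes a stripe of two 2-way mirrors, and the device names are placeholders
rather than anything taken from the thread.

```python
# Assumption: a striped pool of two 2-way mirrors; the device names are
# placeholders, not taken from the thread.
MIRRORS = {
    "mirror-0": ["ada0", "ada1"],
    "mirror-1": ["ada2", "ada3"],
}


def single_omission_trials(mirrors):
    """Return (omitted, present) pairs, one per device in the pool."""
    all_devices = [dev for members in mirrors.values() for dev in members]
    return [
        (omitted, [dev for dev in all_devices if dev != omitted])
        for omitted in all_devices
    ]


trials = single_omission_trials(MIRRORS)
for number, (omitted, present) in enumerate(trials, start=1):
    print(f"trial {number}: leave out {omitted}, import with {', '.join(present)}")
```

     With four drives that is four trials, and every trial still leaves each
mirror with at least one member present.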


                                  Scott Bennett, Comm. ASMELG, CFIAG
**********************************************************************
* Internet:   bennett at sdf.org   *xor*   bennett at freeshell.org  *
*--------------------------------------------------------------------*
* "A well regulated and disciplined militia, is at all times a good  *
* objection to the introduction of that bane of all free governments *
* -- a standing army."                                               *
*    -- Gov. John Hancock, New York Journal, 28 January 1790         *
**********************************************************************
