Date:      Mon, 02 Oct 2017 20:16:15 +0200
From:      Harry Schmalzbauer <freebsd@omnilan.de>
To:        Scott Bennett <bennett@sdf.org>
Cc:        freebsd-stable@freebsd.org
Subject:   Re: panic: Solaris(panic): blkptr invalid CHECKSUM1
Message-ID:  <59D2826F.7020306@omnilan.de>
In-Reply-To: <201710011320.v91DKa1b029498@sdf.org>
References:  <mailman.17.1506859200.76935.freebsd-stable@freebsd.org> <201710011320.v91DKa1b029498@sdf.org>

Regarding Scott Bennett's message from 01.10.2017 15:20 (localtime):
>      On Sat, 30 Sep 2017 23:38:45 +0200 Harry Schmalzbauer <freebsd@omnilan.de>
> wrote:

…
>>
>> OpenIndiana also panics at regular import.
>> Unfortunately I don't know the equivalent of vfs.zfs.recover in OI.
>>
>> panic[cpu1]/thread=ffffff06dafe8be0: blkptr at ffffff06dbe63000 has
>> invalid CHECKSUM 1
>>
>> Warning - stack not written to the dump buffer
>> ffffff001f67f070 genunix:vcmn_err+42 ()
>> ffffff001f67f0e0 zfs:zfs_panic_recover+51 ()
>> ffffff001f67f140 zfs:zfs_blkptr_verify+8d ()
>> ffffff001f67f220 zfs:zio_read+55 ()
>> ffffff001f67f310 zfs:arc_read+662 ()
>> ffffff001f67f370 zfs:traverse_prefetch_metadata+b5 ()
>> ffffff001f67f450 zfs:traverse_visitbp+1c3 ()
>> ffffff001f67f4e0 zfs:traverse_dnode+af ()
>> ffffff001f67f5c0 zfs:traverse_visitbp+6dd ()
>> ffffff001f67f720 zfs:traverse_impl+1a6 ()
>> ffffff001f67f830 zfs:traverse_pool+9f ()
>> ffffff001f67f8a0 zfs:spa_load_verify+1e6 ()
>> ffffff001f67f990 zfs:spa_load_impl+e1c ()
>> ffffff001f67fa30 zfs:spa_load+14e ()
>> ffffff001f67fad0 zfs:spa_load_best+7a ()
>> ffffff001f67fb90 zfs:spa_import+1b0 ()
>> ffffff001f67fbe0 zfs:zfs_ioc_pool_import+10f ()
>> ffffff001f67fc80 zfs:zfsdev_ioctl+4b7 ()
>> ffffff001f67fcc0 genunix:cdev_ioctl+39 ()
>> ffffff001f67fd10 specfs:spec_ioctl+60 ()
>> ffffff001f67fda0 genunix:fop_ioctl+55 ()
>> ffffff001f67fec0 genunix:ioctl+9b ()
>> ffffff001f67ff10 unix:brand_sys_sysenter+1c9 ()
>>
>> This is an important lesson.
>> My impression was that it's not possible to corrupt a complete pool,
>> and that there's always a way to recover healthy/redundant data.
>> Now my striped mirror has all 4 devices healthy and available, but all
>> datasets seem to be lost.
>> No problem for 450G (99.9%), but there's an 80M dataset which I'm
>> really missing :-(
>>
>> Unfortunately I don't know the DVA and blkptr internals, so I won't
>> write a zfs_fsck(8) soon ;-)
>>
>> Does it make sense to dump the disks for further analysis?
>> I need to recreate the pool because I need the machine's resources... :-(
>> Any help highly appreciated!
>>
>      First, if it's not too late already, make a copy of the pool's cache file,
> and save it somewhere in case you need it unchanged again.
>      Can zdb(8) see it without causing a panic, i.e., without importing the
> pool?  You might be able to track down more information if zdb can get you in.

Thank you very much for your help.

zdb(8) is able to get all config data, along with all dataset information.

For the record, I'll provide the zdb(8) output below.
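
(As a rough sketch, not necessarily the exact invocations used here: zdb(8)
can be pointed at a not-imported pool with -e; <poolname> is a placeholder:

    # keep an untouched copy of the cachefile first, as suggested
    cp /boot/zfs/zpool.cache /root/zpool.cache.bak
    # print the pool configuration from the on-disk labels, without importing
    zdb -e -C <poolname>
    # list dataset/object information from the unimported pool
    zdb -e -d <poolname>
)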

In the meantime I recreated the pool and the host is back to life.
Since other pools weren't affected and had plenty of space, I dumped two
of the 4 drives along with the zdb(8) -x dump, though I don't know what
exactly it dumps (all blocks accessed!?; the result is a big sparse file,
but the time it took to write it doesn't allow it to contain anything but
metadata, at best).

Attaching the two native dumps as memory disks works for "zpool import" :-)
To be continued as an answer to Andriy Gapon's reply from today...
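
(A minimal sketch of that step, assuming the dumps are raw images of the
drives; file names, mount point and <poolname> are placeholders:

    # attach the raw drive dumps as memory disks
    mdconfig -a -t vnode -f /otherpool/drive0.img
    mdconfig -a -t vnode -f /otherpool/drive2.img
    # look for importable pools on the md devices and import read-only
    zpool import -o readonly=on -R /mnt <poolname>
)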

>      Another thing you could try with an admittedly very low probability of
> working would be to try importing the pool with one drive of one mirror
> missing, then try it with a different drive of one mirror, and so on, on
> the minor chance that the critical error is limited to one drive.  If you
> find a case
> where that works, then you could try to rebuild the missing drive and then run
> a scrub.  Or vice versa.  This one is time-consuming, I would imagine, given

I did try, although I had no hope that this could change the picture,
since the cause of the inconsistency wasn't drive related.
And as expected, I had no luck.
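
(For completeness, one way to attempt such a degraded import is to let
"zpool import" see only a subset of the devices, e.g. via a scratch
directory of symlinks; a sketch with placeholder device names:

    # expose only three of the four mirror members
    mkdir /tmp/subset
    ln -s /dev/da0 /dev/da1 /dev/da2 /tmp/subset/
    # search only that directory and try a read-only import
    zpool import -d /tmp/subset -o readonly=on -f <poolname>
)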

Dataset mos [META], ID 0, cr_txg 4, 19.2M, 6503550977762669098 objects

    Object  lvl   iblk   dblk  dsize  lsize   %full  type
         2    1   128K    512      0    512    0.00  DSL directory

Dataset mos [META], ID 0, cr_txg 4, 19.2M, 6503550977762669098 objects

    Object  lvl   iblk   dblk  dsize  lsize   %full  type
         2    1   128K    512      0    512    0.00  DSL directory

loading space map for vdev 1 of 2, metaslab 108 of 109 ...
error: blkptr at 0x80d726040 has invalid CHECKSUM 1

Traversing all blocks to verify checksums and verify nothing leaked ...

Assertion failed: (!BP_IS_EMBEDDED(bp) || BPE_GET_ETYPE(bp) ==
BP_EMBEDDED_TYPE_DATA), file
/usr/local/share/deploy-tools/RELENG_11/src/sys/cddl/contrib/opensolaris/uts/common/fs/zfs/arc.c,
line 5220.
loading space map for vdev 1 of 2, metaslab 108 of 109 ...
error: blkptr at 0x80b482e80 has invalid CHECKSUM 1

Traversing all blocks to verify checksums and verify nothing leaked ...

Assertion failed: (!BP_IS_EMBEDDED(bp) || BPE_GET_ETYPE(bp) ==
BP_EMBEDDED_TYPE_DATA), file
/usr/local/share/deploy-tools/RELENG_11/src/sys/cddl/contrib/opensolaris/uts/common/fs/zfs/arc.c,
line 5220.
loading space map for vdev 1 of 2, metaslab 108 of 109 ...
WARNING: Assertion failed: (!BP_IS_EMBEDDED(bp) || BPE_GET_ETYPE(bp) ==
BP_EMBEDDED_TYPE_DATA), file
/usr/local/share/deploy-tools/RELENG_11/src/sys/cddl/contrib/opensolaris/uts/common/fs/zfs/arc.c,
line 5220.
blkptr at 0x80dac4440 has invalid CHECKSUM 1
WARNING: blkptr at 0x80dac4440 has invalid COMPRESS 0
WARNING: blkptr at 0x80dac4440 DVA 0 has invalid VDEV 2337865727
WARNING: blkptr at 0x80dac4440 DVA 1 has invalid VDEV 289407040
Assertion failed: ((hdr)->b_lsize << 9) > 0 (0x0 > 0x0), file
/usr/local/share/deploy-tools/RELENG_11/src/sys/cddl/contrib/opensolaris/uts/common/fs/zfs/arc.c,
line 3128.
Assertion failed: ((hdr)->b_lsize << 9) != 0 (0x0 != 0x0), file
/usr/local/share/deploy-tools/RELENG_11/src/sys/cddl/contrib/opensolaris/uts/common/fs/zfs/arc.c,
line 2301.
Assertion failed: (bytes > 0), file
/usr/local/share/deploy-tools/RELENG_11/src/sys/cddl/contrib/opensolaris/uts/common/fs/zfs/arc.c,
line 4631.
Assertion failed: ((hdr)->b_lsize << 9) != 0 (0x0 != 0x0), file
/usr/local/share/deploy-tools/RELENG_11/src/sys/cddl/contrib/opensolaris/uts/common/fs/zfs/arc.c,
line 2301.
Assertion failed: ((hdr)->b_lsize << 9) != 0 (0x0 != 0x0), file
/usr/local/share/deploy-tools/RELENG_11/src/sys/cddl/contrib/opensolaris/uts/common/fs/zfs/arc.c,
line 2301.
WARNING: blkptr at 0x806d5ccc0 has invalid TYPE 207
WARNING: blkptr at 0x806d5ccc0 has invalid ETYPE 188

WARNING: blkptr at 0x80dac4440 DVA 2 has invalid VDEV 3959586324
Assertion failed: (!BP_IS_EMBEDDED(bp)), file
/usr/local/share/deploy-tools/RELENG_11/src/sys/cddl/contrib/opensolaris/uts/common/fs/zfs/zio.c,
line 1242.
Assertion failed: (zio->io_error != 0), file
/usr/local/share/deploy-tools/RELENG_11/src/sys/cddl/contrib/opensolaris/uts/common/fs/zfs/vdev_mirror.c,
line 619.

Thanks,

-harry


