Date: Tue, 03 Oct 2017 17:34:03 +0200
From: Harry Schmalzbauer <freebsd@omnilan.de>
To: Andriy Gapon <avg@FreeBSD.org>
Cc: freebsd-stable@FreeBSD.org
Subject: Re: panic: Solaris(panic): blkptr invalid CHECKSUM1
Message-ID: <59D3ADEB.3010205@omnilan.de>
In-Reply-To: <59D3A131.8040803@omnilan.de>
References: <59CFC6A6.6030600@omnilan.de> <59CFD37A.8080009@omnilan.de> <59D00EE5.7090701@omnilan.de> <493e3eec-53c6-3846-0386-d5d7f4756b11@FreeBSD.org> <59D28550.3070700@omnilan.de> <59D34DA0.802@omnilan.de> <e8d8084b-5740-2645-69ae-a4e3967c7e59@FreeBSD.org> <59D39C88.4040501@omnilan.de> <4c144055-600c-89cf-13d5-0bf161726d1a@FreeBSD.org> <59D3A131.8040803@omnilan.de>
Regarding Harry Schmalzbauer's message of 03.10.2017 16:39 (localtime):
> Regarding Andriy Gapon's message of 03.10.2017 16:28 (localtime):
>> On 03/10/2017 17:19, Harry Schmalzbauer wrote:
>>> I have tried several different txg IDs, but the latest 5 or so lead to the
>>> panic, and some other randomly picked ones all claim missing devices...
>>> Doh, if I only knew about -T some days ago, when I had all 4 devices
>>> available.
>> I don't think that the error is really about the missing devices.
>> Most likely the real problem is that you are going too far back in history where
>> the data required to import the pool is not present. It's just that there is no
>> special error code to report that condition distinctly, so it gets interpreted
>> as a missing device condition.
> Sounds reasonable.
> When the RAM corruption happened, a live update had been started, during which
> several pool availability checks were done. No data was written.
> The last data write was a few KBytes some minutes before the corruption, and
> the last significant amount written to that pool was a long time before that.
> So I still have hope of finding an importable txg ID.
>
> Are they strictly serialized?
Seems so.
Just for the record: I haven't been able to recover any data yet. Still,
for a pool that isn't damaged too badly, the following steps look
promising; they are the ones that got me closest:
I have attached dumps of the physical disks as md2 and md3.
'zpool import' offers
        cetusPsys                 DEGRADED
          mirror-0                DEGRADED
            8178308212021996317   UNAVAIL  cannot open
            md3                   ONLINE
          mirror-1                DEGRADED
            md2p5                 ONLINE
            4036286347185017167   UNAVAIL  cannot open
Which is known to be corrupt.
This time I also attached zdb(8) dumps (sparse files) of the remaining
two disks (or partitions, respectively).
Now import offers this:
   pool: cetusPsys
     id: 13207378952432032998
  state: ONLINE
 action: The pool can be imported using its name or numeric identifier.
 config:
        cetusPsys   ONLINE
          mirror-0  ONLINE
            md5     ONLINE
            md3     ONLINE
          mirror-1  ONLINE
            md2p5   ONLINE
            md4     ONLINE
'zdb -ue cetusPsys' showed me the latest txg ID (3757573 in my case).
So I decremented the txg ID by one and repeated until the following
warnings, which indicate the fatal panic, vanished:
loading space map for vdev 1 of 2, metaslab 108 of 109 ...
WARNING: blkptr at 0x80e0ead00 has invalid CHECKSUM 1
WARNING: blkptr at 0x80e0ead00 has invalid COMPRESS 0
WARNING: blkptr at 0x80e0ead00 DVA 0 has invalid VDEV 2337865727
WARNING: blkptr at 0x80e0ead00 DVA 1 has invalid VDEV 289407040
WARNING: blkptr at 0x80e0ead00 DVA 2 has invalid VDEV 3959586324
That was the case with 'zdb -c -t 3757569 -AAA -e cetusPsys':
Traversing all blocks to verify metadata checksums and verify nothing
leaked ...
loading space map for vdev 1 of 2, metaslab 108 of 109 ...
89.0M completed ( 6MB/s) estimated time remaining: 3hr 34min 47sec
zdb_blkptr_cb: Got error 122 reading <69, 0, 0, c> -- skipping
86.8G completed ( 588MB/s) estimated time remaining: 0hr 00min 00sec
Error counts:
errno count
122 1
leaked space: vdev 0, offset 0xa01084200, size 512
leaked space: vdev 0, offset 0xd0dc23c00, size 512
leaked space: vdev 0, offset 0x2380182200, size 3072
leaked space: vdev 0, offset 0x2380189a00, size 1536
leaked space: vdev 0, offset 0x2380183000, size 1536
leaked space: vdev 0, offset 0x238039a200, size 2560
leaked space: vdev 0, offset 0x238039be00, size 18944
leaked space: vdev 0, offset 0x23801b3200, size 9216
leaked space: vdev 0, offset 0x33122a8800, size 512
leaked space: vdev 1, offset 0x2808f1600, size 512
leaked space: vdev 1, offset 0x2808f1e00, size 512
leaked space: vdev 1, offset 0x2808f2e00, size 4096
leaked space: vdev 1, offset 0x2808f1a00, size 512
leaked space: vdev 1, offset 0x9010e6c00, size 512
leaked space: vdev 1, offset 0x23c5ad9c00, size 512
leaked space: vdev 1, offset 0x2e00ad4800, size 512
leaked space: vdev 1, offset 0x2f0030b200, size 50176
leaked space: vdev 1, offset 0x2f000ca800, size 512
leaked space: vdev 1, offset 0x2f003a9800, size 15360
leaked space: vdev 1, offset 0x2f003af600, size 13312
leaked space: vdev 1, offset 0x2f00715c00, size 1024
leaked space: vdev 1, offset 0x2f003adc00, size 6144
leaked space: vdev 1, offset 0x2f00363600, size 38912
block traversal size 93540302336 != alloc 93540473344 (leaked 171008)
bp count: 3670624
ganged count: 0
bp logical: 96083156992 avg: 26176
bp physical: 93308853248 avg: 25420 compression: 1.03
bp allocated: 93540302336 avg: 25483 compression: 1.03
bp deduped: 0 ref>1: 0 deduplication: 1.00
SPA allocated: 93540473344 used: 19.98%
additional, non-pointer bps of type 0: 48879
Dittoed blocks on same vdev: 23422
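As a cross-check of the output above, the individual "leaked space" sizes should add up to the total that zdb reports in the "block traversal size" line. A minimal sketch, assuming the zdb output has been saved to a file; the function name sum_leaked and the file name zdb.out are only illustrative:

```shell
#!/bin/sh
# sum_leaked: read zdb output on stdin and print the sum of all
# "leaked space: ... size N" entries (N is the last field of the line).
# The summary line "block traversal size ... (leaked N)" is not counted,
# because it does not start with "leaked space:".
sum_leaked() {
    awk '/^[[:space:]]*leaked space:/ { sum += $NF } END { print sum + 0 }'
}
```

Run as 'sum_leaked < zdb.out'. For the listing above the entries add up to 171008, which matches the "(leaked 171008)" figure and the difference between 93540473344 and 93540302336.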
In my case, import didn't work with the highest non-panicking txg ID:
zpool import -o readonly=on -R /mnt -T 3757569 cetusPsys
cannot import 'cetusPsys': one or more devices is currently unavailable
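The manual search described above (start at the latest txg ID from 'zdb -ue', step back by one, stop when the warnings are gone) could be scripted roughly as follows. This is only a sketch: the function names find_txg and zdb_check and the step limit are made up for illustration, and treating the string "has invalid" in the zdb output as the failure signal is an assumption based on the warnings quoted above. A zdb run that fails in some other way would wrongly count as good, and each 'zdb -c' run may take a long time.

```shell
#!/bin/sh
# find_txg START MAX_STEPS CHECK_CMD [ARGS...]
# Walk txg IDs downward from START, calling "CHECK_CMD ARGS... <txg>"
# for each candidate. Prints the first txg for which the check exits 0.
# Returns 1 if no candidate passes within MAX_STEPS attempts.
find_txg() {
    txg=$1
    steps=$2
    shift 2
    while [ "$steps" -gt 0 ]; do
        if "$@" "$txg"; then
            echo "$txg"
            return 0
        fi
        txg=$((txg - 1))
        steps=$((steps - 1))
    done
    return 1
}

# zdb_check POOL TXG (hypothetical helper)
# Succeeds if the zdb run for TXG prints no "has invalid" warning.
# The zdb options are the ones used in this message.
zdb_check() {
    pool=$1
    txg=$2
    ! zdb -c -t "$txg" -AAA -e "$pool" 2>&1 | grep -q 'has invalid'
}
```

For example, 'find_txg 3757573 10 zdb_check cetusPsys' would try at most ten txg IDs, starting at 3757573. The check command is passed in as an argument so that a different test (or a stub) can be substituted without touching the loop.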
Maybe somebody else will have more luck... just keep the "-T" option
of zpool(8)'s import command in mind.
thanks,
-harry
