Date: Tue, 03 Oct 2017 17:34:03 +0200
From: Harry Schmalzbauer <freebsd@omnilan.de>
To: Andriy Gapon <avg@FreeBSD.org>
Cc: freebsd-stable@FreeBSD.org
Subject: Re: panic: Solaris(panic): blkptr invalid CHECKSUM1
Message-ID: <59D3ADEB.3010205@omnilan.de>
In-Reply-To: <59D3A131.8040803@omnilan.de>
References: <59CFC6A6.6030600@omnilan.de> <59CFD37A.8080009@omnilan.de> <59D00EE5.7090701@omnilan.de> <493e3eec-53c6-3846-0386-d5d7f4756b11@FreeBSD.org> <59D28550.3070700@omnilan.de> <59D34DA0.802@omnilan.de> <e8d8084b-5740-2645-69ae-a4e3967c7e59@FreeBSD.org> <59D39C88.4040501@omnilan.de> <4c144055-600c-89cf-13d5-0bf161726d1a@FreeBSD.org> <59D3A131.8040803@omnilan.de>
Regarding Harry Schmalzbauer's message from 03.10.2017 16:39 (localtime):
> Regarding Andriy Gapon's message from 03.10.2017 16:28 (localtime):
>> On 03/10/2017 17:19, Harry Schmalzbauer wrote:
>>> Have tried several different txg IDs, but the latest 5 or so lead to the
>>> panic, and some other randomly picked ones all claim missing devices...
>>> Doh, if I had only known about -T some days ago, when I still had all 4
>>> devices available.
>> I don't think that the error is really about the missing devices.
>> Most likely the real problem is that you are going too far back in history,
>> to where the data required to import the pool is no longer present.  It's
>> just that there is no special error code to report that condition
>> distinctly, so it gets interpreted as a missing device condition.
> Sounds reasonable.
> When the RAM corruption happened, a live update was started, during which
> several pool availability checks were done, but no data was written.
> The last data write was a few KBytes some minutes before the corruption,
> and the last significant amount written to that pool was long before that.
> So I still have hope of finding an importable txg ID.
>
> Are they strictly serialized?  Seems so.

Just for the records: I couldn't recover any data yet, but in general, if a
pool isn't damaged too badly, the following steps are the ones that got me
closest.

I have attached dumps of the physical disks as md2 and md3.  'zpool import'
offers:

        cetusPsys                   DEGRADED
          mirror-0                  DEGRADED
            8178308212021996317     UNAVAIL  cannot open
            md3                     ONLINE
          mirror-1                  DEGRADED
            md2p5                   ONLINE
            4036286347185017167     UNAVAIL  cannot open

This configuration is known to be corrupt.  This time I also attached zdb(8)
dumps (sparse files) of the remaining two disks resp. the partition.  Now
import offers this:

           pool: cetusPsys
             id: 13207378952432032998
          state: ONLINE
         action: The pool can be imported using its name or numeric identifier.
         config:

                cetusPsys   ONLINE
                  mirror-0  ONLINE
                    md5     ONLINE
                    md3     ONLINE
                  mirror-1  ONLINE
                    md2p5   ONLINE
                    md4     ONLINE

'zdb -ue cetusPsys' showed me the latest txg ID (3757573 in my case).  So I
decremented the txg ID by one and repeated until the following indicator of
the fatal panic vanished:

        loading space map for vdev 1 of 2, metaslab 108 of 109 ...
        WARNING: blkptr at 0x80e0ead00 has invalid CHECKSUM 1
        WARNING: blkptr at 0x80e0ead00 has invalid COMPRESS 0
        WARNING: blkptr at 0x80e0ead00 DVA 0 has invalid VDEV 2337865727
        WARNING: blkptr at 0x80e0ead00 DVA 1 has invalid VDEV 289407040
        WARNING: blkptr at 0x80e0ead00 DVA 2 has invalid VDEV 3959586324
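Just to spell the search out, it can be scripted roughly like this.  This is
an untested sketch, not what I literally typed: it assumes the dumps are
already attached as md devices so that 'zdb -e' finds the pool, that a bad
txg shows up via the blkptr WARNINGs above, and the awk parsing plus the
/tmp log paths are only illustrative.  Note also that every run traverses the
whole pool, so this is slow:

        #!/bin/sh
        pool=cetusPsys
        # latest uberblock txg, taken from the 'zdb -ue' output
        txg=$(zdb -ue "$pool" | awk '$1 == "txg" { print $3 }')
        while [ "$txg" -gt 0 ]; do
                # same command as above, one txg at a time
                zdb -c -t "$txg" -AAA -e "$pool" > "/tmp/zdb.$txg" 2>&1
                if ! grep -q 'WARNING: blkptr' "/tmp/zdb.$txg"; then
                        echo "txg $txg traverses without blkptr warnings"
                        break
                fi
                txg=$((txg - 1))
        done

Whatever txg that loop stops at is the candidate for 'zpool import -T'.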
The run where those warnings finally vanished was
'zdb -c -t 3757569 -AAA -e cetusPsys':

        Traversing all blocks to verify metadata checksums and verify nothing leaked ...

        loading space map for vdev 1 of 2, metaslab 108 of 109 ...
         89.0M completed (   6MB/s) estimated time remaining: 3hr 34min 47sec
        zdb_blkptr_cb: Got error 122 reading <69, 0, 0, c>  -- skipping
         86.8G completed ( 588MB/s) estimated time remaining: 0hr 00min 00sec
        Error counts:

                errno  count
                  122      1
        leaked space: vdev 0, offset 0xa01084200, size 512
        leaked space: vdev 0, offset 0xd0dc23c00, size 512
        leaked space: vdev 0, offset 0x2380182200, size 3072
        leaked space: vdev 0, offset 0x2380189a00, size 1536
        leaked space: vdev 0, offset 0x2380183000, size 1536
        leaked space: vdev 0, offset 0x238039a200, size 2560
        leaked space: vdev 0, offset 0x238039be00, size 18944
        leaked space: vdev 0, offset 0x23801b3200, size 9216
        leaked space: vdev 0, offset 0x33122a8800, size 512
        leaked space: vdev 1, offset 0x2808f1600, size 512
        leaked space: vdev 1, offset 0x2808f1e00, size 512
        leaked space: vdev 1, offset 0x2808f2e00, size 4096
        leaked space: vdev 1, offset 0x2808f1a00, size 512
        leaked space: vdev 1, offset 0x9010e6c00, size 512
        leaked space: vdev 1, offset 0x23c5ad9c00, size 512
        leaked space: vdev 1, offset 0x2e00ad4800, size 512
        leaked space: vdev 1, offset 0x2f0030b200, size 50176
        leaked space: vdev 1, offset 0x2f000ca800, size 512
        leaked space: vdev 1, offset 0x2f003a9800, size 15360
        leaked space: vdev 1, offset 0x2f003af600, size 13312
        leaked space: vdev 1, offset 0x2f00715c00, size 1024
        leaked space: vdev 1, offset 0x2f003adc00, size 6144
        leaked space: vdev 1, offset 0x2f00363600, size 38912
        block traversal size 93540302336 != alloc 93540473344 (leaked 171008)

        bp count:          3670624
        ganged count:            0
        bp logical:    96083156992      avg: 26176
        bp physical:   93308853248      avg: 25420     compression: 1.03
        bp allocated:  93540302336      avg: 25483     compression: 1.03
        bp deduped:              0    ref>1:     0   deduplication: 1.00
        SPA allocated: 93540473344     used: 19.98%

        additional, non-pointer bps of type 0:      48879
        Dittoed blocks on same vdev: 23422

In my case, import didn't work even with the highest non-panicking txg ID:

        zpool import -o readonly=on -R /mnt -T 3757569 cetusPsys
        cannot import 'cetusPsys': one or more devices is currently unavailable

Maybe anybody else will have more luck... just keep the "-T" parameter for
zpool(8)'s import command in mind.

thanks,

-harry
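P.S.: In case it's not obvious where the md devices above come from: they are
just the dump files attached as vnode-backed memory disks, along these lines
(a sketch only, the file names are made up):

        # attach an image/dump file; mdconfig prints the unit it allocated (md2, md3, ...)
        mdconfig -a -t vnode -f /path/to/disk0.img
        mdconfig -a -t vnode -f /path/to/disk1.img
        # detach a unit again when done, e.g. md2
        mdconfig -d -u 2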