Date: Sun, 11 Oct 2009 17:27:18 +0100
From: Alex Trull <alextzfs@googlemail.com>
To: freebsd-fs@freebsd.org
Cc: pjd@freebsd.org
Subject: Re: zraid2 loses a single disk and becomes difficult to recover
Message-ID: <4d98b5320910110927o62f8f588r9acdeb40a19587ea@mail.gmail.com>
In-Reply-To: <4d98b5320910110741w794c154cs22b527485c1938da@mail.gmail.com>
References: <4d98b5320910110741w794c154cs22b527485c1938da@mail.gmail.com>
Well, after trying a lot of things (zpool import with or without the cache
file in place, etc.), it randomly managed to mount the pool up, at least
with errors:

zfs list output:

cannot iterate filesystems: I/O error
NAME                      USED  AVAIL  REFER  MOUNTPOINT
fatman                   1.40T  1.70T  51.2K  /fatman
fatman/backup             100G  99.5G  95.5G  /fatman/backup
fatman/jail               422G  1.70T  60.5K  /fatman/jail
fatman/jail/havnor        198G  51.7G   112G  /fatman/jail/havnor
fatman/jail/mail         19.4G  30.6G  13.0G  /fatman/jail/mail
fatman/jail/syndicate    16.6G   103G  10.5G  /fatman/jail/syndicate
fatman/jail/thirdforces   159G  41.4G  78.1G  /fatman/jail/thirdforces
fatman/jail/web          24.8G  25.2G  22.3G  /fatman/jail/web
fatman/stash              913G  1.70T   913G  /fatman/stash

(end of the dmesg)

JMR: vdev_uberblock_load_done ubbest ub_txg=46475461 ub_timestamp=1255231841
JMR: vdev_uberblock_load_done ub_txg=46481476 ub_timestamp=1255234263
JMR: vdev_uberblock_load_done ubbest ub_txg=46481476 ub_timestamp=1255234263
JMR: vdev_uberblock_load_done ubbest ub_txg=46475459 ub_timestamp=1255231780
JMR: vdev_uberblock_load_done ubbest ub_txg=46475458 ub_timestamp=1255231750
JMR: vdev_uberblock_load_done ub_txg=46481473 ub_timestamp=1255234263
JMR: vdev_uberblock_load_done ubbest ub_txg=46481473 ub_timestamp=1255234263
JMR: vdev_uberblock_load_done ubbest ub_txg=46481472 ub_timestamp=1255234263
Solaris: WARNING: can't open objset for fatman/jail/margaret
Solaris: WARNING: can't open objset for fatman/jail/margaret
Solaris: WARNING: ZFS replay transaction error 86, dataset fatman/jail/havnor, seq 0x25442, txtype 9
Solaris: WARNING: ZFS replay transaction error 86, dataset fatman/jail/mail, seq 0x1e200, txtype 9
Solaris: WARNING: ZFS replay transaction error 86, dataset fatman/jail/thirdforces, seq 0x732e3, txtype 9

[root@potjie /fatman/jail/mail]# zpool status -v
  pool: fatman
 state: DEGRADED
status: One or more devices has experienced an error resulting in data
        corruption.  Applications may be affected.
action: Restore the file in question if possible.  Otherwise restore the
        entire pool from backup.
   see: http://www.sun.com/msg/ZFS-8000-8A
 scrub: resilver in progress for 0h4m, 0.83% done, 8h21m to go
config:

        NAME           STATE     READ WRITE CKSUM
        fatman         DEGRADED     0     0    34
          raidz2       DEGRADED     0     0   384
            replacing  DEGRADED     0     0     0
              da2/old  REMOVED      0    24     0
              da2      ONLINE       0     0     0  1.71G resilvered
            ad4        ONLINE       0     0     0  21.3M resilvered
            ad6        ONLINE       0     0     0  21.4M resilvered
            ad20       ONLINE       0     0     0  21.3M resilvered
            ad22       ONLINE       0     0     0  21.3M resilvered
            ad17       ONLINE       0     0     0  21.3M resilvered
            da3        ONLINE       0     0     0  21.3M resilvered
            ad10       ONLINE       0     0     1  21.4M resilvered
            ad16       ONLINE       0     0     0  21.2M resilvered
        cache
          ad18         ONLINE       0     0     0

errors: Permanent errors have been detected in the following files:

        fatman/jail/margaret:<0x0>
        fatman/jail/syndicate:<0x0>
        fatman/jail/mail:<0x0>
        /fatman/jail/mail/tmp
        fatman/jail/havnor:<0x0>
        fatman/jail/thirdforces:<0x0>
        fatman/backup:<0x0>

jail/margaret and backup aren't showing up in zfs list.
jail/syndicate is showing up but isn't viewable.

It seems the latest content on the better-looking zfs filesystems is quite
recent.

Any thoughts about what is going on? I have snapshots for Africa on these
zfs filesystems - any suggestions on trying to get them back?
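In the meantime, before trying anything more invasive, my plan is to dump
whatever snapshots will still read out to a spare disk. A rough, untested
sketch - /mnt/rescue here is just a stand-in for any healthy filesystem
with enough space:

for snap in $(zfs list -H -t snapshot -o name -r fatman); do
        # flatten fatman/jail/mail@foo into fatman_jail_mail_foo.zfs
        out="/mnt/rescue/$(echo $snap | tr '/@' '__').zfs"
        zfs send "$snap" > "$out" || echo "send failed: $snap"
done

If a dataset's objset is too damaged the send should just fail and the
loop will move on to the next snapshot, which seems safer than prodding
the pool itself any further.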
--
Alex

2009/10/11 Alex Trull <alextzfs@googlemail.com>:

> Hi All,
>
> My zraid2 has broken this morning on releng_7 zfs13.
>
> The system failed this morning and came back without the pool, having
> lost a disk.
>
> This is how I found the system:
>
>  pool: fatman
> state: FAULTED
> status: One or more devices could not be used because the label is
>        missing or invalid.  There are insufficient replicas for the
>        pool to continue functioning.
> action: Destroy and re-create the pool from a backup source.
>   see: http://www.sun.com/msg/ZFS-8000-5E
> scrub: none requested
> config:
>
>        NAME        STATE     READ WRITE CKSUM
>        fatman      FAULTED      0     0     1  corrupted data
>          raidz2    DEGRADED     0     0     6
>            da2     FAULTED      0     0     0  corrupted data
>            ad4     ONLINE       0     0     0
>            ad6     ONLINE       0     0     0
>            ad20    ONLINE       0     0     0
>            ad22    ONLINE       0     0     0
>            ad17    ONLINE       0     0     0
>            da2     ONLINE       0     0     0
>            ad10    ONLINE       0     0     0
>            ad16    ONLINE       0     0     0
>
> Initially it complained that da3 had gone to da2 (da2 had failed and was
> no longer seen).
>
> I replaced the original da2 and bumped what was originally da3 back up
> to da3 using the controller's ordering.
>
> [root@potjie /dev]# zpool status
>  pool: fatman
> state: FAULTED
> status: One or more devices could not be used because the label is
>        missing or invalid.  There are insufficient replicas for the
>        pool to continue functioning.
> action: Destroy and re-create the pool from a backup source.
>   see: http://www.sun.com/msg/ZFS-8000-5E
> scrub: none requested
> config:
>
>        NAME        STATE     READ WRITE CKSUM
>        fatman      FAULTED      0     0     1  corrupted data
>          raidz2    ONLINE       0     0     6
>            da2     UNAVAIL      0     0     0  corrupted data
>            ad4     ONLINE       0     0     0
>            ad6     ONLINE       0     0     0
>            ad20    ONLINE       0     0     0
>            ad22    ONLINE       0     0     0
>            ad17    ONLINE       0     0     0
>            da3     ONLINE       0     0     0
>            ad10    ONLINE       0     0     0
>            ad16    ONLINE       0     0     0
>
> The issue looks very similar to this one of JMR's:
> http://freebsd.monkey.org/freebsd-fs/200902/msg00017.html
>
> I've tried the methods there without much result.
>
> Using JMR's patches/debugs to see what is going on, this is what I got:
>
> JMR: vdev_uberblock_load_done ubbest ub_txg=46488653 ub_timestamp=1255246834
> JMR: vdev_uberblock_load_done ub_txg=46475459 ub_timestamp=1255231780
> JMR: vdev_uberblock_load_done ubbest ub_txg=46488653 ub_timestamp=1255246834
> JMR: vdev_uberblock_load_done ub_txg=46475458 ub_timestamp=1255231750
> JMR: vdev_uberblock_load_done ubbest ub_txg=46488653 ub_timestamp=1255246834
> JMR: vdev_uberblock_load_done ub_txg=46481473 ub_timestamp=1255234263
> JMR: vdev_uberblock_load_done ubbest ub_txg=46488653 ub_timestamp=1255246834
> JMR: vdev_uberblock_load_done ub_txg=46481472 ub_timestamp=1255234263
> JMR: vdev_uberblock_load_done ubbest ub_txg=46488653 ub_timestamp=1255246834
>
> But JMR's patch still doesn't let me import, even with a decremented txg.
>
> I then had a look around the drives using zdb and some dirty script:
>
> [root@potjie /dev]# ls /dev/ad* /dev/da2 /dev/da3 | awk '{print "echo "$1";zdb -l "$1" |grep txg"}' | sh
> /dev/ad10
> txg=46488654
> txg=46488654
> txg=46488654
> txg=46488654
> /dev/ad16
> txg=46408223   <- old TXG id?
> txg=46408223
> txg=46408223
> txg=46408223
> /dev/ad17
> txg=46408223   <- old TXG id?
> txg=46408223
> txg=46408223
> txg=46408223
> /dev/ad18 (ssd)
> /dev/ad19 (spare drive, removed from pool some time ago)
> txg=0
> create_txg=0
> txg=0
> create_txg=0
> txg=0
> create_txg=0
> txg=0
> create_txg=0
> /dev/ad20
> txg=46488654
> txg=46488654
> txg=46488654
> txg=46488654
> /dev/ad22
> txg=46488654
> txg=46488654
> txg=46488654
> txg=46488654
> /dev/ad4
> txg=46488654
> txg=46488654
> txg=46488654
> txg=46488654
> /dev/ad6
> txg=46488654
> txg=46488654
> txg=46488654
> txg=46488654
> /dev/da2   <- new drive that replaced the broken da2
> /dev/da3
> txg=46488654
> txg=46488654
> txg=46488654
> txg=46488654
>
> I did not see any checksum errors or other issues on ad16 and ad17
> previously, and I do check regularly.
>
> Any thoughts on what to try next?
>
> Regards,
>
> Alex
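PS: for anyone repeating that label survey, here is the same check as a
plain loop rather than the awk contortion quoted above (the device list
is just this pool's members - adjust it for your own controller layout):

for dev in /dev/ad4 /dev/ad6 /dev/ad10 /dev/ad16 /dev/ad17 \
           /dev/ad20 /dev/ad22 /dev/da2 /dev/da3; do
        echo "== $dev =="
        # zdb -l prints all four vdev labels; the txg values should agree
        # both within a device and across the pool members
        zdb -l $dev | grep txg
done

On a healthy pool every member carries the same txg; the stale 46408223
on ad16/ad17 against 46488654 everywhere else is exactly the sort of skew
this makes obvious.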