From: Alex Trull <alextzfs@googlemail.com>
Date: Sun, 11 Oct 2009 17:27:18 +0100
To: freebsd-fs@freebsd.org
Cc: pjd@freebsd.org
Subject: Re: zraid2 loses a single disk and becomes difficult to recover
Message-ID: <4d98b5320910110927o62f8f588r9acdeb40a19587ea@mail.gmail.com>
In-Reply-To: <4d98b5320910110741w794c154cs22b527485c1938da@mail.gmail.com>
References: <4d98b5320910110741w794c154cs22b527485c1938da@mail.gmail.com>

Well, after trying a lot of things (zpool import with and without the cache
file in place, etc.), it somehow managed to mount the pool, at least
partially, with errors.
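(For reference, the import attempts mentioned above would have looked roughly
like the following - a sketch only, assuming the stock FreeBSD cachefile path
/boot/zfs/zpool.cache; the exact attempts on this system may have differed:)

  # try the import using the existing cache file
  zpool import -c /boot/zfs/zpool.cache fatman

  # then move the cache file out of the way and let zpool scan the devices itself
  mv /boot/zfs/zpool.cache /boot/zfs/zpool.cache.bak
  zpool import -f fatman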
zfs list output:

cannot iterate filesystems: I/O error
NAME                      USED  AVAIL  REFER  MOUNTPOINT
fatman                   1.40T  1.70T  51.2K  /fatman
fatman/backup             100G  99.5G  95.5G  /fatman/backup
fatman/jail               422G  1.70T  60.5K  /fatman/jail
fatman/jail/havnor        198G  51.7G   112G  /fatman/jail/havnor
fatman/jail/mail         19.4G  30.6G  13.0G  /fatman/jail/mail
fatman/jail/syndicate    16.6G   103G  10.5G  /fatman/jail/syndicate
fatman/jail/thirdforces   159G  41.4G  78.1G  /fatman/jail/thirdforces
fatman/jail/web          24.8G  25.2G  22.3G  /fatman/jail/web
fatman/stash              913G  1.70T   913G  /fatman/stash

(end of the dmesg)

JMR: vdev_uberblock_load_done ubbest ub_txg=46475461 ub_timestamp=1255231841
JMR: vdev_uberblock_load_done ub_txg=46481476 ub_timestamp=1255234263
JMR: vdev_uberblock_load_done ubbest ub_txg=46481476 ub_timestamp=1255234263
JMR: vdev_uberblock_load_done ubbest ub_txg=46475459 ub_timestamp=1255231780
JMR: vdev_uberblock_load_done ubbest ub_txg=46475458 ub_timestamp=1255231750
JMR: vdev_uberblock_load_done ub_txg=46481473 ub_timestamp=1255234263
JMR: vdev_uberblock_load_done ubbest ub_txg=46481473 ub_timestamp=1255234263
JMR: vdev_uberblock_load_done ubbest ub_txg=46481472 ub_timestamp=1255234263
Solaris: WARNING: can't open objset for fatman/jail/margaret
Solaris: WARNING: can't open objset for fatman/jail/margaret
Solaris: WARNING: ZFS replay transaction error 86, dataset fatman/jail/havnor, seq 0x25442, txtype 9
Solaris: WARNING: ZFS replay transaction error 86, dataset fatman/jail/mail, seq 0x1e200, txtype 9
Solaris: WARNING: ZFS replay transaction error 86, dataset fatman/jail/thirdforces, seq 0x732e3, txtype 9

[root@potjie /fatman/jail/mail]# zpool status -v
  pool: fatman
 state: DEGRADED
status: One or more devices has experienced an error resulting in data
        corruption.  Applications may be affected.
action: Restore the file in question if possible.  Otherwise restore the
        entire pool from backup.
   see: http://www.sun.com/msg/ZFS-8000-8A
 scrub: resilver in progress for 0h4m, 0.83% done, 8h21m to go
config:

        NAME           STATE     READ WRITE CKSUM
        fatman         DEGRADED     0     0    34
          raidz2       DEGRADED     0     0   384
            replacing  DEGRADED     0     0     0
              da2/old  REMOVED      0    24     0
              da2      ONLINE       0     0     0  1.71G resilvered
            ad4        ONLINE       0     0     0  21.3M resilvered
            ad6        ONLINE       0     0     0  21.4M resilvered
            ad20       ONLINE       0     0     0  21.3M resilvered
            ad22       ONLINE       0     0     0  21.3M resilvered
            ad17       ONLINE       0     0     0  21.3M resilvered
            da3        ONLINE       0     0     0  21.3M resilvered
            ad10       ONLINE       0     0     1  21.4M resilvered
            ad16       ONLINE       0     0     0  21.2M resilvered
        cache
          ad18         ONLINE       0     0     0

errors: Permanent errors have been detected in the following files:

        fatman/jail/margaret:<0x0>
        fatman/jail/syndicate:<0x0>
        fatman/jail/mail:<0x0>
        /fatman/jail/mail/tmp
        fatman/jail/havnor:<0x0>
        fatman/jail/thirdforces:<0x0>
        fatman/backup:<0x0>

jail/margaret and backup aren't showing up in zfs list, and jail/syndicate
shows up but isn't viewable.

It seems the latest content on the better-looking zfs filesystems is quite
recent.

Any thoughts about what is going on?  I have snapshots for Africa on these
zfs filesystems - any suggestions on trying to get them back?
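(One possible way to copy data out of any surviving snapshots while the pool
stays imported - a sketch only; the snapshot name daily-20091010 and the
destination pool/host names safeplace, otherhost and tank are hypothetical
placeholders, not names taken from this system:)

  # see which snapshots are still listed
  zfs list -t snapshot -r fatman

  # stream one of them to a safe destination pool on the same box ...
  zfs send fatman/jail/mail@daily-20091010 | zfs receive safeplace/mail-recovered

  # ... or to another machine over ssh
  zfs send fatman/jail/mail@daily-20091010 | ssh otherhost "zfs receive tank/mail-recovered"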
--
Alex

2009/10/11 Alex Trull <alextzfs@googlemail.com>:

> Hi All,
>
> My zraid2 broke this morning on RELENG_7 with ZFS v13.
>
> The system failed this morning and came back without the pool, having lost
> a disk.
>
> This is how I found the system:
>
>   pool: fatman
>  state: FAULTED
> status: One or more devices could not be used because the label is missing
>         or invalid.  There are insufficient replicas for the pool to
>         continue functioning.
> action: Destroy and re-create the pool from a backup source.
>    see: http://www.sun.com/msg/ZFS-8000-5E
>  scrub: none requested
> config:
>
>         NAME        STATE     READ WRITE CKSUM
>         fatman      FAULTED      0     0     1  corrupted data
>           raidz2    DEGRADED     0     0     6
>             da2     FAULTED      0     0     0  corrupted data
>             ad4     ONLINE       0     0     0
>             ad6     ONLINE       0     0     0
>             ad20    ONLINE       0     0     0
>             ad22    ONLINE       0     0     0
>             ad17    ONLINE       0     0     0
>             da2     ONLINE       0     0     0
>             ad10    ONLINE       0     0     0
>             ad16    ONLINE       0     0     0
>
> Initially it complained that da3 had gone to da2 (da2 had failed and was
> no longer seen).
>
> I replaced the original da2 and bumped what was originally da3 back up to
> da3 using the controller's ordering.
>
> [root@potjie /dev]# zpool status
>   pool: fatman
>  state: FAULTED
> status: One or more devices could not be used because the label is missing
>         or invalid.  There are insufficient replicas for the pool to
>         continue functioning.
> action: Destroy and re-create the pool from a backup source.
>    see: http://www.sun.com/msg/ZFS-8000-5E
>  scrub: none requested
> config:
>
>         NAME        STATE     READ WRITE CKSUM
>         fatman      FAULTED      0     0     1  corrupted data
>           raidz2    ONLINE       0     0     6
>             da2     UNAVAIL      0     0     0  corrupted data
>             ad4     ONLINE       0     0     0
>             ad6     ONLINE       0     0     0
>             ad20    ONLINE       0     0     0
>             ad22    ONLINE       0     0     0
>             ad17    ONLINE       0     0     0
>             da3     ONLINE       0     0     0
>             ad10    ONLINE       0     0     0
>             ad16    ONLINE       0     0     0
>
> The issue looks very similar to this one (JMR's issue):
> http://freebsd.monkey.org/freebsd-fs/200902/msg00017.html
>
> I've tried the methods there without much result.
>
> Using JMR's patches/debugs to see what is going on, this is what I got:
>
> JMR: vdev_uberblock_load_done ubbest ub_txg=46488653 ub_timestamp=1255246834
> JMR: vdev_uberblock_load_done ub_txg=46475459 ub_timestamp=1255231780
> JMR: vdev_uberblock_load_done ubbest ub_txg=46488653 ub_timestamp=1255246834
> JMR: vdev_uberblock_load_done ub_txg=46475458 ub_timestamp=1255231750
> JMR: vdev_uberblock_load_done ubbest ub_txg=46488653 ub_timestamp=1255246834
> JMR: vdev_uberblock_load_done ub_txg=46481473 ub_timestamp=1255234263
> JMR: vdev_uberblock_load_done ubbest ub_txg=46488653 ub_timestamp=1255246834
> JMR: vdev_uberblock_load_done ub_txg=46481472 ub_timestamp=1255234263
> JMR: vdev_uberblock_load_done ubbest ub_txg=46488653 ub_timestamp=1255246834
>
> But JMR's patch still doesn't let me import, even with a decremented txg.
>
> I then had a look around the drives using zdb and a dirty script:
>
> [root@potjie /dev]# ls /dev/ad* /dev/da2 /dev/da3 | awk '{print "echo "$1";zdb -l "$1" |grep txg"}' | sh
> /dev/ad10
> txg=46488654
> txg=46488654
> txg=46488654
> txg=46488654
> /dev/ad16
> txg=46408223   <- old txg?
> txg=46408223
> txg=46408223
> txg=46408223
> /dev/ad17
> txg=46408223   <- old txg?
> txg=46408223
> txg=46408223
> txg=46408223
> /dev/ad18 (ssd)
> /dev/ad19 (spare drive, removed from the pool some time ago)
> txg=0
> create_txg=0
> txg=0
> create_txg=0
> txg=0
> create_txg=0
> txg=0
> create_txg=0
> /dev/ad20
> txg=46488654
> txg=46488654
> txg=46488654
> txg=46488654
> /dev/ad22
> txg=46488654
> txg=46488654
> txg=46488654
> txg=46488654
> /dev/ad4
> txg=46488654
> txg=46488654
> txg=46488654
> txg=46488654
> /dev/ad6
> txg=46488654
> txg=46488654
> txg=46488654
> txg=46488654
> /dev/da2   <- new drive that replaced the broken da2
> /dev/da3
> txg=46488654
> txg=46488654
> txg=46488654
> txg=46488654
>
> I had not previously seen any checksum or other issues on ad16 and ad17,
> and I do check regularly.
>
> Any thoughts on what to try next?
>
> Regards,
>
> Alex
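(For reference, the "dirty script" label survey quoted above can be written
as a small loop - a sketch that runs the same zdb -l | grep txg over the same
devices, nothing more:)

  #!/bin/sh
  # Dump the txg values recorded in each vdev label of every candidate device;
  # on a healthy pool member the labels normally agree on the current txg.
  for dev in /dev/ad4 /dev/ad6 /dev/ad10 /dev/ad16 /dev/ad17 /dev/ad18 \
             /dev/ad19 /dev/ad20 /dev/ad22 /dev/da2 /dev/da3
  do
          echo "${dev}:"
          zdb -l "${dev}" | grep txg
  done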