From: Wesley Morgan <morganw@chemikals.org>
Date: Tue, 3 Feb 2009 08:07:47 -0500 (EST)
To: Javier Martín Rueda
Cc: freebsd-fs@freebsd.org
Subject: Re: Raidz2 pool with single disk failure is faulted

On Tue, 3 Feb 2009, Javier Martín Rueda wrote:

> I solved the problem. This is how I did it, in case one day it prevents
> somebody from jumping in front of a train :-)
>
> First of all, I got some insight from various sites, mailing list
> archives, documents, etc. Among them, these two were perhaps the most
> helpful:
>
> http://mail.opensolaris.org/pipermail/zfs-discuss/2008-October/051643.html
> http://opensolaris.org/os/community/zfs/docs/ondiskformat0822.pdf
>
> I suspected that my uberblock was somehow corrupted and thought it
> would be worthwhile to roll back to an earlier one.
> However, my pool was raidz2, and the examples I had seen of how to do
> this used simple pools, so I tried a different approach, which in the
> end proved very successful.
>
> First, I added a couple of printfs to vdev_uberblock_load_done(), which
> is in /sys/cddl/contrib/opensolaris/uts/common/fs/zfs/vdev_label.c:
>
> --- vdev_label.c.orig	2009-02-03 13:14:35.000000000 +0100
> +++ vdev_label.c	2009-02-03 13:14:52.000000000 +0100
> @@ -659,10 +659,12 @@
>
>  	if (zio->io_error == 0 && uberblock_verify(ub) == 0) {
>  		mutex_enter(&spa->spa_uberblock_lock);
> +		printf("JMR: vdev_uberblock_load_done ub_txg=%qd ub_timestamp=%qd\n", ub->ub_txg, ub->ub_timestamp);
>  		if (vdev_uberblock_compare(ub, ubbest) > 0)
>  			*ubbest = *ub;
>  		mutex_exit(&spa->spa_uberblock_lock);
>  	}
> +	printf("JMR: vdev_uberblock_load_done ubbest ub_txg=%qd ub_timestamp=%qd\n", ubbest->ub_txg, ubbest->ub_timestamp);
>
>  	zio_buf_free(zio->io_data, zio->io_size);
>  }
>
> After compiling and loading the zfs.ko module, I ran "zpool import" and
> these messages came up:
>
> ...
> JMR: vdev_uberblock_load_done ub_txg=4254783 ub_timestamp=1233545538
> JMR: vdev_uberblock_load_done ubbest ub_txg=4254783 ub_timestamp=1233545538
> JMR: vdev_uberblock_load_done ub_txg=4254782 ub_timestamp=1233545533
> JMR: vdev_uberblock_load_done ubbest ub_txg=4254783 ub_timestamp=1233545538
> JMR: vdev_uberblock_load_done ub_txg=4254781 ub_timestamp=1233545528
> JMR: vdev_uberblock_load_done ubbest ub_txg=4254783 ub_timestamp=1233545538
> JMR: vdev_uberblock_load_done ub_txg=4254780 ub_timestamp=1233545523
> JMR: vdev_uberblock_load_done ubbest ub_txg=4254783 ub_timestamp=1233545538
> JMR: vdev_uberblock_load_done ub_txg=4254779 ub_timestamp=1233545518
> JMR: vdev_uberblock_load_done ubbest ub_txg=4254783 ub_timestamp=1233545538
> JMR: vdev_uberblock_load_done ub_txg=4254778 ub_timestamp=1233545513
> ...
> JMR: vdev_uberblock_load_done ubbest ub_txg=4254783 ub_timestamp=1233545538
>
> So the uberblock with transaction group 4254783 was the most recent. I
> convinced ZFS to use an earlier one with this patch (note the second
> expression I added to the if statement):
>
> --- vdev_label.c.orig	2009-02-03 13:14:35.000000000 +0100
> +++ vdev_label.c	2009-02-03 13:25:43.000000000 +0100
> @@ -659,10 +659,12 @@
>
>  	if (zio->io_error == 0 && uberblock_verify(ub) == 0) {
>  		mutex_enter(&spa->spa_uberblock_lock);
> -		if (vdev_uberblock_compare(ub, ubbest) > 0)
> +		printf("JMR: vdev_uberblock_load_done ub_txg=%qd ub_timestamp=%qd\n", ub->ub_txg, ub->ub_timestamp);
> +		if (vdev_uberblock_compare(ub, ubbest) > 0 && ub->ub_txg < 4254783)
>  			*ubbest = *ub;
>  		mutex_exit(&spa->spa_uberblock_lock);
>  	}
> +	printf("JMR: vdev_uberblock_load_done ubbest ub_txg=%qd ub_timestamp=%qd\n", ubbest->ub_txg, ubbest->ub_timestamp);
>
>  	zio_buf_free(zio->io_data, zio->io_size);
>  }
>
> After compiling and loading the zfs.ko module, I ran "zpool import"
> again, but the pool was still faulted. So I decremented the limit txg
> to "< 4254782", and this time the pool came up as ONLINE. After
> crossing my fingers, I executed "zpool import z1", and it worked fine.
> No data loss, everything back to normal. The only curious thing I've
> noticed is this:
>
> # zpool status
>   pool: z1
>  state: ONLINE
> status: One or more devices could not be used because the label is
> 	missing or invalid.  Sufficient replicas exist for the pool to
> 	continue functioning in a degraded state.
> action: Replace the device using 'zpool replace'.
>    see: http://www.sun.com/msg/ZFS-8000-4J
>  scrub: resilver completed with 0 errors on Tue Feb  3 09:26:40 2009
> config:
>
> 	NAME                     STATE     READ WRITE CKSUM
> 	z1                       ONLINE       0     0     0
> 	  raidz2                 ONLINE       0     0     0
> 	    mirror/gm0           ONLINE       0     0     0
> 	    mirror/gm1           ONLINE       0     0     0
> 	    da2                  ONLINE       0     0     0
> 	    da3                  ONLINE       0     0     0
> 	    8076139616933977534  UNAVAIL      0     0     0  was /dev/da4
> 	    da5                  ONLINE       0     0     0
> 	    da6                  ONLINE       0     0     0
> 	    da7                  ONLINE       0     0     0
>
> errors: No known data errors
>
> As you can see, the raidz2 vdev is marked ONLINE, when I think it
> should be DEGRADED. Nevertheless, the pool is readable and writable,
> and so far I haven't detected any problems. To be safe, I am extracting
> all the data and will recreate the pool from scratch, just in case.
>
> Pending questions:
>
> 1) Why did the "supposed corruption" happen in the first place? I
> advise people not to mix disks from different zpools with the same name
> in the same computer. That's what I did, and maybe it's what caused my
> problems.
>
> 2) Rolling back to an earlier uberblock seems to solve some faulted
> zpool problems. I think it would be interesting to have a program that
> lets you do it in a user-friendly way (after warning you about the
> dangers, etc.).

It would be interesting to see whether the txg from all of your labels
was the same. I would highly advise scrubbing your array. I believe the
reason your "da4" is showing up with only a guid is that ZFS now
recognizes that the da4 it sees is not the correct one. I'm still very
curious how you ended up in that situation. I wonder if you had
corruption that was unknown before you removed da4.
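As a starting point for the user-friendly rollback tool suggested in
question 2 above, the uberblock array can be walked from userland before
touching any kernel code. Below is a minimal sketch in C, based on the
label layout described in the ondiskformat0822.pdf document cited above.
It assumes a native-endian pool and ashift=9 (1 KB uberblock slots), and
reads only the first of the four labels, so treat it as illustrative
rather than a finished tool; the name ublist and the trimmed struct are
this sketch's own inventions, not part of the ZFS source.

/*
 * ublist.c - list candidate uberblocks from the first vdev label.
 *
 * Sketch only: assumes a native-endian pool and 512-byte sectors
 * (ashift=9, i.e. 1 KB uberblock slots), and reads just the first of
 * the four labels.  Offsets taken from ondiskformat0822.pdf.
 *
 *   cc -o ublist ublist.c
 *   ./ublist /dev/da2
 */
#include <sys/types.h>
#include <stdio.h>
#include <stdint.h>
#include <fcntl.h>
#include <unistd.h>

#define	UB_MAGIC	0x00bab10cULL	/* uberblock magic ("oo-ba-bloc!") */
#define	UB_ARRAY_OFF	(128 * 1024)	/* uberblock array offset in label 0 */
#define	UB_ARRAY_SIZE	(128 * 1024)	/* size of the uberblock array */
#define	UB_SLOT		1024		/* one slot per 1 KB with ashift=9 */

/* Leading fields of uberblock_t; enough for what we print here. */
struct ub_head {
	uint64_t ub_magic;
	uint64_t ub_version;
	uint64_t ub_txg;
	uint64_t ub_guid_sum;
	uint64_t ub_timestamp;
};

int
main(int argc, char **argv)
{
	union {
		struct ub_head hd;
		unsigned char pad[UB_SLOT];
	} slot;
	off_t off;
	int fd;

	if (argc != 2) {
		fprintf(stderr, "usage: %s <vdev>\n", argv[0]);
		return (1);
	}
	if ((fd = open(argv[1], O_RDONLY)) == -1) {
		perror(argv[1]);
		return (1);
	}
	/* Scan every slot in label 0's uberblock array. */
	for (off = UB_ARRAY_OFF; off < UB_ARRAY_OFF + UB_ARRAY_SIZE;
	    off += UB_SLOT) {
		if (pread(fd, &slot, sizeof (slot), off) !=
		    (ssize_t)sizeof (slot))
			break;
		if (slot.hd.ub_magic != UB_MAGIC)	/* empty/foreign slot */
			continue;
		printf("offset %7jd: txg=%ju timestamp=%ju\n",
		    (intmax_t)off, (uintmax_t)slot.hd.ub_txg,
		    (uintmax_t)slot.hd.ub_timestamp);
	}
	close(fd);
	return (0);
}

Running it against each pool member (e.g. ./ublist /dev/da2) lists the
candidate txg/timestamp pairs, the same information the printf patch
above pulled out of the kernel, which makes it easy to pick a rollback
target. For the label nvlists themselves, which carry the per-label txg
mentioned above, "zdb -l /dev/da2" should dump all four labels.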