From: Javier Martín Rueda <jmrueda@diatel.upm.es>
Date: Tue, 03 Feb 2009 14:41:13 +0100
To: Wesley Morgan
Cc: freebsd-fs@freebsd.org
Subject: Re: Raidz2 pool with single disk failure is faulted
Message-ID: <49884979.7080103@diatel.upm.es>

Wesley Morgan wrote:
> On Tue, 3 Feb 2009, Javier Martín Rueda wrote:
>
>> I solved the problem. This is how I did it, in case one day it
>> prevents somebody from jumping in front of a train :-)
>>
>> First of all, I got some insight from various sites, mailing list
>> archives, documents, etc. Among them, these two were maybe the most
>> helpful:
>>
>> http://mail.opensolaris.org/pipermail/zfs-discuss/2008-October/051643.html
>> http://opensolaris.org/os/community/zfs/docs/ondiskformat0822.pdf
>>
>> I suspected that my uberblock was somehow corrupted and thought it
>> would be worthwhile to roll back to an earlier uberblock. However, my
>> pool was raidz2 and the examples I had seen of how to do this used
>> simple pools, so I tried a different approach, which in the end
>> proved very successful.
>>
>> First, I added a couple of printf calls to vdev_uberblock_load_done(),
>> which is in /sys/cddl/contrib/opensolaris/uts/common/fs/zfs/vdev_label.c:
>>
>> --- vdev_label.c.orig   2009-02-03 13:14:35.000000000 +0100
>> +++ vdev_label.c        2009-02-03 13:14:52.000000000 +0100
>> @@ -659,10 +659,12 @@
>>
>>         if (zio->io_error == 0 && uberblock_verify(ub) == 0) {
>>                 mutex_enter(&spa->spa_uberblock_lock);
>> +               printf("JMR: vdev_uberblock_load_done ub_txg=%qd ub_timestamp=%qd\n", ub->ub_txg, ub->ub_timestamp);
>>                 if (vdev_uberblock_compare(ub, ubbest) > 0)
>>                         *ubbest = *ub;
>>                 mutex_exit(&spa->spa_uberblock_lock);
>>         }
>> +       printf("JMR: vdev_uberblock_load_done ubbest ub_txg=%qd ub_timestamp=%qd\n", ubbest->ub_txg, ubbest->ub_timestamp);
>>
>>         zio_buf_free(zio->io_data, zio->io_size);
>> }
>>
>> After compiling and loading the zfs.ko module, I executed "zpool
>> import" and these messages came up:
>>
>> ...
>> JMR: vdev_uberblock_load_done ub_txg=4254783 ub_timestamp=1233545538
>> JMR: vdev_uberblock_load_done ubbest ub_txg=4254783 ub_timestamp=1233545538
>> JMR: vdev_uberblock_load_done ub_txg=4254782 ub_timestamp=1233545533
>> JMR: vdev_uberblock_load_done ubbest ub_txg=4254783 ub_timestamp=1233545538
>> JMR: vdev_uberblock_load_done ub_txg=4254781 ub_timestamp=1233545528
>> JMR: vdev_uberblock_load_done ubbest ub_txg=4254783 ub_timestamp=1233545538
>> JMR: vdev_uberblock_load_done ub_txg=4254780 ub_timestamp=1233545523
>> JMR: vdev_uberblock_load_done ubbest ub_txg=4254783 ub_timestamp=1233545538
>> JMR: vdev_uberblock_load_done ub_txg=4254779 ub_timestamp=1233545518
>> JMR: vdev_uberblock_load_done ubbest ub_txg=4254783 ub_timestamp=1233545538
>> JMR: vdev_uberblock_load_done ub_txg=4254778 ub_timestamp=1233545513
>> ...
>> JMR: vdev_uberblock_load_done ubbest ub_txg=4254783 ub_timestamp=1233545538
>>
>> So the uberblock with transaction group 4254783 was the most recent.
>> I convinced ZFS to use an earlier one with this patch (note the
>> second expression I added to the if statement):
>>
>> --- vdev_label.c.orig   2009-02-03 13:14:35.000000000 +0100
>> +++ vdev_label.c        2009-02-03 13:25:43.000000000 +0100
>> @@ -659,10 +659,12 @@
>>
>>         if (zio->io_error == 0 && uberblock_verify(ub) == 0) {
>>                 mutex_enter(&spa->spa_uberblock_lock);
>> -               if (vdev_uberblock_compare(ub, ubbest) > 0)
>> +               printf("JMR: vdev_uberblock_load_done ub_txg=%qd ub_timestamp=%qd\n", ub->ub_txg, ub->ub_timestamp);
>> +               if (vdev_uberblock_compare(ub, ubbest) > 0 && ub->ub_txg < 4254783)
>>                         *ubbest = *ub;
>>                 mutex_exit(&spa->spa_uberblock_lock);
>>         }
>> +       printf("JMR: vdev_uberblock_load_done ubbest ub_txg=%qd ub_timestamp=%qd\n", ubbest->ub_txg, ubbest->ub_timestamp);
>>
>>         zio_buf_free(zio->io_data, zio->io_size);
>> }
>>
>> After compiling and loading the zfs.ko module, I executed "zpool
>> import" and the pool was still faulted, so I decremented the txg
>> limit to "< 4254782", and this time the pool came up as ONLINE. After
>> crossing my fingers I executed "zpool import z1", and it worked fine:
>> no data loss, everything back to normal. The only curious thing I've
>> noticed is this:
>>
>> # zpool status
>>   pool: z1
>>  state: ONLINE
>> status: One or more devices could not be used because the label is
>>         missing or invalid. Sufficient replicas exist for the pool to
>>         continue functioning in a degraded state.
>> action: Replace the device using 'zpool replace'.
>>    see: http://www.sun.com/msg/ZFS-8000-4J
>>  scrub: resilver completed with 0 errors on Tue Feb  3 09:26:40 2009
>> config:
>>
>>         NAME                     STATE     READ WRITE CKSUM
>>         z1                       ONLINE       0     0     0
>>           raidz2                 ONLINE       0     0     0
>>             mirror/gm0           ONLINE       0     0     0
>>             mirror/gm1           ONLINE       0     0     0
>>             da2                  ONLINE       0     0     0
>>             da3                  ONLINE       0     0     0
>>             8076139616933977534  UNAVAIL      0     0     0  was /dev/da4
>>             da5                  ONLINE       0     0     0
>>             da6                  ONLINE       0     0     0
>>             da7                  ONLINE       0     0     0
>>
>> errors: No known data errors
>>
>> As you can see, the raidz2 vdev is marked ONLINE when I think it
>> should be DEGRADED. Nevertheless, the pool is readable and writable,
>> and so far I haven't detected any problems. To be safe, I am
>> extracting all the data and will recreate the pool from scratch, just
>> in case.
>>
>> Pending questions:
>>
>> 1) Why did the "supposed corruption" happen in the first place? I
>> advise people not to mix disks from different zpools with the same
>> name in the same computer. That's what I did, and maybe it's what
>> caused my problems.
>>
>> 2) Rolling back to an earlier uberblock seems to solve some faulted
>> zpool problems. I think it would be interesting to have a program
>> that lets you do it in a user-friendly way (after warning you about
>> the dangers, etc.). A rough read-only starting point for such a tool
>> is sketched at the end of this message.
>
> It would be interesting to see if the txid from all of your labels was
> the same. I would highly advise scrubbing your array.

I did a zdb -l on all the healthy disks, and all the labels (4 copies x
7 devices) were identical except for the "guid" field near the
beginning. That's the vdev's own guid, so I think it's normal for it to
differ from disk to disk. The txg field was identical in all of them.

> I believe the reason that your "da4" is showing up with only a uuid is
> because zfs is now recognizing that the da4 it sees is not the correct
> one. Still very curious how you ended up in that situation. I wonder
> if you had corruption that was unknown before you removed da4.

The current da4 definitely has nothing to do with the zpool: it
belonged to a different zpool at first, and later I zeroed its beginning
and end. The GUID listed in "zpool status" is the same one that appears
in the zpool labels for the old da4.

I don't recall seeing any corruption before, and I scrubbed the pool
from time to time. By the way, thinking about it again, the timestamp on
the most recent uberblock was 6:32 CET, which coincides with the time
the server froze, while the change of disks took place about 2-3 hours
later. So maybe the change of disks had nothing to do with all this
after all. The disks are connected to a RAID controller, although they
are exported in pass-through mode.
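
For readers following along: capping ub_txg forces a rollback because
ZFS simply keeps the "best" uberblock it finds across the labels, where
"best" means the highest transaction group, with the timestamp as a
tie-breaker. The standalone sketch below illustrates that selection; the
txg_cap variable is a hypothetical stand-in for the hard-coded
"ub->ub_txg < 4254782" test in the patch above, and the struct here is a
deliberately minimal imitation of the real uberblock, not the kernel
code itself.

#include <stdint.h>
#include <stdio.h>

/* Minimal stand-in for the on-disk uberblock fields used here. */
typedef struct uberblock {
        uint64_t ub_txg;        /* transaction group of last sync */
        uint64_t ub_timestamp;  /* time of last sync */
} uberblock_t;

/* Higher txg wins; on a tie, the later timestamp wins. */
static int
uberblock_compare(const uberblock_t *a, const uberblock_t *b)
{
        if (a->ub_txg != b->ub_txg)
                return (a->ub_txg < b->ub_txg ? -1 : 1);
        if (a->ub_timestamp != b->ub_timestamp)
                return (a->ub_timestamp < b->ub_timestamp ? -1 : 1);
        return (0);
}

int
main(void)
{
        /* Candidates as reported by the printf patch above. */
        uberblock_t candidates[] = {
                { 4254783, 1233545538 },   /* most recent, suspected bad */
                { 4254782, 1233545533 },
                { 4254781, 1233545528 },
        };
        uint64_t txg_cap = 4254782;        /* hypothetical rollback limit */
        uberblock_t best = { 0, 0 };

        for (int i = 0; i < 3; i++) {
                /* Skip anything at or above the cap, then keep the best. */
                if (candidates[i].ub_txg < txg_cap &&
                    uberblock_compare(&candidates[i], &best) > 0)
                        best = candidates[i];
        }
        printf("selected ub_txg=%llu ub_timestamp=%llu\n",
            (unsigned long long)best.ub_txg,
            (unsigned long long)best.ub_timestamp);
        return (0);
}

With the cap in place, the otherwise-winning txg 4254783 is skipped and
the next-newest intact uberblock is selected, which appears to be
exactly what made the import succeed here.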
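Regarding pending question 2, a user-friendly rollback tool would first
need to enumerate the uberblocks present in the labels. The sketch below
is a minimal, read-only starting point that lists candidate uberblocks
from the first label of a vdev. The layout it assumes (four 256 KB
labels per vdev, the last 128 KB of each label holding an array of 1 KB
uberblock slots, magic number 0x00bab10c, and the field order of the
uberblock header) is taken from the ondiskformat0822.pdf document cited
above and should be double-checked against it before trusting the
output; the program deliberately writes nothing.

/* ublist.c - list candidate uberblocks in the first vdev label.
 * Offsets and structure layout assumed from ondiskformat0822.pdf. */
#include <sys/types.h>
#include <fcntl.h>
#include <stdint.h>
#include <stdio.h>
#include <string.h>
#include <unistd.h>

#define LABEL_SIZE      (256 * 1024)
#define UB_ARRAY_OFF    (128 * 1024)    /* uberblock array within the label */
#define UB_SLOT_SIZE    1024            /* one uberblock slot */
#define UB_MAGIC        0x00bab10cULL   /* "oo-ba-bloc" */

/* Leading fields of the on-disk uberblock (native endianness assumed). */
struct ub_head {
        uint64_t ub_magic;
        uint64_t ub_version;
        uint64_t ub_txg;
        uint64_t ub_guid_sum;
        uint64_t ub_timestamp;
};

int
main(int argc, char **argv)
{
        if (argc != 2) {
                fprintf(stderr, "usage: %s <vdev device>\n", argv[0]);
                return (1);
        }
        int fd = open(argv[1], O_RDONLY);
        if (fd < 0) {
                perror("open");
                return (1);
        }
        unsigned char slot[UB_SLOT_SIZE];
        int nslots = (LABEL_SIZE - UB_ARRAY_OFF) / UB_SLOT_SIZE;
        for (int i = 0; i < nslots; i++) {
                off_t off = UB_ARRAY_OFF + (off_t)i * UB_SLOT_SIZE;
                if (pread(fd, slot, sizeof (slot), off) != (ssize_t)sizeof (slot))
                        break;
                struct ub_head ub;
                memcpy(&ub, slot, sizeof (ub));
                if (ub.ub_magic != UB_MAGIC)
                        continue;       /* empty or byte-swapped slot */
                printf("slot %3d  txg=%llu  timestamp=%llu\n", i,
                    (unsigned long long)ub.ub_txg,
                    (unsigned long long)ub.ub_timestamp);
        }
        close(fd);
        return (0);
}

Run against one of the member disks (e.g. da2) it should, in principle,
print the same txg values the printf patch showed. Actually activating
an older uberblock would still require something like the kernel patch
(or invalidating the newer slots), which is where the "warn about the
dangers" part of such a tool comes in.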