From owner-freebsd-fs@FreeBSD.ORG  Tue Feb  3 12:40:51 2009
Return-Path: <owner-freebsd-fs@FreeBSD.ORG>
Delivered-To: freebsd-fs@freebsd.org
Received: from mx1.freebsd.org (mx1.freebsd.org [IPv6:2001:4f8:fff6::34])
	by hub.freebsd.org (Postfix) with ESMTP id 3CC24106566C
	for <freebsd-fs@freebsd.org>; Tue,  3 Feb 2009 12:40:51 +0000 (UTC)
	(envelope-from jmrueda@diatel.upm.es)
Received: from aurora.diatel.upm.es (aurora.diatel.upm.es [138.100.49.70])
	by mx1.freebsd.org (Postfix) with ESMTP id BD3FE8FC1C
	for <freebsd-fs@freebsd.org>; Tue,  3 Feb 2009 12:40:50 +0000 (UTC)
	(envelope-from jmrueda@diatel.upm.es)
Received: from [172.29.9.10] ([172.29.9.10])
	by aurora.diatel.upm.es (8.14.3/8.14.3) with ESMTP id n13CeiHJ005877;
	Tue, 3 Feb 2009 13:40:45 +0100 (CET)
	(envelope-from jmrueda@diatel.upm.es)
Message-ID: <49883B45.3040606@diatel.upm.es>
Date: Tue, 03 Feb 2009 13:40:37 +0100
From: =?ISO-8859-1?Q?Javier_Mart=EDn_Rueda?= <jmrueda@diatel.upm.es>
User-Agent: Thunderbird 2.0.0.19 (Windows/20081209)
MIME-Version: 1.0
To: Wes Morgan <morganw@chemikals.org>, freebsd-fs@freebsd.org
References: <49879C62.6070509@diatel.upm.es>	<alpine.BSF.2.00.0902022155190.10729@ibyngvyr.purzvxnyf.bet>
	<4987ED81.6080008@diatel.upm.es>
In-Reply-To: <4987ED81.6080008@diatel.upm.es>
Content-Type: text/plain; charset=ISO-8859-1; format=flowed
Content-Transfer-Encoding: 7bit
Cc: 
Subject: Re: Raidz2 pool with single disk failure is faulted
X-BeenThere: freebsd-fs@freebsd.org
X-Mailman-Version: 2.1.5
Precedence: list
List-Id: Filesystems <freebsd-fs.freebsd.org>
List-Unsubscribe: <http://lists.freebsd.org/mailman/listinfo/freebsd-fs>,
	<mailto:freebsd-fs-request@freebsd.org?subject=unsubscribe>
List-Archive: <http://lists.freebsd.org/pipermail/freebsd-fs>
List-Post: <mailto:freebsd-fs@freebsd.org>
List-Help: <mailto:freebsd-fs-request@freebsd.org?subject=help>
List-Subscribe: <http://lists.freebsd.org/mailman/listinfo/freebsd-fs>,
	<mailto:freebsd-fs-request@freebsd.org?subject=subscribe>
X-List-Received-Date: Tue, 03 Feb 2009 12:40:51 -0000

I solved the problem. This is how I did it, in case one day it prevents 
somebody from jumping in front of a train :-)

First of all, I got some insight from various sites, mailing list
archives, documents, etc. These two were the most helpful:

http://mail.opensolaris.org/pipermail/zfs-discuss/2008-October/051643.html
http://opensolaris.org/os/community/zfs/docs/ondiskformat0822.pdf

I suspected that my uberblock was somehow corrupted, and thought it
would be worthwhile to roll back to an earlier uberblock. However, my
pool was raidz2 and the examples I had seen of how to do this were
for simple pools, so I tried a different approach, which in the end
proved very successful:

First, I added a couple of printf calls to vdev_uberblock_load_done(),
which is in /sys/cddl/contrib/opensolaris/uts/common/fs/zfs/vdev_label.c:

--- vdev_label.c.orig   2009-02-03 13:14:35.000000000 +0100
+++ vdev_label.c        2009-02-03 13:14:52.000000000 +0100
@@ -659,10 +659,12 @@

        if (zio->io_error == 0 && uberblock_verify(ub) == 0) {
                mutex_enter(&spa->spa_uberblock_lock);
+               printf("JMR: vdev_uberblock_load_done ub_txg=%qd ub_timestamp=%qd\n", ub->ub_txg, ub->ub_timestamp);
                if (vdev_uberblock_compare(ub, ubbest) > 0)
                        *ubbest = *ub;
                mutex_exit(&spa->spa_uberblock_lock);
        }
+       printf("JMR: vdev_uberblock_load_done ubbest ub_txg=%qd ub_timestamp=%qd\n", ubbest->ub_txg, ubbest->ub_timestamp);

        zio_buf_free(zio->io_data, zio->io_size);
 }

After compiling and loading the zfs.ko module, I executed "zpool import" 
and these messages came up:

...
JMR: vdev_uberblock_load_done ub_txg=4254783 ub_timestamp=1233545538
JMR: vdev_uberblock_load_done ubbest ub_txg=4254783 ub_timestamp=1233545538
JMR: vdev_uberblock_load_done ub_txg=4254782 ub_timestamp=1233545533
JMR: vdev_uberblock_load_done ubbest ub_txg=4254783 ub_timestamp=1233545538
JMR: vdev_uberblock_load_done ub_txg=4254781 ub_timestamp=1233545528
JMR: vdev_uberblock_load_done ubbest ub_txg=4254783 ub_timestamp=1233545538
JMR: vdev_uberblock_load_done ub_txg=4254780 ub_timestamp=1233545523
JMR: vdev_uberblock_load_done ubbest ub_txg=4254783 ub_timestamp=1233545538
JMR: vdev_uberblock_load_done ub_txg=4254779 ub_timestamp=1233545518
JMR: vdev_uberblock_load_done ubbest ub_txg=4254783 ub_timestamp=1233545538
JMR: vdev_uberblock_load_done ub_txg=4254778 ub_timestamp=1233545513
...
JMR: vdev_uberblock_load_done ubbest ub_txg=4254783 ub_timestamp=1233545538

So, the uberblock with transaction group 4254783 was the most recent. I 
convinced ZFS to use an earlier one with this patch (note the second 
expression I added to the if statement):

--- vdev_label.c.orig   2009-02-03 13:14:35.000000000 +0100
+++ vdev_label.c        2009-02-03 13:25:43.000000000 +0100
@@ -659,10 +659,12 @@

        if (zio->io_error == 0 && uberblock_verify(ub) == 0) {
                mutex_enter(&spa->spa_uberblock_lock);
-               if (vdev_uberblock_compare(ub, ubbest) > 0)
+               printf("JMR: vdev_uberblock_load_done ub_txg=%qd ub_timestamp=%qd\n", ub->ub_txg, ub->ub_timestamp);
+               if (vdev_uberblock_compare(ub, ubbest) > 0 && ub->ub_txg < 4254783)
                        *ubbest = *ub;
                mutex_exit(&spa->spa_uberblock_lock);
        }
+       printf("JMR: vdev_uberblock_load_done ubbest ub_txg=%qd ub_timestamp=%qd\n", ubbest->ub_txg, ubbest->ub_timestamp);

        zio_buf_free(zio->io_data, zio->io_size);
 }

After compiling and loading the zfs.ko module, I executed "zpool import"
and the pool was still faulted. So I lowered the txg limit to
"< 4254782", which makes ZFS pick the uberblock with txg 4254781, and
this time the pool came up as ONLINE. After crossing my fingers I
executed "zpool import z1", and it worked. No data loss, everything
back to normal. The only curious thing I've noticed is this:

# zpool status
  pool: z1
 state: ONLINE
status: One or more devices could not be used because the label is missing or
        invalid.  Sufficient replicas exist for the pool to continue
        functioning in a degraded state.
action: Replace the device using 'zpool replace'.
   see: http://www.sun.com/msg/ZFS-8000-4J
 scrub: resilver completed with 0 errors on Tue Feb  3 09:26:40 2009
config:

        NAME                     STATE     READ WRITE CKSUM
        z1                       ONLINE       0     0     0
          raidz2                 ONLINE       0     0     0
            mirror/gm0           ONLINE       0     0     0
            mirror/gm1           ONLINE       0     0     0
            da2                  ONLINE       0     0     0
            da3                  ONLINE       0     0     0
            8076139616933977534  UNAVAIL      0     0     0  was /dev/da4
            da5                  ONLINE       0     0     0
            da6                  ONLINE       0     0     0
            da7                  ONLINE       0     0     0

errors: No known data errors

As you can see, the raidz2 vdev is marked as ONLINE, when I think it
should be DEGRADED. Nevertheless, the pool is readable and writeable,
and so far I haven't detected any problem. To be safe, I am copying
all the data off and will recreate the pool from scratch.


Pending questions:

1) Why did the "supposed corruption" happen in the first place? I
advise people not to mix disks from different zpools with the same name 
in the same computer. That's what I did, and maybe it's what caused my 
problems.

2) Rolling back to an earlier uberblock seems to solve some faulted 
zpool problems. I think it would be interesting to have a program that 
lets you do it in a user-friendly way (after warning you about the
dangers, etc.).