From: Alex Trull <alextzfs@googlemail.com>
To: freebsd-fs@freebsd.org
Cc: pjd@freebsd.org
Date: Sun, 11 Oct 2009 15:41:51 +0100
Subject: zraid2 loses a single disk and becomes difficult to recover

Hi All,

My raidz2 pool broke this morning on RELENG_7 with ZFS v13: the system
failed and came back up without the pool, having lost a disk. This is how
I found it:

  pool: fatman
 state: FAULTED
status: One or more devices could not be used because the label is missing
        or invalid. There are insufficient replicas for the pool to continue
        functioning.
action: Destroy and re-create the pool from a backup source.
   see: http://www.sun.com/msg/ZFS-8000-5E
 scrub: none requested
config:

        NAME        STATE     READ WRITE CKSUM
        fatman      FAULTED      0     0     1  corrupted data
          raidz2    DEGRADED     0     0     6
            da2     FAULTED      0     0     0  corrupted data
            ad4     ONLINE       0     0     0
            ad6     ONLINE       0     0     0
            ad20    ONLINE       0     0     0
            ad22    ONLINE       0     0     0
            ad17    ONLINE       0     0     0
            da2     ONLINE       0     0     0
            ad10    ONLINE       0     0     0
            ad16    ONLINE       0     0     0

Initially it complained that da3 had moved to da2 (the original da2 had
failed and was no longer seen). I replaced the original da2 and bumped what
was originally da3 back up to da3 using the controller's ordering.

[root@potjie /dev]# zpool status
  pool: fatman
 state: FAULTED
status: One or more devices could not be used because the label is missing
        or invalid. There are insufficient replicas for the pool to continue
        functioning.
action: Destroy and re-create the pool from a backup source.
   see: http://www.sun.com/msg/ZFS-8000-5E
 scrub: none requested
config:

        NAME        STATE     READ WRITE CKSUM
        fatman      FAULTED      0     0     1  corrupted data
          raidz2    ONLINE       0     0     6
            da2     UNAVAIL      0     0     0  corrupted data
            ad4     ONLINE       0     0     0
            ad6     ONLINE       0     0     0
            ad20    ONLINE       0     0     0
            ad22    ONLINE       0     0     0
            ad17    ONLINE       0     0     0
            da3     ONLINE       0     0     0
            ad10    ONLINE       0     0     0
            ad16    ONLINE       0     0     0
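For reference, the path I would normally take for a single dead raidz2
member is just the stock zpool commands below (a sketch only - it assumes
the pool can actually be imported, which is exactly the part that is
failing here):

  # force the import in case the pool thinks another host still owns it
  zpool import -f fatman

  # resilver onto the new disk sitting in the old da2 slot
  zpool replace fatman da2

  # watch the resilver and the CKSUM counters
  zpool status -v fatman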
The issue looks very similar to this one (JMR's issue):

http://freebsd.monkey.org/freebsd-fs/200902/msg00017.html

I've tried the methods there without much result.

Using JMR's patches/debugs to see what is going on, this is what I got:

JMR: vdev_uberblock_load_done ubbest ub_txg=46488653 ub_timestamp=1255246834
JMR: vdev_uberblock_load_done ub_txg=46475459 ub_timestamp=1255231780
JMR: vdev_uberblock_load_done ubbest ub_txg=46488653 ub_timestamp=1255246834
JMR: vdev_uberblock_load_done ub_txg=46475458 ub_timestamp=1255231750
JMR: vdev_uberblock_load_done ubbest ub_txg=46488653 ub_timestamp=1255246834
JMR: vdev_uberblock_load_done ub_txg=46481473 ub_timestamp=1255234263
JMR: vdev_uberblock_load_done ubbest ub_txg=46488653 ub_timestamp=1255246834
JMR: vdev_uberblock_load_done ub_txg=46481472 ub_timestamp=1255234263
JMR: vdev_uberblock_load_done ubbest ub_txg=46488653 ub_timestamp=1255246834

But JMR's patch still doesn't let me import, even with a decremented txg.

I then had a look around the drives using zdb and a dirty script:

[root@potjie /dev]# ls /dev/ad* /dev/da2 /dev/da3 | awk '{print "echo "$1";zdb -l "$1" |grep txg"}' | sh
/dev/ad10
txg=46488654
txg=46488654
txg=46488654
txg=46488654
/dev/ad16
txg=46408223     <- old txg id?
txg=46408223
txg=46408223
txg=46408223
/dev/ad17
txg=46408223     <- old txg id?
txg=46408223
txg=46408223
txg=46408223
/dev/ad18 (ssd)
/dev/ad19 (spare drive, removed from pool some time ago)
txg=0
create_txg=0
txg=0
create_txg=0
txg=0
create_txg=0
txg=0
create_txg=0
/dev/ad20
txg=46488654
txg=46488654
txg=46488654
txg=46488654
/dev/ad22
txg=46488654
txg=46488654
txg=46488654
txg=46488654
/dev/ad4
txg=46488654
txg=46488654
txg=46488654
txg=46488654
/dev/ad6
txg=46488654
txg=46488654
txg=46488654
txg=46488654
/dev/da2     <- new drive, replaced the broken da2
/dev/da3
txg=46488654
txg=46488654
txg=46488654
txg=46488654

I did not see any checksum errors or other issues on ad16 and ad17
previously, and I do check regularly.

Any thoughts on what to try next?

Regards,

Alex
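P.S. In case it helps anyone reproduce the label check, here is a slightly
tidier version of the dirty one-liner above (same idea, just a plain sh loop
over this pool's members; it assumes zdb -l prints the txg= lines the way it
does above):

  #!/bin/sh
  # Print the txg recorded in each vdev label, so the stale members
  # (ad16/ad17 here) stand out next to the current txg=46488654.
  for dev in /dev/ad4 /dev/ad6 /dev/ad10 /dev/ad16 /dev/ad17 \
             /dev/ad20 /dev/ad22 /dev/da2 /dev/da3; do
          echo "== ${dev}"
          zdb -l "${dev}" | grep txg || echo "   (no label found)"
  done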