From: Alex Trull <alextzfs@googlemail.com>
To: freebsd-fs@freebsd.org
Cc: pjd@freebsd.org
Date: Sun, 11 Oct 2009 15:41:51 +0100
Subject: zraid2 loses a single disk and becomes difficult to recover

Hi All,

My raidz2 pool broke this morning on RELENG_7 with ZFS v13: the system
failed and came back up without the pool, having lost a disk. This is how
I found it:

  pool: fatman
 state: FAULTED
status: One or more devices could not be used because the label is missing
        or invalid. There are insufficient replicas for the pool to continue
        functioning.
action: Destroy and re-create the pool from a backup source.
   see: http://www.sun.com/msg/ZFS-8000-5E
 scrub: none requested
config:

        NAME        STATE     READ WRITE CKSUM
        fatman      FAULTED      0     0     1  corrupted data
          raidz2    DEGRADED     0     0     6
            da2     FAULTED      0     0     0  corrupted data
            ad4     ONLINE       0     0     0
            ad6     ONLINE       0     0     0
            ad20    ONLINE       0     0     0
            ad22    ONLINE       0     0     0
            ad17    ONLINE       0     0     0
            da2     ONLINE       0     0     0
            ad10    ONLINE       0     0     0
            ad16    ONLINE       0     0     0

Initially it complained that da3 had moved to da2 (the original da2 had
failed and was no longer seen). I replaced the original da2 and bumped what
was originally da3 back up to da3 using the controller's ordering.

[root@potjie /dev]# zpool status
  pool: fatman
 state: FAULTED
status: One or more devices could not be used because the label is missing
        or invalid. There are insufficient replicas for the pool to continue
        functioning.
action: Destroy and re-create the pool from a backup source.
   see: http://www.sun.com/msg/ZFS-8000-5E
 scrub: none requested
config:

        NAME        STATE     READ WRITE CKSUM
        fatman      FAULTED      0     0     1  corrupted data
          raidz2    ONLINE       0     0     6
            da2     UNAVAIL      0     0     0  corrupted data
            ad4     ONLINE       0     0     0
            ad6     ONLINE       0     0     0
            ad20    ONLINE       0     0     0
            ad22    ONLINE       0     0     0
            ad17    ONLINE       0     0     0
            da3     ONLINE       0     0     0
            ad10    ONLINE       0     0     0
            ad16    ONLINE       0     0     0
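For reference, the path I would normally take for a single dead raidz2
member is just the stock zpool commands below (a sketch only - it assumes
the pool can actually be imported, which is exactly the part that is
failing here):

  # force the import in case the pool thinks another host still owns it
  zpool import -f fatman

  # resilver onto the new disk sitting in the old da2 slot
  zpool replace fatman da2

  # watch the resilver and the CKSUM counters
  zpool status -v fatman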
The issue looks very similar to this one (JMR's issue):

http://freebsd.monkey.org/freebsd-fs/200902/msg00017.html

I've tried the methods there without much result.

Using JMR's patches/debugs to see what is going on, this is what I got:

JMR: vdev_uberblock_load_done ubbest ub_txg=46488653 ub_timestamp=1255246834
JMR: vdev_uberblock_load_done ub_txg=46475459 ub_timestamp=1255231780
JMR: vdev_uberblock_load_done ubbest ub_txg=46488653 ub_timestamp=1255246834
JMR: vdev_uberblock_load_done ub_txg=46475458 ub_timestamp=1255231750
JMR: vdev_uberblock_load_done ubbest ub_txg=46488653 ub_timestamp=1255246834
JMR: vdev_uberblock_load_done ub_txg=46481473 ub_timestamp=1255234263
JMR: vdev_uberblock_load_done ubbest ub_txg=46488653 ub_timestamp=1255246834
JMR: vdev_uberblock_load_done ub_txg=46481472 ub_timestamp=1255234263
JMR: vdev_uberblock_load_done ubbest ub_txg=46488653 ub_timestamp=1255246834

But JMR's patch still doesn't let me import, even with a decremented txg.

I then had a look around the drives using zdb and a dirty script:

[root@potjie /dev]# ls /dev/ad* /dev/da2 /dev/da3 | awk '{print "echo "$1";zdb -l "$1" |grep txg"}' | sh
/dev/ad10
txg=46488654
txg=46488654
txg=46488654
txg=46488654
/dev/ad16
txg=46408223     <- old txg id?
txg=46408223
txg=46408223
txg=46408223
/dev/ad17
txg=46408223     <- old txg id?
txg=46408223
txg=46408223
txg=46408223
/dev/ad18 (ssd)
/dev/ad19 (spare drive, removed from pool some time ago)
txg=0
create_txg=0
txg=0
create_txg=0
txg=0
create_txg=0
txg=0
create_txg=0
/dev/ad20
txg=46488654
txg=46488654
txg=46488654
txg=46488654
/dev/ad22
txg=46488654
txg=46488654
txg=46488654
txg=46488654
/dev/ad4
txg=46488654
txg=46488654
txg=46488654
txg=46488654
/dev/ad6
txg=46488654
txg=46488654
txg=46488654
txg=46488654
/dev/da2     <- new drive, replaced the broken da2
/dev/da3
txg=46488654
txg=46488654
txg=46488654
txg=46488654

I did not see any checksum errors or other issues on ad16 and ad17
previously, and I do check regularly.

Any thoughts on what to try next?

Regards,

Alex
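P.S. In case it helps anyone reproduce the label check, here is a slightly
tidier version of the dirty one-liner above (same idea, just a plain sh loop
over this pool's members; it assumes zdb -l prints the txg= lines the way it
does above):

  #!/bin/sh
  # Print the txg recorded in each vdev label, so the stale members
  # (ad16/ad17 here) stand out next to the current txg=46488654.
  for dev in /dev/ad4 /dev/ad6 /dev/ad10 /dev/ad16 /dev/ad17 \
             /dev/ad20 /dev/ad22 /dev/da2 /dev/da3; do
          echo "== ${dev}"
          zdb -l "${dev}" | grep txg || echo "   (no label found)"
  done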