From owner-freebsd-fs@FreeBSD.ORG Thu Mar 4 08:12:22 2010
Message-Id: <15662C97-CCB2-480A-838A-22EFF2922210@hessmann.de>
From: Christian Heßmann <spam_zfs@hessmann.de>
To: freebsd-fs@freebsd.org
Date: Thu, 4 Mar 2010 08:50:15 +0100
Subject: ZFS RAID: Disk fails while replacing another disk

Hello guys,

first, I have to apologize to the people who've already read this on the FreeBSD web forum or the OpenSolaris ZFS mailing list, but I only just heard about this list from Bob on the OpenSolaris ZFS ML and thought I'd give it another go here before doing anything drastic.

I have a ZFS pool made up of two 3-disk RAID-Z1 vdevs which I recently moved from OS X to FreeBSD (8-STABLE). One hard disk failed last weekend with lots of shouting, SMART messages and even a kernel panic. I attached a new disk and started the replacement.
Unfortunately, about 20% into the replacement, a second disk in the same RAID-Z1 started misbehaving and gave me read errors. The resilver did finish, though, and left me with only three broken files according to zpool status:

  [root@camelot /]# zpool status -v tank
    pool: tank
   state: DEGRADED
  status: One or more devices has experienced an error resulting in data
          corruption.  Applications may be affected.
  action: Restore the file in question if possible.  Otherwise restore the
          entire pool from backup.
     see: http://www.sun.com/msg/ZFS-8000-8A
   scrub: resilver completed after 10h42m with 136 errors on Tue Mar  2 07:55:05 2010
  config:

          NAME             STATE     READ WRITE CKSUM
          tank             DEGRADED   137     0     0
            raidz1         ONLINE       0     0     0
              ad17p2       ONLINE       0     0     0
              ad18p2       ONLINE       0     0     0
              ad20p2       ONLINE       0     0     0
            raidz1         DEGRADED   326     0     0
              replacing    DEGRADED     0     0     0
                ad16p2     OFFLINE      2  169K     6
                ad4p2      ONLINE       0     0     0  839G resilvered
              ad14p2       ONLINE       0     0     0  5.33G resilvered
              ad15p2       ONLINE     418     0     0  5.33G resilvered

  errors: Permanent errors have been detected in the following files:

          tank/DVD:<0x9cd>
          tank/DVD@20100222225100:/Memento.m4v
          tank/DVD@20100222225100:/Payback.m4v
          tank/DVD@20100222225100:/TheManWhoWasntThere.m4v

I have the feeling the problems on ad15p2 are a cable issue: the drive has no SMART errors, is quite new (3 months old) and was, IMHO, sufficiently "burned in" by repeatedly filling it to the brim and verifying the contents (via ZFS). So I'd like to switch the server off, replace the cable and do a scrub afterwards to make sure it doesn't produce additional errors.

Unfortunately, although it says the resilver completed, I can't detach ad16p2 (the first faulted disk) from the pool:

  [root@camelot /]# zpool detach tank ad16p2
  cannot detach ad16p2: no valid replicas

I know it still says 'replacing', but there is no activity on the drives, so I have to assume the replace either stopped and can't restart, or finished but somehow doesn't know it.
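In case it clarifies what I mean, here is roughly the sequence I was thinking of trying before touching the hardware; the pool and device names are the ones from my status output above, and I'm not at all sure this is the right order, so corrections are very welcome:

```shell
# Sketch only -- this is what I *intend* to try, not something I know works.

# Reset the pool's error counters (doesn't repair anything, just clears state):
zpool clear tank

# See whether the "replacing" vdev has resolved itself after the clear:
zpool status -v tank

# Retry detaching the old, faulted disk from the replacing vdev:
zpool detach tank ad16p2

# Only once that succeeds: power down, reseat the cable on ad15,
# bring the box back up, and verify with a scrub:
zpool scrub tank
```

The part I'm least sure about is whether `zpool clear` is safe to run while the pool still shows the half-finished replace.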
To be honest, I don't know how to proceed now. The system feels very fragile, with a replacement not yet finished and errors on two drives in one RAID-Z1. I deleted the affected files, but I have about 20 snapshots of this filesystem, and since the files are quite old, I think they're in most of them.

So, what should I do now? Delete all snapshots? Move all the other files from this filesystem to a new one and destroy the old filesystem? Try to export and import the pool? Is it even safe to reboot the machine right now?

So far, the answers I've got are either "booting is the last resort, be careful" or (especially from the FreeBSD side) "yes, boot it and scrub it, that's the usual way we do things". I think what I'm looking for here is that no one shouts out: STOP, DON'T REBOOT!

Yes, I have copied the data to other disks and will do so again (at least the important files; unfortunately I don't have 7 TB of spare disks lying around), and I won't blame ZFS for doing what it's supposed to do, but it would be nice if I didn't have to start from scratch. Although then I would be a lot smarter and go for RAID-Z2, believe me...

Thank you.

Christian