From: Thomas Backman
To: Freddie Cash
Cc: freebsd-current@freebsd.org
Date: Mon, 25 May 2009 18:12:05 +0200
Subject: Re: ZFS panic under extreme circumstances (2/3 disks corrupted)

On May 25, 2009, at 05:39 PM, Freddie Cash wrote:

> On Mon, May 25, 2009 at 2:13 AM, Thomas Backman wrote:
>> On May 24, 2009, at 09:02 PM, Thomas Backman wrote:
>>
>>> So, I was playing around with RAID-Z and self-healing...
>>
>> Yet another follow-up to this.
>> It appears that all traces of errors vanish after a reboot. So, say you
>> have a dying disk; ZFS repairs the data for you, and you don't notice
>> (unless you check zpool status). Then you reboot, and as far as I can
>> tell there is NO (easy?) way to find out that something is wrong with
>> your hardware!
>
> On our storage server, which was initially configured as one large
> 24-drive raidz2 vdev (don't do that, by the way), we had one drive go
> south. "zpool status" was full of errors, and the error counts survived
> reboots. Either that, or the drive was so bad that the counts started
> climbing again right after boot. After a week of fighting to get the
> replacement drive to resilver and join the vdev, we nuked the pool and
> re-created it with three raidz2 vdevs of eight drives each.
>
> (Un)fortunately, that was the only failure we've had so far, so we can't
> really confirm or deny the "error counts reset after reboot" behavior.

Was this on FreeBSD?

I have another unfortunate observation to add: after a reboot it is
impossible to tell *which* disk went bad, even if the pool is left
"uncleared" but otherwise "healed". zpool status simply says that a device
has failed, with no clue as to which one, since every device is listed as
"ONLINE"!

Regards,
Thomas
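P.S. For anyone else bitten by this, here is a minimal sketch of one way to
keep the evidence around before a reboot wipes it. The pool name "tank" and
the log path are only placeholders I made up, not something from this
thread; adjust them to your own setup.

    #!/bin/sh
    # Keep a copy of the per-device READ/WRITE/CKSUM counters so they
    # survive the reboot that would otherwise reset them.
    # "tank" and the log path below are placeholder names.
    LOG=/var/log/zpool-status-history.log
    date >> "$LOG"
    zpool status -v tank >> "$LOG"
    zpool status -x            # quick check: reports only unhealthy pools

    # After rebooting, "zpool scrub tank" re-reads all data in the pool,
    # so a disk that is still corrupting data should show fresh checksum
    # errors in a subsequent "zpool status".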