From: Thomas Backman <serenity@exscape.org>
To: freebsd-current@freebsd.org
Date: Mon, 25 May 2009 11:13:31 +0200
Subject: Re: ZFS panic under extreme circumstances (2/3 disks corrupted)

On May 24, 2009, at 09:02 PM, Thomas Backman wrote:
> So, I was playing around with RAID-Z and self-healing...

Yet another follow-up to this. It appears that all traces of errors vanish after a reboot. So, say you have a dying disk; ZFS repairs the data for you, and you don't notice (unless you check zpool status). Then you reboot, and there's NO (easy?) way, as far as I can tell, to find out that something is wrong with your hardware!

[root@clone ~]# zpool status test
  pool: test
 state: ONLINE
status: One or more devices has experienced an unrecoverable error.  An
        attempt was made to correct the error.  Applications are unaffected.
action: Determine if the device needs to be replaced, and clear the errors
        using 'zpool clear' or replace the device with 'zpool replace'.
   see: http://www.sun.com/msg/ZFS-8000-9P
 scrub: scrub completed after 0h1m with 0 errors on Mon May 25 11:01:22 2009
config:

        NAME        STATE     READ WRITE CKSUM
        test        ONLINE       0     0     0
          raidz1    ONLINE       0     0     0
            da1     ONLINE       0     0     0
            da2     ONLINE       0     0     1  64K repaired
            da3     ONLINE       0     0     0

errors: No known data errors

----------- reboot -----------

[root@clone ~]# zpool status test
  pool: test
 state: ONLINE
 scrub: none requested
config:

        NAME        STATE     READ WRITE CKSUM
        test        ONLINE       0     0     0
          raidz1    ONLINE       0     0     0
            da1     ONLINE       0     0     0
            da2     ONLINE       0     0     0
            da3     ONLINE       0     0     0

errors: No known data errors
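So after the reboot the per-device counters are simply gone. The only workaround I can think of right now is to catch the non-zero counters while the pool is still up, before a reboot (or a 'zpool clear') wipes them. A rough, untested sketch - the awk field positions are just an assumption based on the status layout above, and the device-name pattern obviously depends on your disks:

[root@clone ~]# zpool status test | awk '$1 ~ /^(da|raidz|mirror)/ && ($3+$4+$5) > 0 { print $1 ": READ=" $3 " WRITE=" $4 " CKSUM=" $5 }'

Run against the pre-reboot output above, that should print a line for da2 showing CKSUM=1. But of course that only helps if you remember to run it.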
[root@clone ~]# zpool history -i test
# ... snip ...
# Below is the relevant output from the scrub that found the errors:
2009-05-25.11:00:21 [internal pool scrub txg:118] func=1 mintxg=0 maxtxg=118
2009-05-25.11:00:23 zpool scrub test
2009-05-25.11:01:22 [internal pool scrub done txg:120] complete=1

Nothing there to say that it found errors, right? If there is, it should be a lot clearer. Also, root should receive automatic mail when data corruption occurs, IMHO (see the rough cron sketch at the end of this mail).

[root@clone ~]# zpool scrub test
# Wait a while...
[root@clone ~]# zpool status test
  pool: test
 state: ONLINE
 scrub: scrub completed after 0h1m with 0 errors on Mon May 25 11:06:05 2009
config:

        NAME        STATE     READ WRITE CKSUM
        test        ONLINE       0     0     0
          raidz1    ONLINE       0     0     0
            da1     ONLINE       0     0     0
            da2     ONLINE       0     0     0
            da3     ONLINE       0     0     0

errors: No known data errors

I'm guessing this is the case in OpenSolaris as well...? In any case, it's BAD. Unless you keep checking zpool status over and over, you could have a disk "failing silently" - which defeats one of the major purposes of ZFS! Sure, auto-healing is nice, but it should tell you that it's happening, so that you can prepare to replace a disk (i.e. order a new one BEFORE it crashes bigtime).

Regards,
Thomas
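P.S. To illustrate what I mean by automatic mail to root, here's a quick, untested crontab sketch. The 15-minute interval, the subject line and the exact "all pools are healthy" string are my assumptions; if I remember correctly there's also a daily_status_zfs_enable knob for periodic(8) that puts 'zpool status -x' in the daily mail, but that's only once a day and still relies on the counters surviving until then.

# /etc/crontab entry (hypothetical): mail root whenever 'zpool status -x'
# reports anything other than "all pools are healthy"
*/15  *  *  *  *  root  st=$(zpool status -x); [ "$st" = "all pools are healthy" ] || echo "$st" | mail -s "zpool errors on $(hostname)" root

Something like that built in (or better, driven by the error event itself rather than polling) would go a long way.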