From: Thomas Backman <serenity@exscape.org>
To: freebsd-current@freebsd.org
Date: Mon, 25 May 2009 11:13:31 +0200
Subject: Re: ZFS panic under extreme circumstances (2/3 disks corrupted)

On May 24, 2009, at 09:02 PM, Thomas Backman wrote:
> So, I was playing around with RAID-Z and self-healing...

Yet another follow-up to this. It appears that all traces of errors vanish after a reboot. So, say you have a dying disk; ZFS repairs the data for you, and you don't notice (unless you check zpool status). Then you reboot, and there's NO (easy?) way, as far as I can tell, to find out that something is wrong with your hardware!

[root@clone ~]# zpool status test
  pool: test
 state: ONLINE
status: One or more devices has experienced an unrecoverable error.  An
        attempt was made to correct the error.  Applications are unaffected.
action: Determine if the device needs to be replaced, and clear the errors
        using 'zpool clear' or replace the device with 'zpool replace'.
   see: http://www.sun.com/msg/ZFS-8000-9P
 scrub: scrub completed after 0h1m with 0 errors on Mon May 25 11:01:22 2009
config:

        NAME        STATE     READ WRITE CKSUM
        test        ONLINE       0     0     0
          raidz1    ONLINE       0     0     0
            da1     ONLINE       0     0     0
            da2     ONLINE       0     0     1  64K repaired
            da3     ONLINE       0     0     0

errors: No known data errors

----------- reboot -----------

[root@clone ~]# zpool status test
  pool: test
 state: ONLINE
 scrub: none requested
config:

        NAME        STATE     READ WRITE CKSUM
        test        ONLINE       0     0     0
          raidz1    ONLINE       0     0     0
            da1     ONLINE       0     0     0
            da2     ONLINE       0     0     0
            da3     ONLINE       0     0     0

errors: No known data errors
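So after the reboot the per-device counters are simply gone. The only workaround I can think of right now is to catch the non-zero counters while the pool is still up, before a reboot (or a 'zpool clear') wipes them. A rough, untested sketch - the awk field positions are just an assumption based on the status layout above, and the device-name pattern obviously depends on your disks:

[root@clone ~]# zpool status test | awk '$1 ~ /^(da|raidz|mirror)/ && ($3+$4+$5) > 0 { print $1 ": READ=" $3 " WRITE=" $4 " CKSUM=" $5 }'

Run against the pre-reboot output above, that should print a line for da2 showing CKSUM=1. But of course that only helps if you remember to run it.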
[root@clone ~]# zpool history -i test
# ... snip ...
# Below is the relevant output from the scrub that found the errors:
2009-05-25.11:00:21 [internal pool scrub txg:118] func=1 mintxg=0 maxtxg=118
2009-05-25.11:00:23 zpool scrub test
2009-05-25.11:01:22 [internal pool scrub done txg:120] complete=1

Nothing there to say that it found errors, right? If there is, it should be a lot clearer. Also, root should receive automatic mail when data corruption occurs, IMHO (see the rough cron sketch at the end of this mail).

[root@clone ~]# zpool scrub test
# Wait a while...
[root@clone ~]# zpool status test
  pool: test
 state: ONLINE
 scrub: scrub completed after 0h1m with 0 errors on Mon May 25 11:06:05 2009
config:

        NAME        STATE     READ WRITE CKSUM
        test        ONLINE       0     0     0
          raidz1    ONLINE       0     0     0
            da1     ONLINE       0     0     0
            da2     ONLINE       0     0     0
            da3     ONLINE       0     0     0

errors: No known data errors

I'm guessing this is the case in OpenSolaris as well...? In any case, it's BAD. Unless you keep checking zpool status over and over, you could have a disk "failing silently" - which defeats one of the major purposes of ZFS! Sure, auto-healing is nice, but it should tell you that it's happening, so that you can prepare to replace a disk (i.e. order a new one BEFORE it crashes bigtime).

Regards,
Thomas
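P.S. To illustrate what I mean by automatic mail to root, here's a quick, untested crontab sketch. The 15-minute interval, the subject line and the exact "all pools are healthy" string are my assumptions; if I remember correctly there's also a daily_status_zfs_enable knob for periodic(8) that puts 'zpool status -x' in the daily mail, but that's only once a day and still relies on the counters surviving until then.

# /etc/crontab entry (hypothetical): mail root whenever 'zpool status -x'
# reports anything other than "all pools are healthy"
*/15  *  *  *  *  root  st=$(zpool status -x); [ "$st" = "all pools are healthy" ] || echo "$st" | mail -s "zpool errors on $(hostname)" root

Something like that built in (or better, driven by the error event itself rather than polling) would go a long way.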