Date: Mon, 25 May 2009 11:13:31 +0200
From: Thomas Backman <serenity@exscape.org>
To: freebsd-current@freebsd.org
Subject: Re: ZFS panic under extreme circumstances (2/3 disks corrupted)
Message-ID: <D98FEABB-8B8A-48E6-B021-B05816B4C699@exscape.org>
In-Reply-To: <4E6E325D-BB18-4478-BCFD-633D6F4CFD88@exscape.org>
References: <4E6E325D-BB18-4478-BCFD-633D6F4CFD88@exscape.org>
On May 24, 2009, at 09:02 PM, Thomas Backman wrote:
> So, I was playing around with RAID-Z and self-healing...
Yet another follow-up to this.
It appears that all traces of errors vanish after a reboot. So, say
you have a dying disk; ZFS repairs the data for you, and you don't
notice (unless you check zpool status). Then you reboot, and there's
NO (easy?) way that I can tell to find out that something is wrong
with your hardware!
[root@clone ~]# zpool status test
  pool: test
 state: ONLINE
status: One or more devices has experienced an unrecoverable error. An
        attempt was made to correct the error. Applications are unaffected.
action: Determine if the device needs to be replaced, and clear the errors
        using 'zpool clear' or replace the device with 'zpool replace'.
   see: http://www.sun.com/msg/ZFS-8000-9P
 scrub: scrub completed after 0h1m with 0 errors on Mon May 25 11:01:22 2009
config:
        NAME        STATE     READ WRITE CKSUM
        test        ONLINE       0     0     0
          raidz1    ONLINE       0     0     0
            da1     ONLINE       0     0     0
            da2     ONLINE       0     0     1  64K repaired
            da3     ONLINE       0     0     0
errors: No known data errors
----------- reboot -----------
[root@clone ~]# zpool status test
  pool: test
 state: ONLINE
 scrub: none requested
config:
        NAME        STATE     READ WRITE CKSUM
        test        ONLINE       0     0     0
          raidz1    ONLINE       0     0     0
            da1     ONLINE       0     0     0
            da2     ONLINE       0     0     0
            da3     ONLINE       0     0     0
errors: No known data errors
[root@clone ~]# zpool history -i test
# ... snip ...
# Below is the relevant output from the scrub that found the errors:
2009-05-25.11:00:21 [internal pool scrub txg:118] func=1 mintxg=0 maxtxg=118
2009-05-25.11:00:23 zpool scrub test
2009-05-25.11:01:22 [internal pool scrub done txg:120] complete=1
Nothing there to say that it found errors, right? If there is, it
should be a lot more clear. Also, root should receive automatic mails
when data corruption occurs IMHO.
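To make the "automatic mail" idea concrete, here is a minimal sketch of the detection half: it scans the text of a zpool status report and returns every row of the config table whose READ, WRITE or CKSUM counter is nonzero. The function name find_device_errors and the row pattern are my own assumptions, not anything ZFS ships; it only understands plain integer counters as in the output above, and fetching the text and sending the mail are left out.

```python
import re

# One row of the "config:" table: NAME STATE READ WRITE CKSUM [trailing note].
# Assumption: counters are plain integers, as in the output quoted above.
ROW = re.compile(
    r"^\s*(\S+)\s+([A-Z]+)\s+(\d+)\s+(\d+)\s+(\d+)(?:\s+(.*\S))?\s*$"
)


def find_device_errors(status_text):
    """Return (name, read, write, cksum, note) for each row with a nonzero counter."""
    findings = []
    for line in status_text.splitlines():
        match = ROW.match(line)
        if not match:
            continue  # header lines, "pool:", "state:", etc. do not match
        name, _state, read, write, cksum, note = match.groups()
        counts = (int(read), int(write), int(cksum))
        if any(counts):
            findings.append((name, *counts, note or ""))
    return findings


# Config table copied from the first status output in this message.
SAMPLE = """\
NAME STATE READ WRITE CKSUM
test ONLINE 0 0 0
raidz1 ONLINE 0 0 0
da1 ONLINE 0 0 0
da2 ONLINE 0 0 1 64K repaired
da3 ONLINE 0 0 0
"""

print(find_device_errors(SAMPLE))
# prints [('da2', 0, 0, 1, '64K repaired')]
```

A periodic job could run this on fresh zpool status output and mail root whenever the returned list is non-empty.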
[root@clone ~]# zpool scrub test
# Wait a while...
[root@clone ~]# zpool status test
  pool: test
 state: ONLINE
 scrub: scrub completed after 0h1m with 0 errors on Mon May 25 11:06:05 2009
config:
        NAME        STATE     READ WRITE CKSUM
        test        ONLINE       0     0     0
          raidz1    ONLINE       0     0     0
            da1     ONLINE       0     0     0
            da2     ONLINE       0     0     0
            da3     ONLINE       0     0     0
errors: No known data errors
I'm guessing this is the case in OpenSolaris as well...? In any case,
it's BAD. Unless you keep checking zpool status over and over, you
could have a disk "failing silently" - which defeats one of the major
purposes of ZFS! Sure, auto-healing is nice, but it should tell you
that it's happening, so that you can prepare to replace a disk (i.e.
order a new one BEFORE it crashes big time).
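Since the counters are gone after a reboot, one workaround is to copy them somewhere that survives one. The sketch below appends a timestamped line per affected device to a plain log file; append_findings, the log format and the tuple layout (name, read, write, cksum) are all hypothetical choices of mine, and how the counters get collected is outside its scope.

```python
import time


def append_findings(log_path, pool, findings, now=None):
    """Append one timestamped line per finding; return how many lines were written.

    findings is an iterable of (device, read, write, cksum) tuples.
    now is an optional Unix timestamp; the current time is used if omitted.
    """
    stamp = time.strftime("%Y-%m-%d %H:%M:%S", time.localtime(now))
    written = 0
    # Append mode keeps earlier records, so the history survives reboots.
    with open(log_path, "a") as log:
        for device, read, write, cksum in findings:
            log.write(
                f"{stamp} pool={pool} dev={device} "
                f"read={read} write={write} cksum={cksum}\n"
            )
            written += 1
    return written
```

Called with ("da2", 0, 0, 1) for pool "test", this leaves a record ending in "pool=test dev=da2 read=0 write=0 cksum=1" that is still there after the next boot, even though zpool status has forgotten it.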
Regards,
Thomas
