Date:        Sun, 21 Jun 2015 20:28:27 -0400
From:        Quartz <quartz@sneakertech.com>
To:          Willem Jan Withagen <wjw@digiware.nl>
Cc:          freebsd-fs@freebsd.org
Subject:     Re: This diskfailure should not panic a system, but just disconnect disk from ZFS
Message-ID:  <558756AB.405@sneakertech.com>
In-Reply-To: <55874772.4090607@digiware.nl>
References:  <5585767B.4000206@digiware.nl> <558590BD.40603@isletech.net> <5586C396.9010100@digiware.nl> <55871F4C.5010103@sneakertech.com> <55874772.4090607@digiware.nl>
> But especially the hung disk during reading

Writing is the bigger issue. At least, if you set your failmode to
'continue', ZFS will try to honor reads as long as it's able, but
writes will block. (In practice though it'll usually only give you an
extra minute or so before everything locks up.)

> Well, the pool did not die, (at least not IMHO)

Sorry, that's bad wording on my part. What I meant was that IO to the
pool died.

> just one disk stopped working....

It would have to be 3+ disks in your case, with a raidz2.

> I guess that if I like to live dangerously, I could set enabled to 0,
> and run the risk... ??

Well, that will just disable the auto panic. If the IO disappeared
into a black hole due to a hardware issue, the machine will just stay
hung forever until you manually press the reset button on the front.
ZFS will prevent any major corruption of the pool, so it's not really
"dangerous" (outside of further hardware failures).

> But still I would expect the volume to become degraded if one of the
> disks goes into the error state?

If *one* of the disks drops out, yes. If a second drops out later,
also yes, because ZFS can still handle IO to the pool. But as soon as
that third disk drops out in a way that locks up IO, ZFS freezes.

For reference, I had a raidz2 test case with 6 drives. I could yank
the SATA cable off two of the drives and the pool would just be marked
as degraded, but as soon as I yanked that third drive everything
froze. This is why I heavily suspect that in your case your controller
or PSU is failing and dropping multiple disks at a time. The fact that
the log reports da0 is probably just because that was the last disk
ZFS tried to fall back on when they all dropped out at once.

Ideally, the system *should* handle this situation gracefully, but the
reality is that it doesn't. If the last disk fails in a way that hangs
IO, it takes the whole machine with it. No system configuration change
can prevent this, not with how things are currently designed.

> This article is mainly about forecasting disk failure based on SMART
> numbers....
> I was just looking at the counters to see if the disk had logged any
> info/warning/error at all

What Google found is that a lot of disks *don't* report errors or
warnings before experiencing problems. In other words, SMART saying
"all good" doesn't really mean much in practice, so you shouldn't rely
on it for diagnostics.
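If you want to eyeball the counters anyway, something along these
lines will dump them (smartctl comes from the sysutils/smartmontools
port; da0 as in your logs, though some controllers may need an extra
-d option):

    smartctl -H /dev/da0        # overall health verdict
    smartctl -a /dev/da0        # full attribute and error-log dump
    smartctl -l error /dev/da0  # just the drive's internal error log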
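For completeness, the knobs being discussed above look roughly like
this ('tank' is just a placeholder pool name, and I'm going from
memory on the exact deadman sysctl name, so it may be spelled
differently on your FreeBSD version):

    zpool get failmode tank            # wait | continue | panic
    zpool set failmode=continue tank   # keep honoring reads while it can

    sysctl vfs.zfs.deadman_enabled     # the "enabled" knob in question
    sysctl vfs.zfs.deadman_enabled=0   # disables the automatic panic

    zpool status -x                    # quick check for unhealthy pools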