Date: Mon, 22 Jun 2015 14:30:04 +0200
From: Willem Jan Withagen <wjw@digiware.nl>
To: Quartz <quartz@sneakertech.com>, Michelle Sullivan <michelle@sorbs.net>
Cc: fs@freebsd.org
Subject: Re: This diskfailure should not panic a system, but just disconnect disk from ZFS
Message-ID: <5587FFCC.3080100@digiware.nl>
In-Reply-To: <55877393.3040704@sneakertech.com>
References: <5585767B.4000206@digiware.nl> <5587236A.6020404@sneakertech.com> <558769B5.601@sorbs.net> <55877393.3040704@sneakertech.com>

On 22/06/2015 04:31, Quartz wrote:
>>> You have a raidz2, which means THREE disks need to go down before the
>>> pool is unwritable. The problem is most likely your controller or
>>> power supply, not your disks.
>>>
>> Never make such assumptions...
>>
>> I have worked in a professional environment where 9 of 12 disks failed
>> within 24 hours of each other....
>
> Right... but if that was his problem there should be some logs of the
> other drives going down first, and typically ZFS would correctly mark
> the pool as degraded (at least, it would in my testing). The fact that
> ZFS didn't get a chance to log anything and the pool came back up
> healthy leads me to believe the controller went south, taking several
> disks with it all at once and totally borking all IO. (Either that or
> what Tom Curry mentioned about the Arc issue, which I wasn't previously
> aware of).
>
> Of course, if the issue isn't repeatable then who knows....

I do not think it was a full-out failure, but just one transaction that
got hit by an alpha particle...

Well, remember that the hung-diagnostics timeout is 1000 sec. In the
time span before the panic, nothing else was logged about
disks/controllers/etc. not functioning. Only in the few seconds before
the panic did ctl/iSCSI and the network interface start complaining
that there was a memory shortage, and the network interface started
dumping packets. But all that was logged really nicely in syslog.

So I think that in the 1000 sec it took for the deadman switch to
trigger, the zpool just functioned as expected, and the hardware
somewhere lost one transaction.

So I'll be crossing my fingers, and we'll see when/what/where the next
crash is going to occur. And work from there....

--WjW