From: Quartz
Date: Sun, 21 Jun 2015 20:28:27 -0400
To: Willem Jan Withagen
CC: freebsd-fs@freebsd.org
Subject: Re: This diskfailure should not panic a system, but just disconnect disk from ZFS
Message-ID: <558756AB.405@sneakertech.com>
In-Reply-To: <55874772.4090607@digiware.nl>

> But especially the hung disk during reading

Writing is the bigger issue.
At least, if you set your failmode to 'continue', ZFS will try to honor reads for as long as it's able, but writes will block. (In practice, though, it'll usually only give you an extra minute or so before everything locks up.)

> We'll the pool did not die, (at least not IMHO)

Sorry, that's bad wording on my part. What I meant was that IO to the pool died.

> just one disk stopt
> working....

It would have to be 3+ disks in your case, with a raidz2.

> I guess that if I like to live dangerously, I could set enabled to 0,
> and run the risk... ??

Well, that will just disable the auto panic. If the IO disappears into a black hole due to a hardware issue, the machine will just stay hung forever until you manually press the reset button on the front. ZFS will prevent any major corruption of the pool, so it's not really "dangerous" (outside of further hardware failures).

> But still I would expect the volume to become degraded if one of the
> disks goes into the error state?

If *one* of the disks drops out, yes. If a second drops out later, also yes, because ZFS can still handle IO to the pool. But as soon as a third disk drops out in a way that locks up IO, ZFS freezes.

For reference, I had a raidz2 test case with 6 drives. I could yank the SATA cable off two of the drives and the pool would be marked as degraded, but as soon as I yanked a third drive everything froze. This is why I heavily suspect that in your case the controller or PSU is failing and dropping multiple disks at a time. The fact that the log reports da0 is probably just because that was the last disk ZFS tried to fall back on when they all dropped out at once.

Ideally, the system *should* handle this situation gracefully, but the reality is that it doesn't. If the last disk fails in a way that hangs IO, it takes the whole machine with it. No system configuration change can prevent this, not with how things are currently designed.
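To make the raidz2 arithmetic above concrete, here is a rough Python sketch of how many dropped disks a raidz vdev tolerates. It is only an illustration of the behaviour described in this message, not anything ZFS itself exposes: the function name `vdev_state` and the `IO_BLOCKED` label are made up for the example, and it assumes a single raidz vdev.

```python
# Illustrative only: models the behaviour described above for a single
# raidz vdev. The names below are hypothetical, not part of any ZFS API.

# Number of member disks each vdev type can lose and still serve IO.
PARITY = {"raidz1": 1, "raidz2": 2, "raidz3": 3}


def vdev_state(vdev_type: str, total_disks: int, failed_disks: int) -> str:
    """Return an illustrative state label for one raidz vdev."""
    if vdev_type not in PARITY:
        raise ValueError(f"unknown vdev type: {vdev_type}")
    if not 0 <= failed_disks <= total_disks:
        raise ValueError("failed_disks must be between 0 and total_disks")

    if failed_disks == 0:
        return "ONLINE"
    if failed_disks <= PARITY[vdev_type]:
        # Redundancy still covers the missing disks.
        return "DEGRADED"
    # More disks gone than parity can cover: IO to the pool stops.
    return "IO_BLOCKED"


if __name__ == "__main__":
    # The 6-drive raidz2 test case from this message.
    for failed in range(4):
        print(failed, vdev_state("raidz2", 6, failed))
```

With the 6-drive raidz2 case, zero failures is healthy, one or two leave the pool degraded, and the third is the point where IO stops.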
> This article is mainly about forcasting disk failure based on SMART
> numbers....
> I was just looking at the counters to see if the disk had logged just
> any fact of info/warning/error

What Google found out is that a lot of disks *don't* report errors or warnings before experiencing problems. In other words, SMART saying "all good" doesn't really mean much in practice, so you shouldn't really rely on it for diagnostics.