From owner-freebsd-fs@FreeBSD.ORG  Mon Jun 22 00:46:48 2015
Return-Path: <owner-freebsd-fs@FreeBSD.ORG>
Delivered-To: freebsd-fs@nevdull.freebsd.org
Received: from mx1.freebsd.org (mx1.freebsd.org [8.8.178.115])
 (using TLSv1.2 with cipher AECDH-AES256-SHA (256/256 bits))
 (No client certificate requested)
 by hub.freebsd.org (Postfix) with ESMTPS id 28457572
 for <freebsd-fs@nevdull.freebsd.org>; Mon, 22 Jun 2015 00:46:48 +0000 (UTC)
 (envelope-from quartz@sneakertech.com)
Received: from douhisi.pair.com (unknown [IPv6:2607:f440::d144:5b3])
 (using TLSv1.2 with cipher ECDHE-RSA-AES256-GCM-SHA384 (256/256 bits))
 (Client did not present a certificate)
 by mx1.freebsd.org (Postfix) with ESMTPS id 009C3F57
 for <freebsd-fs@freebsd.org>; Mon, 22 Jun 2015 00:46:47 +0000 (UTC)
 (envelope-from quartz@sneakertech.com)
Received: from [10.2.2.1] (pool-173-48-121-235.bstnma.fios.verizon.net
 [173.48.121.235])
 by douhisi.pair.com (Postfix) with ESMTPSA id 9371B3F715;
 Sun, 21 Jun 2015 20:28:27 -0400 (EDT)
Message-ID: <558756AB.405@sneakertech.com>
Date: Sun, 21 Jun 2015 20:28:27 -0400
From: Quartz <quartz@sneakertech.com>
User-Agent: Mozilla/5.0 (Macintosh; Intel Mac OS X 10.6;
 rv:10.0.2) Gecko/20120216 Thunderbird/10.0.2
MIME-Version: 1.0
To: Willem Jan Withagen <wjw@digiware.nl>
CC: freebsd-fs@freebsd.org
Subject: Re: This diskfailure should not panic a system, but just disconnect
 disk from ZFS
References: <5585767B.4000206@digiware.nl> <558590BD.40603@isletech.net>
 <5586C396.9010100@digiware.nl> <55871F4C.5010103@sneakertech.com>
 <55874772.4090607@digiware.nl>
In-Reply-To: <55874772.4090607@digiware.nl>
Content-Type: text/plain; charset=windows-1252; format=flowed
Content-Transfer-Encoding: 7bit
X-BeenThere: freebsd-fs@freebsd.org
X-Mailman-Version: 2.1.20
Precedence: list
List-Id: Filesystems <freebsd-fs.freebsd.org>
List-Unsubscribe: <http://lists.freebsd.org/mailman/options/freebsd-fs>,
 <mailto:freebsd-fs-request@freebsd.org?subject=unsubscribe>
List-Archive: <http://lists.freebsd.org/pipermail/freebsd-fs/>
List-Post: <mailto:freebsd-fs@freebsd.org>
List-Help: <mailto:freebsd-fs-request@freebsd.org?subject=help>
List-Subscribe: <http://lists.freebsd.org/mailman/listinfo/freebsd-fs>,
 <mailto:freebsd-fs-request@freebsd.org?subject=subscribe>
X-List-Received-Date: Mon, 22 Jun 2015 00:46:48 -0000

> But especially the hung disk during reading

Writing is more of the issue. At least, if you set your failmode to 
'continue', ZFS will try to honor reads as long as it's able, but 
writes will block. (In practice, though, it'll usually only give you an 
extra minute or so before everything locks up.)


> We'll the pool did not die, (at least not IMHO)

Sorry, that's bad wording on my part. What I meant was that IO to the 
pool died.


>just one disk stopt
> working....

It would have to be 3+ disks in your case, with a raidz2.


> I guess that if I like to live dangerously, I could set enabled to 0,
> and run the risk... ??

Well, that will just disable the auto panic. If the IO disappears into 
a black hole due to a hardware issue, the machine will just stay hung 
until you manually press the reset button on the front. ZFS will 
prevent any major corruption of the pool, so it's not really "dangerous" 
(outside of further hardware failures).


> But still I would expect the volume to become degraded if one of the
> disks goes into the error state?

If *one* of the disks drops out, yes. If a second drops out later, also 
yes, because ZFS can still handle IO to the pool. But as soon as that 
third disk drops out in a way that locks up IO, ZFS freezes.

For reference, I had a raidz2 test case with 6 drives. I could yank the 
SATA cable off two of the drives and the pool would be marked as 
degraded, but as soon as I yanked that third drive everything froze. 
This is why I heavily suspect that, in your case, your controller or PSU 
is failing and dropping multiple disks at a time. The log probably 
reports da0 only because that was the last disk ZFS tried to fall back 
on when they all dropped out at once.
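
To make the counting above concrete, here is a rough sketch in Python. It 
is only a toy model of a single raidz vdev, not anything ZFS actually 
runs: the function name, its parameters, and the "IO_BLOCKED" label are 
made up for illustration, and it ignores *how* a disk fails (a clean 
drop-out versus a hang), which matters a lot in practice.

```python
def raidz_vdev_state(parity: int, failed: int) -> str:
    """Toy classification of one raidz vdev by number of missing disks.

    parity: number of parity disks (2 for raidz2). Hypothetical parameter.
    failed: number of disks that have dropped out.
    """
    if parity < 1 or failed < 0:
        raise ValueError("parity must be >= 1 and failed must be >= 0")
    if failed == 0:
        return "ONLINE"
    if failed <= parity:
        # Redundancy still covers the missing disks, so IO keeps flowing.
        return "DEGRADED"
    # More disks are gone than parity can cover. "IO_BLOCKED" is an
    # illustrative label for "IO to the pool stops", not a zpool state.
    return "IO_BLOCKED"


# The 6-drive raidz2 test case: pull cables one at a time.
for failed in range(4):
    print(failed, raidz_vdev_state(parity=2, failed=failed))
```

For raidz2 this prints ONLINE for zero missing disks, DEGRADED for one 
and two, and IO_BLOCKED for three, which matches what I saw when pulling 
cables.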

Ideally, the system *should* handle this situation gracefully, but the 
reality is that it doesn't. If the last disk the pool needs fails in a 
way that hangs IO, it takes the whole machine with it. No system 
configuration change can prevent this, not with how things are 
currently designed.


> This article is mainly about forcasting disk failure based on SMART
> numbers....

> I was just looking at the counters to see if the disk had logged just
> any fact of info/warning/error

What Google found out is that a lot of disks *don't* report errors or 
warnings before experiencing problems. In other words, SMART saying "all 
good" doesn't really mean much in practice, so you shouldn't really rely 
on it for diagnostics.