From owner-freebsd-fs@FreeBSD.ORG Mon Jun 22 01:21:53 2015
Message-ID: <55871F4C.5010103@sneakertech.com>
Date: Sun, 21 Jun 2015 16:32:12 -0400
From: Quartz <quartz@sneakertech.com>
To: Willem Jan Withagen
Cc: freebsd-fs@freebsd.org
Subject: Re: This diskfailure should not panic a system, but just disconnect disk from ZFS
In-Reply-To: <5586C396.9010100@digiware.nl>
References: <5585767B.4000206@digiware.nl> <558590BD.40603@isletech.net> <5586C396.9010100@digiware.nl>

> Or do I have too high hopes of ZFS?
> And is a hung disk a 'catastrophic pool failure'?

Yes to both. I ran into this exact issue a couple of years ago (and complained about it on this list as well, although I never got a complete answer at the time; I can dig up links to the conversation if anyone is interested).

The heart of the issue is how the kernel, the drivers, and ZFS handle I/O and DMA. There's currently no way to tell what state the disks are in, or which outstanding I/O to the pool can safely be dropped or ignored. As currently designed, there's no safe way to just kick out the pool and keep going, so the only options are to wait, panic, or wait and then panic. Fixing this would require a major rewrite of a lot of code, which isn't going to happen any time soon. The failmode setting and the deadman timer were implemented as a band-aid to keep the system from hanging forever. See this thread for more background:

http://comments.gmane.org/gmane.os.illumos.zfs/61

> All failmode settings result in a seriously handicapped system...

Yes. Again, this is a design flaw in how DMA is handled: there's no real way to continue gracefully when a pool dies completely due to hung I/O. We're all pretty much stuck with this problem, at least for quite a while.

> Is waiting only meant to wait a limited time? And then panic anyways?

By default, yes. However, if you know that on your system the issue will eventually resolve itself given several hours (and you're willing to wait that long), you can lengthen the deadman timeout or disable it completely. Look at "vfs.zfs.deadman_enabled" and "vfs.zfs.deadman_synctime".
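
For reference, here's a rough sketch of the knobs involved. "tank" is just a placeholder pool name, and the exact sysctl names vary between releases (the synctime knob shows up as vfs.zfs.deadman_synctime on some systems and vfs.zfs.deadman_synctime_ms on others), so double check on your own box:

  # Show the pool's current failmode (wait | continue | panic):
  zpool get failmode tank

  # Panic immediately on catastrophic failure instead of hanging:
  zpool set failmode=panic tank

  # Inspect the deadman settings:
  sysctl vfs.zfs.deadman_enabled
  sysctl vfs.zfs.deadman_synctime

  # Disable the deadman entirely:
  sysctl vfs.zfs.deadman_enabled=0

Note that on some releases the synctime value is a boot-time tunable, in which case it has to go in /boot/loader.conf rather than being set with sysctl at runtime.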