Subject: Re: zfs, cam sticking on failed disk
From: Paul Mather
Date: Thu, 7 May 2015 09:56:11 -0400
To: Steven Hartland
Cc: Slawa Olhovchenkov, freebsd-stable@freebsd.org

On May 7, 2015, at 8:00 AM, Steven Hartland wrote:

> On 07/05/2015 11:46, Slawa Olhovchenkov wrote:
>> On Thu, May 07, 2015 at 11:38:46AM +0100, Steven Hartland wrote:
>>
>>>>>> How can I cancel these 24 requests?
>>>>>> Why have these requests not timed out (3 hours already)?
>>>>>> How can I force this disk to detach? (I have already tried `camcontrol reset` and `camcontrol rescan`.)
>>>>>> Why doesn't ZFS (or GEOM) time out the request and reroute it to da18?
>>>>>>
>>>>> If they are in mirrors, in theory you can just pull the disk: isci will
>>>>> report to CAM and CAM will report to ZFS, which should all recover.
>>>> Yes, it is a ZFS mirror with da18.
>>>> I am surprised that ZFS doesn't use da18. The whole zpool is stuck.
>>> A single low-level request can only be handled by one device; if that
>>> device returns an error then ZFS will use the other device, but not until then.
>> Why aren't subsequent requests routed to da18?
>> The current request is stuck on da19 (unlikely, but understandable), but why
>> is the whole pool stuck?
>
> It's still waiting for the request from the failed device to complete. As far as ZFS currently knows there is nothing wrong with the device, as it has had no failures.

Maybe related to this, but if a drive stalls indefinitely, is that what leads to the "panic: I/O to pool 'poolname' appears to be hung on vdev guid GUID-ID at '/dev/somedevice'" panic?

I have a 6-disk RAIDZ2 pool that is used for nightly rsync backups from various systems. I believe one of the drives is a bit temperamental. Very occasionally, I discover the backup has failed and the machine has actually panicked because of this drive, with a panic message like the one above. The panic backtrace includes references to vdev_deadman, which sounds like some sort of dead man's switch/watchdog.
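
In case it is relevant, the knobs I can find for that watchdog on my 10.1-RELEASE box are the ones sketched below. I have not experimented with them, so treat this purely as a sketch: the names, defaults, and whether they are settable at runtime may differ between releases, and disabling the watchdog would only silence the panic, not un-wedge the stalled I/O.

    # Read the current deadman settings:
    sysctl vfs.zfs.deadman_enabled vfs.zfs.deadman_synctime_ms

    # Hypothetical /boot/loader.conf entry to stop the deadman panic
    # (untested here; it masks the symptom rather than fixing the stalled disk):
    vfs.zfs.deadman_enabled="0"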
It's a bit counterintuitive that a single drive wandering off into la-la land can not only cause an entire ZFS pool to wedge but, worse still, panic the whole machine.

If I'm understanding this thread correctly, part of the problem is that an I/O that never completes is not the same as a failure as far as ZFS is concerned, so ZFS can't call upon the redundancy in the pool and the other mechanisms at its disposal to correct for it. Is that accurate?

I would have thought that never-completing I/O requests would be a type of failure ZFS could sustain. It seems from the "hung on vdev" panic that it does detect the situation, though the resolution (a panic) is not ideal. :-)

Cheers,

Paul.
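
P.S. In case it is useful to anyone else following the thread: a drive that has wandered off tends to stand out in the per-device GEOM statistics even while the pool still looks healthy. Something along these lines (da19 and "tank" are just placeholder names) is what I would look at:

    gstat -f da19            # GEOM stats for one disk; a hung disk tends to sit at or near 100% busy
    zpool iostat -v tank 5   # per-vdev read/write ops and bandwidth, refreshed every 5 seconds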