Date: Thu, 7 May 2015 15:44:16 +0300
From: Slawa Olhovchenkov <slw@zxy.spb.ru>
To: Steven Hartland
Cc: freebsd-stable@freebsd.org
Subject: Re: zfs, cam sticking on failed disk
Message-ID: <20150507124416.GD1394@zxy.spb.ru>
In-Reply-To: <554B5BF9.8020709@multiplay.co.uk>

On Thu, May 07, 2015 at 01:35:05PM +0100, Steven Hartland wrote:

> On 07/05/2015 13:05, Slawa Olhovchenkov wrote:
> > On Thu, May 07, 2015 at 01:00:40PM +0100, Steven Hartland wrote:
> >
> >> On 07/05/2015 11:46, Slawa Olhovchenkov wrote:
> >>> On Thu, May 07, 2015 at 11:38:46AM +0100, Steven Hartland wrote:
> >>>
> >>>>>>> How can I cancel these 24 requests?
> >>>>>>> Why don't these requests time out (3 hours already)?
> >>>>>>> How can I force-detach this disk? (I have already tried
> >>>>>>> `camcontrol reset` and `camcontrol rescan`.)
> >>>>>>> Why doesn't ZFS (or geom) time out the request and reroute it
> >>>>>>> to da18?
> >>>>>>>
> >>>>>> If they are in mirrors, in theory you can just pull the disk; isci
> >>>>>> will report to cam and cam will report to ZFS, which should all
> >>>>>> recover.
> >>>>> Yes, a zmirror with da18.
> >>>>> I am surprised that ZFS doesn't use da18. The whole zpool is stuck.
> >>>> A single low-level request can only be handled by one device; if that
> >>>> device returns an error then ZFS will use the other device, but not
> >>>> until then.
> >>> Why aren't the next requests routed to da18?
> >>> The current request being stuck on da19 is unfortunate but
> >>> understandable, but why is the whole pool stuck?
> >> It's still waiting for the request from the failed device to complete.
> >> As far as ZFS currently knows there is nothing wrong with the device,
> >> as it has had no failures.
> > Can you explain some more?
> > One request waiting, understood. I then issue the next request, which
> > needs some information from the vdev with the failed disk. The failed
> > disk is busier (long queue), so why isn't the request routed to the
> > mirror disk? Or, for metadata, to a less busy vdev?
> As no error has been reported to ZFS, due to the stalled IO, there is no
> failed vdev.

I see that the device isn't failed (for both the OS and ZFS). I am not
talking about a 'failed vdev'; I am talking about a 'busy vdev' or a
'busy device'.

> Yes, in theory new requests should go to the other vdev, but there could
> be some dependency issues preventing that, such as a syncing TXG.

Currently this pool should have no write activity (from the application).
What about going to the other (mirror) device in the same vdev? Does the
same dependency apply there?
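
For the record, what I would try from the ZFS side to take the busy
mirror leg out of the data path is roughly the following (the pool name
'tank' is only a placeholder for the real pool name here):

    # show pool layout and any errors ZFS has seen so far
    zpool status -v tank

    # per-provider queue depth and latency; a stuck da19 should show up
    # as a large L(q) with growing ms/r
    gstat -p

    # outstanding command slots on the device from CAM's point of view
    camcontrol tags da19 -v

    # ask ZFS to stop issuing new I/O to the busy mirror leg
    zpool offline tank da19

Though if the stalled request is what is holding up the current TXG, I
would expect the `zpool offline` itself to block until that request
completes or returns an error.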