Date:        Thu, 07 May 2015 13:46:40 +0100
From:        Steven Hartland <killing@multiplay.co.uk>
To:          Slawa Olhovchenkov <slw@zxy.spb.ru>
Cc:          freebsd-stable@freebsd.org
Subject:     Re: zfs, cam sticking on failed disk
Message-ID:  <554B5EB0.1080208@multiplay.co.uk>
In-Reply-To: <20150507124416.GD1394@zxy.spb.ru>
References:  <20150507080749.GB1394@zxy.spb.ru> <554B2547.1090307@multiplay.co.uk> <20150507095048.GC1394@zxy.spb.ru> <554B40B6.6060902@multiplay.co.uk> <20150507104655.GT62239@zxy.spb.ru> <554B53E8.4000508@multiplay.co.uk> <20150507120508.GX62239@zxy.spb.ru> <554B5BF9.8020709@multiplay.co.uk> <20150507124416.GD1394@zxy.spb.ru>
On 07/05/2015 13:44, Slawa Olhovchenkov wrote:
> On Thu, May 07, 2015 at 01:35:05PM +0100, Steven Hartland wrote:
>> On 07/05/2015 13:05, Slawa Olhovchenkov wrote:
>>> On Thu, May 07, 2015 at 01:00:40PM +0100, Steven Hartland wrote:
>>>> On 07/05/2015 11:46, Slawa Olhovchenkov wrote:
>>>>> On Thu, May 07, 2015 at 11:38:46AM +0100, Steven Hartland wrote:
>>>>>>>>> How can I cancel these 24 requests?
>>>>>>>>> Why don't these requests time out (3 hours already)?
>>>>>>>>> How can I force-detach this disk? (I already tried `camcontrol
>>>>>>>>> reset` and `camcontrol rescan`.)
>>>>>>>>> Why doesn't ZFS (or geom) time out the requests and reroute
>>>>>>>>> them to da18?
>>>>>>>> If they are in mirrors, in theory you can just pull the disk;
>>>>>>>> isci will report to CAM and CAM will report to ZFS, which
>>>>>>>> should all recover.
>>>>>>> Yes, a zmirror with da18.
>>>>>>> I am surprised that ZFS doesn't use da18. The whole zpool is
>>>>>>> stuck.
>>>>>> A single low-level request can only be handled by one device; if
>>>>>> that device returns an error then ZFS will use the other device,
>>>>>> but not until then.
>>>>> Why aren't subsequent requests routed to da18?
>>>>> The current request being stuck on da19 I can (reluctantly)
>>>>> understand, but why is the whole pool stuck?
>>>> It's still waiting for the request from the failed device to
>>>> complete. As far as ZFS currently knows there is nothing wrong with
>>>> the device, as it has had no failures.
>>> Can you explain some more?
>>> One request waiting, understood.
>>> I then issue another request. Some of the needed information is on
>>> the vdev with the failed disk. The failed disk is busier (its queue
>>> is longer), so why isn't the request routed to the mirror disk? Or,
>>> for metadata, to a less busy vdev?
>> As no error has been reported to ZFS, due to the stalled IO, there is
>> no failed vdev.
> I see that the device isn't failed (for both the OS and ZFS).
> I am not talking about a 'failed vdev'; I am talking about a 'busy
> vdev' or 'busy device'.
>
>> Yes, in theory new requests should go to the other vdev, but there
>> could be some dependency issues preventing that, such as a syncing
>> TXG.
> Currently this pool should have no write activity (from the
> application).
> What about going to the other (mirror) device in the same vdev?
> Same dependency?
Yes, if there's an outstanding TXG, then I believe all IO will stall.
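
For what it's worth, a rough sketch of the diagnostics I'd start with
in this situation. The device name da19 is from this thread; the pool
name "tank" is a placeholder for your actual pool:

    # Show CAM's tagged-command state for the stuck device (how many
    # commands are outstanding vs. how many the device allows):
    camcontrol tags da19 -v

    # Per-device GEOM IO statistics; a wedged disk shows queued
    # requests with no completions:
    gstat -p

    # Kernel stack traces for the ZFS worker threads, to see whether
    # the txg sync thread is blocked on the stalled IO:
    procstat -kk $(pgrep -S zfskern)

    # Tell the mirror to stop issuing new IO to the device:
    zpool offline tank da19

Note that `zpool offline` only prevents new IO being queued to the
device; it cannot cancel requests already sitting below ZFS in CAM, and
if the pool is wedged on a syncing TXG the offline itself may block.
Physically pulling the disk, so the outstanding requests complete with
errors, is what actually lets ZFS fail over to da18.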