Date: Thu, 7 May 2015 09:56:11 -0400
From: Paul Mather <paul@gromit.dlib.vt.edu>
To: Steven Hartland <killing@multiplay.co.uk>
Cc: Slawa Olhovchenkov <slw@zxy.spb.ru>, freebsd-stable@freebsd.org
Subject: Re: zfs, cam sticking on failed disk
Message-ID: <51E7F693-AA33-4BDD-8CEA-769D8EC20D36@gromit.dlib.vt.edu>
In-Reply-To: <554B53E8.4000508@multiplay.co.uk>
References: <20150507080749.GB1394@zxy.spb.ru> <554B2547.1090307@multiplay.co.uk> <20150507095048.GC1394@zxy.spb.ru> <554B40B6.6060902@multiplay.co.uk> <20150507104655.GT62239@zxy.spb.ru> <554B53E8.4000508@multiplay.co.uk>
On May 7, 2015, at 8:00 AM, Steven Hartland <killing@multiplay.co.uk> wrote:

> On 07/05/2015 11:46, Slawa Olhovchenkov wrote:
>> On Thu, May 07, 2015 at 11:38:46AM +0100, Steven Hartland wrote:
>>
>>>>>> How can I cancel these 24 requests?
>>>>>> Why don't these requests time out (3 hours already)?
>>>>>> How can I force-detach this disk? (I have already tried `camcontrol reset`, `camcontrol rescan`.)
>>>>>> Why doesn't ZFS (or geom) time out the request and reroute it to da18?
>>>>>>
>>>>> If they are in mirrors, in theory you can just pull the disk, isci will
>>>>> report to cam and cam will report to ZFS, which should all recover.
>>>> Yes, zmirror with da18.
>>>> I am surprised that ZFS doesn't use da18. The whole zpool is stuck.
>>> A single low-level request can only be handled by one device; if that
>>> device returns an error then ZFS will use the other device, but not until then.
>> Why aren't subsequent requests routed to da18?
>> The current request is stuck on da19 (unlikely, but understandable), but why
>> is the whole pool stuck?
>
> It's still waiting for the request from the failed device to complete. As far as ZFS currently knows there is nothing wrong with the device, as it has had no failures.

Maybe related to this, but if the drive stalls indefinitely, is that what leads to the "panic: I/O to pool 'poolname' appears to be hung on vdev guid GUID-ID at '/dev/somedevice'."?

I have a 6-disk RAIDZ2 pool that is used for nightly rsync backups from various systems.  I believe one of the drives is a bit temperamental.  Very occasionally, I discover the backup has failed and the machine actually panicked because of this drive, with a panic message like the above.  The panic backtrace includes references to vdev_deadman, which sounds like some sort of dead man's switch/watchdog.

It's a bit counter-intuitive that a single drive wandering off into la-la land can not only cause an entire ZFS pool to wedge but, worse still, panic the whole machine.

If I'm understanding this thread correctly, part of the problem is that an I/O that never completes is not the same as a failure to ZFS, and hence ZFS can't call upon the various resources and mechanisms in the pool at its disposal to correct for it.  Is that accurate?

I would think that never-ending I/O requests would be a type of failure that ZFS could sustain.  It seems from the "hung on vdev" panic that it does detect this situation, though the resolution (panic) is not ideal. :-)

Cheers,

Paul.
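For anyone hitting the same deadman panic: the watchdog Paul describes is driven by sysctl tunables in FreeBSD's ZFS port. Below is a minimal sketch of inspecting and relaxing it, assuming the stock vfs.zfs.deadman_* tunable names from FreeBSD 10.x and the stalled da19 device from this thread; verify the names and defaults on your own release before relying on them.

    # Show the deadman watchdog settings (assumed names; check `sysctl -a | grep deadman`):
    sysctl vfs.zfs.deadman_enabled vfs.zfs.deadman_synctime_ms vfs.zfs.deadman_checktime_ms

    # Disable the panic-on-hung-I/O behaviour until the next reboot.
    # The pool can still wedge on a stalled drive, but the machine stays up:
    sysctl vfs.zfs.deadman_enabled=0

    # See how many commands are outstanding on the stalled disk (da19 in this thread):
    camcontrol tags da19 -v

Note that disabling the watchdog only trades the panic for an indefinitely hung pool; the stalled I/O still has to be cleared at the CAM/controller level (or by pulling the drive) before ZFS can fail over to the healthy mirror side.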