Subject: Re: zfs, cam sticking on failed disk
From: Paul Mather
Date: Thu, 7 May 2015 09:56:11 -0400
To: Steven Hartland
Cc: Slawa Olhovchenkov, freebsd-stable@freebsd.org

On May 7, 2015, at 8:00 AM, Steven Hartland wrote:

> On 07/05/2015 11:46, Slawa Olhovchenkov wrote:
>> On Thu, May 07, 2015 at 11:38:46AM +0100, Steven Hartland wrote:
>>
>>>>>> How can I cancel these 24 requests?
>>>>>> Why have these requests not timed out (3 hours already)?
>>>>>> How can I force this disk to detach? (I have already tried `camcontrol reset` and `camcontrol rescan`.)
>>>>>> Why doesn't ZFS (or GEOM) time out the request and reroute it to da18?
>>>>>>
>>>>> If they are in mirrors, in theory you can just pull the disk: isci will
>>>>> report to CAM and CAM will report to ZFS, which should all recover.
>>>> Yes, it is a ZFS mirror with da18.
>>>> I am surprised that ZFS doesn't use da18. The whole zpool is stuck.
>>> A single low-level request can only be handled by one device; if that
>>> device returns an error then ZFS will use the other device, but not until then.
>> Why aren't subsequent requests routed to da18?
>> The current request is stuck on da19 (unlikely, but understandable), but why
>> is the whole pool stuck?
>
> It's still waiting for the request from the failed device to complete. As far as ZFS currently knows there is nothing wrong with the device, as it has had no failures.

Maybe related to this, but if a drive stalls indefinitely, is that what leads to the "panic: I/O to pool 'poolname' appears to be hung on vdev guid GUID-ID at '/dev/somedevice'" panic?

I have a 6-disk RAIDZ2 pool that is used for nightly rsync backups from various systems. I believe one of the drives is a bit temperamental. Very occasionally, I discover the backup has failed and the machine has actually panicked because of this drive, with a panic message like the one above. The panic backtrace includes references to vdev_deadman, which sounds like some sort of dead man's switch/watchdog.
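
In case it is relevant, the knobs I can find for that watchdog on my 10.1-RELEASE box are the ones sketched below. I have not experimented with them, so treat this purely as a sketch: the names, defaults, and whether they are settable at runtime may differ between releases, and disabling the watchdog would only silence the panic, not un-wedge the stalled I/O.

    # Read the current deadman settings:
    sysctl vfs.zfs.deadman_enabled vfs.zfs.deadman_synctime_ms

    # Hypothetical /boot/loader.conf entry to stop the deadman panic
    # (untested here; it masks the symptom rather than fixing the stalled disk):
    vfs.zfs.deadman_enabled="0"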
It's a bit counterintuitive that a single drive wandering off into la-la land can not only cause an entire ZFS pool to wedge but, worse still, panic the whole machine.

If I'm understanding this thread correctly, part of the problem is that an I/O that never completes is not the same as a failure as far as ZFS is concerned, so ZFS can't call upon the redundancy in the pool and the other mechanisms at its disposal to correct for it. Is that accurate?

I would have thought that never-completing I/O requests would be a type of failure ZFS could sustain. It seems from the "hung on vdev" panic that it does detect the situation, though the resolution (a panic) is not ideal. :-)

Cheers,

Paul.
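
P.S. In case it is useful to anyone else following the thread: a drive that has wandered off tends to stand out in the per-device GEOM statistics even while the pool still looks healthy. Something along these lines (da19 and "tank" are just placeholder names) is what I would look at:

    gstat -f da19            # GEOM stats for one disk; a hung disk tends to sit at or near 100% busy
    zpool iostat -v tank 5   # per-vdev read/write ops and bandwidth, refreshed every 5 seconds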