Date:      Thu, 7 May 2015 09:56:11 -0400
From:      Paul Mather <paul@gromit.dlib.vt.edu>
To:        Steven Hartland <killing@multiplay.co.uk>
Cc:        Slawa Olhovchenkov <slw@zxy.spb.ru>, freebsd-stable@freebsd.org
Subject:   Re: zfs, cam sticking on failed disk
Message-ID:  <51E7F693-AA33-4BDD-8CEA-769D8EC20D36@gromit.dlib.vt.edu>
In-Reply-To: <554B53E8.4000508@multiplay.co.uk>
References:  <20150507080749.GB1394@zxy.spb.ru> <554B2547.1090307@multiplay.co.uk> <20150507095048.GC1394@zxy.spb.ru> <554B40B6.6060902@multiplay.co.uk> <20150507104655.GT62239@zxy.spb.ru> <554B53E8.4000508@multiplay.co.uk>

On May 7, 2015, at 8:00 AM, Steven Hartland <killing@multiplay.co.uk> wrote:

> On 07/05/2015 11:46, Slawa Olhovchenkov wrote:
>> On Thu, May 07, 2015 at 11:38:46AM +0100, Steven Hartland wrote:
>>
>>>>>> How can I cancel these 24 requests?
>>>>>> Why don't these requests time out (3 hours already)?
>>>>>> How can I force-detach this disk? (I have already tried
>>>>>> `camcontrol reset` and `camcontrol rescan`.)
>>>>>> Why doesn't ZFS (or GEOM) time out the request and reroute it to
>>>>>> da18?
>>>>>>
>>>>> If they are in mirrors, in theory you can just pull the disk; isci
>>>>> will report to CAM, and CAM will report to ZFS, which should all
>>>>> recover.
>>>> Yes, a mirror with da18.
>>>> I am surprised that ZFS doesn't use da18. The whole zpool is
>>>> completely stuck.
>>> A single low-level request can only be handled by one device; if
>>> that device returns an error then ZFS will use the other device, but
>>> not until then.
>> Why aren't subsequent requests routed to da18?
>> The current request is stuck on da19 (unlucky, but understandable),
>> but why is the whole pool stuck?
>
> It's still waiting for the request from the failed device to complete.
> As far as ZFS currently knows there is nothing wrong with the device,
> as it has had no failures.
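
(As an aside: when I've wanted to see where requests like these are
actually sitting, I've found it useful to poke at the CAM and GEOM
layers directly.  A rough sketch, with da19 standing in for the wedged
device from this thread:

    # outstanding command counts for the device at the CAM layer
    camcontrol tags da19 -v

    # per-provider queue depth and latency at the GEOM layer
    gstat -f 'da19'

    # what ZFS itself currently believes about its pools
    zpool status -v

If the commands really are stuck below ZFS, the dev_active count from
"camcontrol tags" tends to stay pegged while gstat shows the queue never
draining -- though if the pool is already wedged, zpool status itself
may block.)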


Maybe related to this, but if the drive stalls indefinitely, is that
what leads to the "panic: I/O to pool 'poolname' appears to be hung on
vdev guid GUID-ID at '/dev/somedevice'." panic?

I have a 6-disk RAIDZ2 pool that is used for nightly rsync backups from
various systems.  I believe one of the drives is a bit temperamental.
Very occasionally, I discover the backup has failed and the machine
actually panicked because of this drive, with a panic message like the
above.  The panic backtrace includes references to vdev_deadman, which
sounds like some sort of dead man's switch/watchdog.
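
(For what it's worth, the deadman behaviour appears to be tunable via
sysctl; the knobs below are what I understand to exist on 10.x, so
treat the exact names and semantics as my assumption rather than
gospel:

    # whether the deadman logic is active at all
    sysctl vfs.zfs.deadman_enabled

    # how long an I/O may stay outstanding before it is declared hung,
    # in milliseconds
    sysctl vfs.zfs.deadman_synctime_ms

    # e.g. to stop the machine panicking while debugging a flaky drive
    sysctl vfs.zfs.deadman_enabled=0

Disabling it obviously just trades the panic for a silently hung pool,
so it is more of a diagnostic aid than a fix.)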

It's a bit counter-intuitive that a single drive wandering off into
la-la land can not only cause an entire ZFS pool to wedge, but, worse
still, panic the whole machine.

If I'm understanding this thread correctly, part of the problem is that
an I/O never completing is not the same as a failure to ZFS, and hence
ZFS can't call upon various resources in the pool and mechanisms at its
disposal to correct for that.  Is that accurate?

I would think that never-ending I/O requests would be a type of failure
that ZFS could sustain.  It seems from the "hung on vdev" panic that it
does detect this situation, though the resolution (panic) is not ideal.
:-)
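
(In a less wedged situation the textbook response would presumably be
to take the suspect drive out of service administratively before it can
hang the pool; a sketch, with "backup" and da5 as made-up pool and
device names:

    # stop ZFS issuing I/O to the suspect drive
    zpool offline backup da5

    # ...test or reseat the drive, then bring it back...
    zpool online backup da5

    # ...or replace it outright with a spare disk
    zpool replace backup da5 da6

Whether any of that helps once the drive has already wedged the pool is
another matter, since the zpool commands themselves can end up blocked
behind the stuck I/O.)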

Cheers,

Paul.


