Date:      Tue, 23 Feb 2010 18:44:41 +0100
From:      Harald Schmalzbauer <h.schmalzbauer@omnilan.de>
To:        Alexander Motin <mav@FreeBSD.org>
Cc:        freebsd-stable@FreeBSD.org
Subject:   Re: ahcich timeouts, only with ahci, not with ataahci
Message-ID:  <4B841409.5070603@omnilan.de>
In-Reply-To: <4B8411EE.5030909@FreeBSD.org>
References:  <1266934981.00222684.1266922202@10.7.7.3> <4B83EFD4.8050403@FreeBSD.org> <4B83FD62.2020407@omnilan.de> <4B83FFEF.7010509@FreeBSD.org> <4B840C54.3010304@omnilan.de> <4B8411EE.5030909@FreeBSD.org>

Alexander Motin wrote on 23.02.2010 18:35 (localtime):
...
>> One understanding question: if the drive doesn't complete a command,
>> regardless of whether that's due to a firmware bug, a disk surface
>> error, or whatever, is there no way for the driver to terminate the
>> request and take the drive offline after some time? That would be
>> very important behaviour for me. It doesn't make sense to build RAIDZ
>> storage when one failing drive hangs the whole machine, even if the
>> system partitions are on a completely different SSD.
>
> That's what timeouts are used for. When a timeout is detected, the
> driver resets the device and reports an error to the upper layer.
> After receiving the error, CAM reinitializes the device. If the device
> is completely dead, reinitialization will fail and the device will be
> dropped immediately. If the device is still alive, reinitialization
> succeeds and CAM retries the command. If all retries fail, the error
> is reported to the GEOM layer and then possibly to the file system. I
> have no idea how RAIDZ behaves in such a case; maybe after a few such
> errors it should drop that device out of the array.
>
> A timeout is the worst possible case for any device, as it takes too
> much time and yields no recovery information. The half-dead case is
> the worst possible kind of timeout. It is difficult to say which way
> is better: drop the last drive from a degraded array and lose all the
> data, or retry forever. There is probably no right answer.

I see. Thanks a lot for the clarification.
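
For my own reference, the reset/retry behaviour you describe seems
tunable on the new ATA/CAM stack. A quick check, assuming I have the
ada(4) sysctl names right on my 8-STABLE box:

  # seconds CAM waits before declaring a command timed out,
  # and how many times it retries after a reset
  sysctl kern.cam.ada.default_timeout
  sysctl kern.cam.ada.retry_count

  # the resets themselves show up on the console as ahcich timeouts
  dmesg | grep ahcich
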
Before getting the machine on site I did some ZFS tests, like removing
one disk while a cvs checkout was running.
I remember that ZFS didn't show the removed drive as offline, but there
was no hang. The pool was degraded, and after reinserting the disk and
rebooting I could resilver the pool. I couldn't manage to get it
consistent without rebooting, but I accepted that, since I would have
to walk on site to change the drive anyway.
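
(For what it's worth, the recovery went roughly like this; 'tank' and
'ada2' are placeholders for my pool and the pulled disk:)

  # the pulled disk shows up as REMOVED/FAULTED
  zpool status tank

  # after reinserting and rebooting, reattach it and resilver
  zpool online tank ada2
  zpool status tank    # reports a resilver in progress
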
I'll restore the default vfs.zfs.txg.timeout=30 so the hang can be
easily reproduced, and see if I can 'camcontrol stop' the drive,
roughly as sketched below. Do you think I can get useful information
from that test?
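
Concretely, I mean something like this (device and pool names are again
just placeholders from my setup):

  # restore the default transaction group commit interval
  sysctl vfs.zfs.txg.timeout=30

  # spin down one pool member and watch whether the box hangs
  # or CAM drops the device cleanly
  camcontrol stop ada2
  zpool status tank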

Thanks,

-Harry

