Date: Wed, 05 May 2010 11:03:52 -0400
From: Steve Polyack <korvus@comcast.net>
To: Harald Schmalzbauer <h.schmalzbauer@omnilan.de>
Cc: FreeBSD Stable <freebsd-stable@freebsd.org>
Subject: Re: ZFS (zpool) doesn't detect failed drive
Message-ID: <4BE188D8.3090301@comcast.net>
In-Reply-To: <4BE18729.3050209@omnilan.de>
References: <4BE16784.8050400@omnilan.de> <4BE18729.3050209@omnilan.de>

On 05/05/10 10:56, Harald Schmalzbauer wrote:
> Harald Schmalzbauer wrote on 05.05.2010 14:41 (localtime):
>> Hello,
>>
>> one drive of my mirror failed today, but 'zpool status' shows it
>> "online".
>> Every process using a ZFS mount hangs. Also 'zpool offline /dev/ad1'
>> hangs infinitely.
> ...
> Sorry, I made an error with zpool create. Somehow the little word
> "mirror" must have been lost. So the pool wasn't a mirror but a
> stripe. Then of course I can't make one vdev offline. Sorry for the
> noise.
> But I took the opportunity to do some tests with that failing drive
> and created a _real_ mirror. That works without failures, but using
> the mirror again leads to:
> ad1: TIMEOUT - FLUSHCACHE48 retrying (1 retry left)
> ad1: TIMEOUT - FLUSHCACHE48 retrying (1 retry left)
> ad1: TIMEOUT - FLUSHCACHE48 retrying (1 retry left)
> ad1: TIMEOUT - FLUSHCACHE48 retrying (1 retry left)
> ad1: TIMEOUT - FLUSHCACHE48 retrying (1 retry left)
> ad1: TIMEOUT - FLUSHCACHE48 retrying (1 retry left)
> ata3: port is not ready (timeout 10000ms) tfd = 00000080
> ata3: hardware reset timeout
> ad1: FAILURE - device detached
>
> Now zpool reports the vdev ad1 still online although it has been
> detached and 'atacontrol list' doesn't show it anymore:
>
> zpool status
>   pool: URUBAmirrorP1
>  state: ONLINE
> status: One or more devices has experienced an unrecoverable error.  An
>         attempt was made to correct the error.  Applications are
>         unaffected.
> action: Determine if the device needs to be replaced, and clear the errors
>         using 'zpool clear' or replace the device with 'zpool replace'.
>    see: http://www.sun.com/msg/ZFS-8000-9P
>  scrub: none requested
> config:
>
>         NAME             STATE     READ WRITE CKSUM
>         URUBAmirrorP1    ONLINE       0     0     0
>           mirror         ONLINE       0     0     0
>             ad1          ONLINE       3  302K     0
>             ad2          ONLINE       0     0     0
>
> errors: No known data errors
>
> atacontrol list
> ATA channel 2:
>     Master:  ad0 <TRANSCEND/20090520> SATA revision 1.x
>     Slave:       no device present
> ATA channel 3:
>     Master:      no device present
>     Slave:       no device present
> ATA channel 4:
>     Master:  ad2 <SAMSUNG HD154UI/1AG01118> SATA revision 2.x
>     Slave:       no device present
> ATA channel 5:
>     Master:  ad3 <ST3750640NS/3.AEG> SATA revision 1.x
>     Slave:       no device present
>
> How should such a failure be handled?
> Do I have to manually mark the drive offline for zpool?
>
> Thanks,
>
> -Harry
>

You may want to try newer controller drivers like ahci(4) if possible.
Otherwise, building the kernel with ATA_CAM may accomplish something
similar. I'm not sure, but I'm speculating that the newer ATA/CAM system
may feed the proper notifications back to the ZFS systems.

I use many drives on the siis(4) driver, which is CAM-enabled, and
haven't had any issues. However, I have not had an outright drive
failure. I do recall testing situations where we would yank a working
drive, and I seem to remember it working correctly after the last set of
CAM improvements.

It may not be something you can try on a production system, but if you
can experiment, it's worth a shot. Note that your device names WILL
change to adaX instead of adX. I would definitely recommend that you use
glabel(8) and create the zpool/vdevs using the glabel devices instead, to
circumvent any future problems associated with device numbering.

Steve
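
The glabel(8) suggestion above could look roughly like the sketch below.
The device names (ada1, ada2), the label names (disk1, disk2) and the
pool name (tank) are placeholders that do not come from this thread, and
run and label_and_mirror are hypothetical helpers written only for this
example. By default the script just prints the commands; setting
DRY_RUN=0 would really execute them, and zpool create then claims the
named disks.

```shell
#!/bin/sh
# Sketch only: label disks with glabel(8), then build a ZFS mirror from
# the labels instead of the raw adX/adaX names.
# All device, label and pool names here are assumed placeholders.

DRY_RUN="${DRY_RUN:-1}"

# Print the command in dry-run mode, execute it otherwise.
run() {
    if [ "$DRY_RUN" = "1" ]; then
        echo "$*"
    else
        "$@"
    fi
}

# usage: label_and_mirror <pool> <dev> <dev> ...
label_and_mirror() {
    pool="$1"
    shift
    vdevs=""
    i=1
    for dev in "$@"; do
        # Give each provider a stable name under /dev/label/.
        run glabel label "disk$i" "/dev/$dev"
        vdevs="$vdevs label/disk$i"
        i=$((i + 1))
    done
    # The "mirror" keyword matters: without it the disks form a stripe.
    # $vdevs is left unquoted on purpose so it splits into one word per label.
    run zpool create "$pool" mirror $vdevs
}

label_and_mirror tank ada1 ada2
```

With labels in place, the pool should refer to label/disk1 and
label/disk2, so a later change from adX to adaX naming should not affect
how the pool finds its members.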