Date:      Wed, 05 May 2010 11:03:52 -0400
From:      Steve Polyack <korvus@comcast.net>
To:        Harald Schmalzbauer <h.schmalzbauer@omnilan.de>
Cc:        FreeBSD Stable <freebsd-stable@freebsd.org>
Subject:   Re: ZFS (zpool) doesn't detect failed drive
Message-ID:  <4BE188D8.3090301@comcast.net>
In-Reply-To: <4BE18729.3050209@omnilan.de>
References:  <4BE16784.8050400@omnilan.de> <4BE18729.3050209@omnilan.de>

On 05/05/10 10:56, Harald Schmalzbauer wrote:
> Harald Schmalzbauer wrote on 05.05.2010 14:41 (localtime):
>> Hello,
>>
>> one drive of my mirror failed today, but 'zpool status' shows it
>> "online".
>> Every process using a ZFS mount hangs. Also 'zpool offline /dev/ad1'
>> hangs indefinitely.
> ...
> Sorry, I made an error with zpool create. Somehow the little word 
> "mirror" must have been lost. So the pool wasn't a mirror but a 
> stripe. Then of course I can't take one vdev offline. Sorry for the
> noise.
> But I took the opportunity to do some tests with that failing drive 
> and created a _real_ mirror. That works without failures, but using 
> the mirror again leads to:
> ad1: TIMEOUT - FLUSHCACHE48 retrying (1 retry left)
> ad1: TIMEOUT - FLUSHCACHE48 retrying (1 retry left)
> ad1: TIMEOUT - FLUSHCACHE48 retrying (1 retry left)
> ad1: TIMEOUT - FLUSHCACHE48 retrying (1 retry left)
> ad1: TIMEOUT - FLUSHCACHE48 retrying (1 retry left)
> ad1: TIMEOUT - FLUSHCACHE48 retrying (1 retry left)
> ata3: port is not ready (timeout 10000ms) tfd = 00000080
> ata3: hardware reset timeout
> ad1: FAILURE - device detached
>
> Now zpool reports the vdev ad1 as still online although it has been
> detached and 'atacontrol list' doesn't show it anymore:
>
> zpool status
>   pool: URUBAmirrorP1
>  state: ONLINE
> status: One or more devices has experienced an unrecoverable error.  An
>         attempt was made to correct the error.  Applications are unaffected.
> action: Determine if the device needs to be replaced, and clear the errors
>         using 'zpool clear' or replace the device with 'zpool replace'.
>    see: http://www.sun.com/msg/ZFS-8000-9P
>  scrub: none requested
> config:
>
>         NAME        STATE     READ WRITE CKSUM
>         URUBAmirrorP1  ONLINE       0     0     0
>           mirror    ONLINE       0     0     0
>             ad1     ONLINE       3  302K     0
>             ad2     ONLINE       0     0     0
>
> errors: No known data errors
>
> atacontrol list
> ATA channel 2:
>     Master:  ad0 <TRANSCEND/20090520> SATA revision 1.x
>     Slave:       no device present
> ATA channel 3:
>     Master:      no device present
>     Slave:       no device present
> ATA channel 4:
>     Master:  ad2 <SAMSUNG HD154UI/1AG01118> SATA revision 2.x
>     Slave:       no device present
> ATA channel 5:
>     Master:  ad3 <ST3750640NS/3.AEG> SATA revision 1.x
>     Slave:       no device present
>
> How should such a failure be handled?
> Do I have to manually mark the drive offline for zpool?
>
> Thanks,
>
> -Harry
>
You may want to try newer controller drivers like ahci(4) if possible.
Otherwise, building the kernel with ATA_CAM may accomplish something
similar.  I'm only speculating, but the newer ATA/CAM system may feed
the proper notifications back to ZFS.
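
A minimal sketch of what trying this could look like, assuming FreeBSD
8.x with ahci(4) available as a loadable module and an AHCI-capable
controller.  The loader.conf line uses sh-compatible syntax; the kernel
option is shown only as a comment because it belongs in a custom kernel
configuration file, not in loader.conf.

```shell
# /boot/loader.conf: load the CAM-based ahci(4) driver at boot.
# Assumption: the controller supports AHCI; disks then show up as adaX.
ahci_load="YES"

# Alternative for controllers that stay on ata(4), set in a custom
# kernel configuration file and followed by a kernel rebuild:
#   options ATA_CAM
```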

I use many drives on the siis(4) driver, which is CAM-enabled, and 
haven't had any issues.  However, I have not had an outright drive 
failure.  I do recall testing situations where we would yank a working 
drive, and I seem to remember it working correctly after the last set of 
CAM improvements.
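
If you repeat that kind of pull test, one rough way to check whether
ZFS noticed is to compare the devices that 'zpool status' lists as
ONLINE against the device nodes that still exist.  The function below
is only a sketch: missing_online_vdevs is a made-up name, it assumes
the status output was saved to a file first, and it only handles plain
device names under mirror/raidz vdevs.

```shell
#!/bin/sh
# Print every pool member that zpool still reports as ONLINE although
# its device node is gone.
#   $1: file holding the output of 'zpool status'
#   $2: directory containing the device nodes (defaults to /dev)
missing_online_vdevs() {
    status_file="$1"
    dev_dir="${2:-/dev}"

    # Remember the pool name so its own config row is skipped, ignore
    # the mirror/raidz grouping rows, and keep rows marked ONLINE.
    awk '
        $1 == "pool:" { pool = $2; next }
        NF == 5 && $2 == "ONLINE" && $1 != pool &&
            $1 !~ /^(mirror|raidz)/ { print $1 }
    ' "$status_file" |
    while read -r name; do
        # A member without a device node has been detached.
        [ -e "$dev_dir/$name" ] || echo "$name"
    done
}
```

Usage would be along the lines of 'zpool status > /tmp/status' followed
by 'missing_online_vdevs /tmp/status'; any name it prints is a device
that ZFS has not yet marked as failed.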

It may not be something you can try on a production system, but if you
can experiment, it's worth a shot.  Note that your device names WILL
change to adaX instead of adX.  I would definitely recommend that you
label the disks with glabel(8) and create the zpool/vdevs using the
glabel devices instead, to circumvent any future problems associated
with device numbering.
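
As a sketch of the labeling step, assuming two blank disks: the pool
name, label names, and device names below are placeholders, and
label_and_mirror and run are made-up helper names.  Both glabel and
zpool create are destructive, so the script only prints the commands
unless DRY_RUN=0 is set.

```shell
#!/bin/sh
# Label two disks with glabel(8), then build a mirror from the labels
# so the pool no longer depends on adX/adaX device numbering.
DRY_RUN="${DRY_RUN:-1}"

# Execute the command only when DRY_RUN=0; otherwise just print it.
run() {
    if [ "$DRY_RUN" = "0" ]; then
        "$@"
    else
        echo "$@"
    fi
}

#   $1: pool name   $2, $3: disk device names (hypothetical examples)
label_and_mirror() {
    pool="$1"
    disk_a="$2"
    disk_b="$3"
    run glabel label "${pool}0" "/dev/${disk_a}"
    run glabel label "${pool}1" "/dev/${disk_b}"
    # The labels appear under /dev/label/ and are used as the vdevs.
    run zpool create "$pool" mirror "label/${pool}0" "label/${pool}1"
}
```

Calling 'label_and_mirror tank ada0 ada1' in the default dry-run mode
just lists the three commands so they can be reviewed first.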

Steve


