Date: Fri, 12 Sep 2008 17:44:27 +0200 (CEST) From: Oliver Fromme <olli@lurza.secnetix.de> To: freebsd-hackers@FreeBSD.ORG, kpielorz_lst@tdx.co.uk Subject: Re: ZFS w/failing drives - any equivalent of Solaris FMA? Message-ID: <200809121544.m8CFiRHQ099725@lurza.secnetix.de> In-Reply-To: <C984A6E7B1C6657CD8C4F79E@Slim64.dmpriest.net.uk>
Karl Pielorz wrote:
 > Recently, a ZFS pool on my FreeBSD box started showing lots of errors
 > on one drive in a mirrored pair.
 >
 > The pool consists of around 14 drives (as 7 mirrored pairs), hung off
 > of a couple of SuperMicro 8-port SATA controllers (1 drive of each
 > pair is on each controller).
 >
 > One of the drives started picking up a lot of errors (by the end of
 > things it was returning errors for pretty much any reads/writes
 > issued) - and taking ages to complete the I/Os.
 >
 > However, ZFS kept trying to use the drive - e.g. as I attached another
 > drive to the remaining 'good' drive in the mirrored pair, ZFS was
 > still trying to read data off the failed drive (and the remaining good
 > one) in order to complete its re-silver to the newly attached drive.
 >
 > Having posted on the OpenSolaris ZFS list - it appears that under
 > Solaris there's an 'FMA engine' which communicates drive failures and
 > the like to ZFS - advising ZFS when a drive should be marked as
 > 'failed'.
 >
 > Is there anything similar to this on FreeBSD yet? I.e. does/can
 > anything on the system tell ZFS "this drive's experiencing failures"
 > rather than ZFS just seeing lots of timed-out I/O 'errors'? (As
 > appears to be the case.)
 >
 > In the end, the failing drive was timing out literally every I/O. I
 > did recover the situation by detaching it from the pool (which hung
 > the machine - probably caused by ZFS having to update the meta-data
 > on all drives, including the failed one). A reboot brought the pool
 > back, minus the 'failed' drive, so enough of the 'detach' must have
 > completed.

Did you try "atacontrol detach" to remove the disk from the bus?

I haven't tried that with ZFS, but gmirror automatically detects when a
disk has gone away, and doesn't try to do anything with it anymore. It
certainly should not hang the machine. After all, what's the purpose of
a RAID when you have to reboot upon drive failure? ;-)

Best regards
   Oliver

-- 
Oliver Fromme, secnetix GmbH & Co. KG, Marktplatz 29, 85567 Grafing b. M.
Handelsregister: Registergericht Muenchen, HRA 74606, Geschäftsfuehrung:
secnetix Verwaltungsgesellsch. mbH, Handelsregister: Registergericht
München, HRB 125758, Geschäftsführer: Maik Bachmann, Olaf Erb, Ralf Gebhart

FreeBSD-Dienstleistungen, -Produkte und mehr: http://www.secnetix.de/bsd

"C++ is over-complicated nonsense. And Bjorn Shoestrap's book a danger
to public health. I tried reading it once, I was in recovery for months."
        -- Cliff Sarginson
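
[Editor's note: a minimal sketch of the workflow discussed above - take
the sick disk out of service in ZFS, then drop it from the ATA bus so no
further commands (and timeouts) ever reach it. The pool name (tank),
device (ad6), channel (ata3) and replacement disk (ad8) are placeholders
for illustration, not values from the thread; check "zpool status" and
"atacontrol list" on the actual system first.]

```shell
# Sketch only -- tank/ad6/ata3/ad8 are assumed names, not real ones.

# 1. Identify the device that is accumulating errors.
zpool status tank

# 2. Tell ZFS to stop issuing I/O to the failing disk.
zpool offline tank ad6

# 3. Find the ATA channel the disk hangs off, then detach that channel,
#    removing the drive from the bus entirely.
atacontrol list
atacontrol detach ata3    # note: this detaches BOTH devices on ata3

# 4. After physically attaching a replacement disk, resilver onto it.
zpool replace tank ad6 ad8
```

Whether this avoids the hang Karl saw on "zpool detach" is untested, per
Oliver's own caveat above.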