From: Oliver Fromme
To: freebsd-hackers@FreeBSD.ORG, kpielorz_lst@tdx.co.uk
Date: Fri, 12 Sep 2008 17:44:27 +0200 (CEST)
Subject: Re: ZFS w/failing drives - any equivalent of Solaris FMA?

Karl Pielorz wrote:
> Recently, a ZFS pool on my FreeBSD box started showing lots of errors on
> one drive in a mirrored pair.
>
> The pool consists of around 14 drives (as 7 mirrored pairs), hung off a
> couple of SuperMicro 8-port SATA controllers (1 drive of each pair is on
> each controller).
>
> One of the drives started picking up a lot of errors (by the end of things
> it was returning errors for pretty much any reads/writes issued) - and
> taking ages to complete the I/Os.
>
> However, ZFS kept trying to use the drive - e.g. as I attached another
> drive to the remaining 'good' drive in the mirrored pair, ZFS was still
> trying to read data off the failed drive (and the remaining good one) in
> order to complete its resilver to the newly attached drive.
>
> Having posted on the OpenSolaris ZFS list - it appears that under Solaris
> there's an 'FMA Engine' which communicates drive failures and the like to
> ZFS - advising ZFS when a drive should be marked as 'failed'.
>
> Is there anything similar to this on FreeBSD yet? - i.e. does/can anything
> on the system tell ZFS "This drive's experiencing failures" rather than ZFS
> just seeing lots of timed-out I/O 'errors' (as appears to be the case)?
>
> In the end, the failing drive was timing out literally every I/O - I did
> recover the situation by detaching it from the pool (which hung the machine
> - probably caused by ZFS having to update the metadata on all drives,
> including the failed one). A reboot brought the pool back, minus the
> 'failed' drive, so enough of the 'detach' must have completed.

Did you try "atacontrol detach" to remove the disk from the bus?
I haven't tried that with ZFS, but gmirror automatically detects
when a disk has gone away, and doesn't try to do anything with it
anymore. It certainly should not hang the machine. After all,
what's the purpose of a RAID when you have to reboot upon drive
failure? ;-)

Best regards
   Oliver

-- 
Oliver Fromme, secnetix GmbH & Co. KG, Marktplatz 29, 85567 Grafing b. M.
FreeBSD services, products and more: http://www.secnetix.de/bsd
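
For reference, the "atacontrol detach" suggestion above might look roughly
like the sketch below. The pool name "tank" and the channel "ata5" are
made-up placeholders, not values from this thread; the real channel has to
be read off the "atacontrol list" output. By default the script only prints
the commands. As said above, this is untested with ZFS, so treat it as
something to try, not a known fix.

```shell
#!/bin/sh
# Sketch only: POOL and CHANNEL are hypothetical placeholders.
POOL=tank
CHANNEL=ata5

# Dry run by default: each command is echoed instead of executed.
# Run as root with RUN set to the empty string (RUN= sh script.sh)
# to actually execute the commands.
RUN=${RUN-echo}

plan() {
    # Show all ATA channels and attached devices to locate the bad disk.
    $RUN atacontrol list
    # Detach the whole channel, so the failing disk disappears from the bus.
    $RUN atacontrol detach "$CHANNEL"
    # Check how ZFS now sees the affected mirror.
    $RUN zpool status "$POOL"
}

plan
```

Note that "atacontrol detach" works on a channel, not on a single disk, so
any other device on that channel goes away with it.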