From owner-freebsd-hackers@FreeBSD.ORG Fri Sep 12 14:37:05 2008
Date: Fri, 12 Sep 2008 15:34:30 +0100
From: Karl Pielorz
To: Jeremy Chadwick
Cc: freebsd-hackers@freebsd.org
Subject: Re: ZFS w/failing drives - any equivalent of Solaris FMA?
Message-ID: <3BE629D093001F6BA2C6791C@Slim64.dmpriest.net.uk>
In-Reply-To: <20080912132102.GB56923@icarus.home.lan>
References: <20080912132102.GB56923@icarus.home.lan>
List-Id: Technical Discussions relating to FreeBSD

--On 12 September 2008 06:21 -0700 Jeremy Chadwick wrote:

> As far as I know, there is no such "standard" mechanism in FreeBSD. If
> the drive falls off the bus entirely (e.g. detached), I would hope ZFS
> would notice that. I can imagine it (might) also depend on if the disk
> subsystem you're using is utilising CAM or not (e.g. disks should be daX
> not adX); Scott Long might know if something like this is implemented in
> CAM.
> I'm fairly certain nothing like this is implemented in ata(4).

For ATA, at the moment, I don't think it'll notice even if a drive detaches. As with my system the other day, I think it'll just keep issuing I/O commands to the drive even after it has disappeared (though the failures might come back much more quickly if the device has 'gone' to the point where FreeBSD immediately returns 'fail' for every request).

> Ideally, it would be the job of the controller and controller driver to
> announce to underlying I/O operations fail/success. Do you agree?
>
> I hope this "FMA Engine" on Solaris only *tells* underlying pieces of
> I/O errors, rather than acting on them (e.g. automatically yanking the
> disk off the bus for you). I'm in no way shunning Solaris, I'm simply
> saying such a mechanism could be as risky/deadly as it could be useful.

Yeah, I guess so. The way it's meant to happen, I think (and this is only AFAIK), is that FMA 'detects' a failing drive by applying some configurable policy to it. That policy would also include notifying ZFS, so that ZFS could then decide to stop issuing I/O commands to that device.

None of this seems to be in place, at least for ATA under FreeBSD: when a drive goes bad, you can end up with hours' worth of I/O timeouts until someone intervenes.

I did enquire on the OpenSolaris list about setting limits for 'errors' in ZFS, which netted me a reply that it's FMA (at least in Solaris) that's responsible for this; it then just informs ZFS of the condition. We don't appear (again, at least for ATA) to have anything similar in FreeBSD yet :(

-Kp
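
To make the "configurable policy" idea above a bit more concrete, here is a rough sketch of what such a policy could look like: count I/O errors per device inside a sliding time window, and once a threshold is crossed, mark the device as faulted and notify whoever is interested (in the Solaris picture, that would be ZFS). This is only an illustration of the general idea, not how FMA or ZFS actually implement it; the class name, the thresholds, the `on_fault` callback and the device name "ad4" are all made up for the example.

```python
from collections import deque


class DeviceErrorPolicy:
    """Hypothetical threshold policy: fault a device after too many
    I/O errors within a sliding time window. Not FMA's real logic."""

    def __init__(self, max_errors=5, window_seconds=60.0, on_fault=None):
        self.max_errors = max_errors          # errors tolerated in the window
        self.window_seconds = window_seconds  # length of the sliding window
        self.on_fault = on_fault              # callback, e.g. "tell ZFS"
        self._errors = {}                     # device -> deque of timestamps
        self._faulted = set()

    def record_error(self, device, now):
        """Record one I/O error at time `now` (seconds).
        Returns True if the device is (now) considered faulted."""
        if device in self._faulted:
            return True
        times = self._errors.setdefault(device, deque())
        times.append(now)
        # Drop errors that have aged out of the window.
        while times and now - times[0] > self.window_seconds:
            times.popleft()
        if len(times) >= self.max_errors:
            self._faulted.add(device)
            if self.on_fault is not None:
                self.on_fault(device)  # notify once, when the fault is declared
            return True
        return False

    def is_faulted(self, device):
        return device in self._faulted


if __name__ == "__main__":
    notified = []
    policy = DeviceErrorPolicy(max_errors=3, window_seconds=10.0,
                               on_fault=notified.append)
    for t in (0.0, 1.0, 20.0, 21.0, 22.0):
        policy.record_error("ad4", t)
    print(policy.is_faulted("ad4"), notified)
```

The point of the callback is the separation Jeremy was asking about: the policy only *tells* the consumer that the device looks bad, and the consumer decides whether to stop issuing I/O to it.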