From: Boris Kochergin
Date: Wed, 12 Jan 2011 19:50:41 -0500
To: Chris Forgeron
Cc: freebsd-stable
Subject: Re: ZFS - hot spares : automatic or not?
List-Id: Production branch of FreeBSD source code

On 01/12/11 19:32, Chris Forgeron wrote:
> Interesting, I was just testing Solaris 11 Express's ability to handle a pulled drive today. It handles it quite well. However, my Areca 1880 drive (arcmsr0) crashes when you reinsert the drive.. but that's another topic, and an issue for Areca tech support..
>
> ..back to the point:
>
> Solaris runs a separate process called Fault Management Daemon (fmd) that looks to handle this logic - This means that it's really not inside the ZFS code to handle this, and FreeBSD would need something similar, hopefully less kludgy than a user script.
>
> I wonder if anyone has been eyeing the fma code in the cddl with a thought for porting it - It looks to be a really neat bit of code - I'm still quite new with it, having only been working with Solaris the last few months.
>
> Here's two links to a bit of info on the Solaris daemon:
>
> http://www.princeton.edu/~unix/Solaris/troubleshoot/fm.html
> http://hub.opensolaris.org/bin/view/Community+Group+fm/
>
>
> Here's my log of the event in Solaris 11 Express:
>
> Jan 12 21:28:47 solaris fmd: [ID 377184 daemon.error] SUNW-MSG-ID: ZFS-8000-FD, TYPE: Fault, VER: 1, SEVERITY: Major
> Jan 12 21:28:47 solaris EVENT-TIME: Wed Jan 12 21:28:47 UTC 2011
> Jan 12 21:28:47 solaris PLATFORM: PowerEdge-T710, CSN: 39SLQN1, HOSTNAME: solaris
> Jan 12 21:28:47 solaris SOURCE: zfs-diagnosis, REV: 1.0
> Jan 12 21:28:47 solaris EVENT-ID: ccfa7a23-838b-ebc8-decf-c2607afb390d
> Jan 12 21:28:47 solaris DESC: The number of I/O errors associated with a ZFS device exceeded
> Jan 12 21:28:47 solaris acceptable levels. Refer to http://sun.com/msg/ZFS-8000-FD for more information.
> Jan 12 21:28:47 solaris AUTO-RESPONSE: The device has been offlined and marked as faulted. An attempt
> Jan 12 21:28:47 solaris will be made to activate a hot spare if available.
> Jan 12 21:28:47 solaris IMPACT: Fault tolerance of the pool may be compromised.
> Jan 12 21:28:47 solaris REC-ACTION: Run 'zpool status -x' and replace the bad device.

After a cursory glance at their fault-management infrastructure, I noticed that it also deals with other kinds of stuff like CPU and memory problems, which might make a port painful or impractical.

Would the people with custom hot-spare scripts, or nothing automated at all, be content if the sysutils/geomWatch program grew support for hot spares in a future version? I already became somewhat familiar with the userland ZFS API when I added ZFS support to geomWatch.

-Boris
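
For anyone wondering what the "user script" approach discussed in this thread amounts to, here is a minimal sketch in Python. It is not taken from geomWatch or from fmd. The function name plan_spare_replacements, the set of states treated as failed, and the assumed layout of the `zpool status` text are all illustrative assumptions. It only builds `zpool replace <pool> <failed> <spare>` command lines and does not run them.

```python
# Hypothetical sketch: pair failed devices with available hot spares.
# Assumes the usual `zpool status <pool>` layout: a "config:" section with
# NAME/STATE columns, an optional "spares" subsection, then "errors:".

# States this sketch treats as "needs a spare" (an assumption, not a full list).
BAD_STATES = {"FAULTED", "UNAVAIL", "REMOVED"}

# Grouping vdevs (not physical devices) that should never be replaced directly.
VDEV_PREFIXES = ("mirror", "raidz", "spare", "replacing")


def plan_spare_replacements(pool, status_text):
    """Return a list of argv lists, one `zpool replace` per failed device."""
    failed, spares = [], []
    in_config = False
    in_spares = False
    for line in status_text.splitlines():
        fields = line.split()
        if not fields:
            continue
        if fields[0] == "config:":
            in_config = True
            continue
        if fields[0] == "errors:":
            in_config = False
            continue
        if not in_config:
            continue
        if fields == ["spares"]:
            # Everything after this header lists hot spares, not pool members.
            in_spares = True
            continue
        if len(fields) < 2:
            continue
        name, state = fields[0], fields[1]
        if in_spares:
            if state == "AVAIL":
                spares.append(name)
        elif (state in BAD_STATES and name != pool
              and not name.startswith(VDEV_PREFIXES)):
            failed.append(name)
    # Pair each failed device with one unused spare; extras on either side
    # are left alone.
    return [["zpool", "replace", pool, bad, spare]
            for bad, spare in zip(failed, spares)]
```

A caller would feed it the text of `zpool status tank` (for example from subprocess) and decide whether to execute or just log the resulting commands. Parsing text output is exactly the kludge mentioned earlier in the thread; a real implementation would more likely use the userland ZFS API instead.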