From: Gary Newcombe
Organization: Patterson Software
To: Christopher Cowart
Cc: freebsd-questions@freebsd.org
Date: Sat, 19 Apr 2008 11:26:31 +1000
Subject: Re: gmirror disk fail questions...
Message-Id: <20080419112631.5e206e35.gary@pattersonsoftware.com>
In-Reply-To: <20080418174004.GE27135@hal.rescomp.berkeley.edu>
References: <20080418113305.53b72c64.gary@pattersonsoftware.com>
 <20080418174004.GE27135@hal.rescomp.berkeley.edu>

On Fri, 18 Apr 2008 10:40:04 -0700, Christopher Cowart wrote:

> Gary Newcombe wrote:
> [...]
> > # gmirror status
> >
> > [mesh:/var/log]# gmirror status
> >       Name    Status  Components
> > mirror/gm0  DEGRADED  ad4
> >
> >
> > looking in /dev/ however, we have
> >
> > crw-r-----  1 root  operator    0,  83 17 Apr 13:58 ad4
> > crw-r-----  1 root  operator    0,  91 17 Apr 13:58 ad4s1
> > crw-r-----  1 root  operator    0,  84 17 Apr 13:58 ad6
> > crw-r-----  1 root  operator    0,  92 17 Apr 13:58 ad6a
> > crw-r-----  1 root  operator    0,  99 17 Apr 13:58 ad6as1
> > crw-r-----  1 root  operator    0,  93 17 Apr 13:58 ad6b
> > crw-r-----  1 root  operator    0,  94 17 Apr 13:58 ad6c
> > crw-r-----  1 root  operator    0, 100 17 Apr 13:58 ad6cs1
> > crw-r-----  1 root  operator    0,  95 17 Apr 13:58 ad6d
> > crw-r-----  1 root  operator    0,  96 17 Apr 13:58 ad6e
> > crw-r-----  1 root  operator    0,  97 17 Apr 13:58 ad6f
> > crw-r-----  1 root  operator    0,  98 17 Apr 13:58 ad6s1
> > crw-r-----  1 root  operator    0, 101 17 Apr 13:58 ad6s1a
> > crw-r-----  1 root  operator    0, 102 17 Apr 13:58 ad6s1b
> > crw-r-----  1 root  operator    0, 103 17 Apr 13:58 ad6s1c
> > crw-r-----  1 root  operator    0, 104 17 Apr 13:58 ad6s1d
> > crw-r-----  1 root  operator    0, 105 17 Apr 13:58 ad6s1e
> > crw-r-----  1 root  operator    0, 106 17 Apr 13:58 ad6s1f
> >
> > I am guessing that a failing disk is responsible for the data
> > corruption, but I have no errors in /var/log/messages or console.log.
> > On every boot, the mirror is marked clean and there are no warnings
> > about a disk failing anywhere. Where should I be looking, or what
> > should I be doing, to get any warnings?
> >
> > Also, how come, if ad4 is the working disk, ad4's slices seem to be
> > labelled as ad6? What's going on here? To me, ad6 appears to have
> > correct labelling for the mirror from ad6s1a-f
>
> I believe the kernel hides individual labels for a gmirror volume. The
> labels on ad4 should be visible in /dev/mirror/. Because gmirror really
> just mirrors the data block by block (with a little bit of metadata at
> the very end of the drive), once the drive is no longer a member of an
> array, the kernel treats it as an individual drive and allows visibility
> of all the labels.

OK, so not to worry about the slices.

>
> > How can I test for sure whether the disk is damaged or dying, or
> > whether this is just a temporary glitch in the mirror? This is the
> > first time I've had a gmirror raid give me problems.
>
> The first time a drive gets kicked out, I typically try to re-insert it.
> We have monitoring, so we receive notifications if it fails again. After
> that, I get the vendor to replace it.
>
> > Assuming ad6 has been deactivated/disconnected, I was thinking of
> > trying:
> >
> > gmirror activate gm0 ad6
> > gmirror rebuild gm0 ad6
> >
> > Is this safe?
>
> You have to kick ad6 out and re-insert it:
> # gmirror forget
> # gmirror insert gm0 /dev/ad6
>
> After doing that, I would watch closely for a while in case your drive
> is actually failing. I've written a small nagios check for gmirror; let
> me know if you'd like me to send it (it could easily be adapted to a
> cron job). You can also get `gmirror status' output in your dailies by
> adding daily_status_gmirror_enable="YES" to /etc/periodic.conf.

I've since added the gmirror entry to periodic.conf, but your script
sounds ideal. I would like that, thanks.
I would much rather get some warning about this happening as it does
appear to have caused some data corruption.

>
> But, given it's timing out on boot, I would personally bag the drive and
> replace it. You'll still need to run the same 2 commands above.

[mesh:/dev/mirror]# gmirror forget
Missing device(s).
[mesh:/dev/mirror]# gmirror status
      Name    Status  Components
mirror/gm0  DEGRADED  ad4
[mesh:/dev/mirror]# gmirror insert gm0 /dev/ad6
Not all disks connected.

Looks like it is new disk time then after all. Thanks for your advice.

Gary

>
> --
> Chris Cowart
> Network Technical Lead
> Network & Infrastructure Services, RSSP-IT
> UC Berkeley
>
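
For anyone wanting the sort of check discussed in this thread, here is a minimal sketch of a Nagios-style test built on `gmirror status`. It is not Chris's script, which was not posted in this message. The function names (`parse_status`, `evaluate`, `main`) are hypothetical, and it assumes that a healthy mirror reports the status COMPLETE and that mirror rows begin with "mirror/", as in the output quoted above. Exit codes follow the usual Nagios plugin convention (0 OK, 2 CRITICAL, 3 UNKNOWN).

```python
import subprocess

# Nagios plugin exit codes.
OK, CRITICAL, UNKNOWN = 0, 2, 3


def parse_status(text):
    """Return {mirror_name: status} parsed from `gmirror status` output."""
    mirrors = {}
    for line in text.splitlines():
        fields = line.split()
        # Mirror rows start with "mirror/<name>". The header row and any
        # continuation rows listing extra components are skipped.
        if len(fields) >= 2 and fields[0].startswith("mirror/"):
            mirrors[fields[0]] = fields[1]
    return mirrors


def evaluate(mirrors):
    """Map parsed mirror states to an (exit_code, message) pair."""
    if not mirrors:
        return UNKNOWN, "UNKNOWN: no gmirror devices found"
    # Assumption: anything other than COMPLETE deserves attention.
    bad = {name: state for name, state in mirrors.items() if state != "COMPLETE"}
    if bad:
        detail = ", ".join(f"{name} is {state}" for name, state in sorted(bad.items()))
        return CRITICAL, "CRITICAL: " + detail
    return OK, "OK: " + ", ".join(sorted(mirrors)) + " COMPLETE"


def main():
    """Run `gmirror status`, print a one-line summary, return the exit code."""
    try:
        result = subprocess.run(
            ["gmirror", "status"], capture_output=True, text=True, check=False
        )
    except OSError as exc:
        print(f"UNKNOWN: cannot run gmirror: {exc}")
        return UNKNOWN
    code, message = evaluate(parse_status(result.stdout))
    print(message)
    return code
```

To use it from Nagios or cron, call `sys.exit(main())` at the end of the script. Fed the status output shown earlier in this message, `evaluate` reports mirror/gm0 as CRITICAL because its status is DEGRADED.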