Date:      Sat, 19 Apr 2008 11:26:31 +1000
From:      Gary Newcombe <gary@pattersonsoftware.com>
To:        Christopher Cowart <ccowart@rescomp.berkeley.edu>
Cc:        freebsd-questions@freebsd.org
Subject:   Re: gmirror disk fail questions...
Message-ID:  <20080419112631.5e206e35.gary@pattersonsoftware.com>
In-Reply-To: <20080418174004.GE27135@hal.rescomp.berkeley.edu>
References:  <20080418113305.53b72c64.gary@pattersonsoftware.com> <20080418174004.GE27135@hal.rescomp.berkeley.edu>

On Fri, 18 Apr 2008 10:40:04 -0700, Christopher Cowart
<ccowart@rescomp.berkeley.edu> wrote:

> Gary Newcombe wrote:
> [...]
> > # gmirror status
> > 
> > [mesh:/var/log]# gmirror status
> >       Name    Status  Components
> > mirror/gm0  DEGRADED  ad4
> > 
> > 
> > looking in /dev/ however, we have
> > 
> > crw-r-----  1 root  operator    0,  83 17 Apr 13:58 ad4
> > crw-r-----  1 root  operator    0,  91 17 Apr 13:58 ad4s1
> > crw-r-----  1 root  operator    0,  84 17 Apr 13:58 ad6
> > crw-r-----  1 root  operator    0,  92 17 Apr 13:58 ad6a
> > crw-r-----  1 root  operator    0,  99 17 Apr 13:58 ad6as1
> > crw-r-----  1 root  operator    0,  93 17 Apr 13:58 ad6b
> > crw-r-----  1 root  operator    0,  94 17 Apr 13:58 ad6c
> > crw-r-----  1 root  operator    0, 100 17 Apr 13:58 ad6cs1
> > crw-r-----  1 root  operator    0,  95 17 Apr 13:58 ad6d
> > crw-r-----  1 root  operator    0,  96 17 Apr 13:58 ad6e
> > crw-r-----  1 root  operator    0,  97 17 Apr 13:58 ad6f
> > crw-r-----  1 root  operator    0,  98 17 Apr 13:58 ad6s1
> > crw-r-----  1 root  operator    0, 101 17 Apr 13:58 ad6s1a
> > crw-r-----  1 root  operator    0, 102 17 Apr 13:58 ad6s1b
> > crw-r-----  1 root  operator    0, 103 17 Apr 13:58 ad6s1c
> > crw-r-----  1 root  operator    0, 104 17 Apr 13:58 ad6s1d
> > crw-r-----  1 root  operator    0, 105 17 Apr 13:58 ad6s1e
> > crw-r-----  1 root  operator    0, 106 17 Apr 13:58 ad6s1f
> > 
> > I am guessing that a failing disk is responsible for the data
> > corruption, but I have no errors in /var/log/messages or console.log.
> > On every boot, the mirror is marked clean and there are no warnings
> > about a disk failing anywhere. Where should I be looking, or what
> > should I be doing, to get any warnings?
> > 
> > Also, how come, if ad4 is the working disk, ad4's slices seem to be
> > labelled as ad6? What's going on here? To me, ad6 appears to have
> > the correct labelling for the mirror, from ad6s1a-f.
> 
> I believe the kernel hides individual labels for a gmirror volume. The
> labels on ad4 should be visible in /dev/mirror/. Because gmirror really
> just mirrors the data block by block (with a little bit of metadata at
> the very end of the drive), once the drive is no longer a member of an
> array, the kernel treats it as an individual drive and allows visibility
> of all the labels.

OK, so not to worry about the slices.

> 
> > How can I test for sure whether the disk is damaged or dying, or
> > whether this is just a temporary glitch in the mirror? This is the
> > first time I've had a gmirror raid give me problems.
> 
> The first time a drive gets kicked out, I typically try to re-insert it.
> We have monitoring, so we receive notifications if it fails again. After
> that, I get the vendor to replace it. 
> 
> > Assuming ad6 has been deactivated/disconnected, I was thinking of
> > trying:
> > 
> > gmirror activate gm0 ad6
> > gmirror rebuild gm0 ad6
> > 
> > Is this safe?
> 
> You have to kick ad6 out and re-insert it:
> # gmirror forget gm0
> # gmirror insert gm0 /dev/ad6
> 
> After doing that, I would watch closely for a while in case your drive
> is actually failing. I've written a small nagios check for gmirror; let
> me know if you'd like me to send it (it could easily be adapted to a
> cron job). You can also get `gmirror status' output in your dailies by
> adding daily_status_gmirror_enable="YES" to /etc/periodic.conf.

I've since added the gmirror entry to periodic.conf, but your script
sounds ideal. I would like that, thanks. I would much rather get some
warning about this happening as it does appear to have caused some data
corruption.
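
In the meantime, a cron-friendly check along these lines might do as a stopgap. This is only a minimal sketch, not Chris's nagios check: the function name check_gmirror_status is made up for illustration, and it assumes `gmirror status' keeps the column layout shown earlier in this thread (mirror name in the first column, status in the second).

```shell
#!/bin/sh
# Hypothetical helper (not part of gmirror): reads `gmirror status` output
# on stdin and prints a warning for every mirror whose Status column is
# not COMPLETE. Returns 0 if all mirrors are COMPLETE, 1 otherwise.
check_gmirror_status() {
    awk '
        # Mirror lines start with "mirror/"; the header line and the
        # continuation lines that list extra components are skipped.
        $1 ~ /^mirror\// && $2 != "COMPLETE" {
            print "WARNING: " $1 " is " $2
            bad = 1
        }
        END { exit (bad ? 1 : 0) }
    '
}

# Example cron usage (assumes gmirror is in PATH; cron mails any output):
#   gmirror status | check_gmirror_status
```

Run from cron, it stays silent while the mirror is COMPLETE and produces a line of output (which cron will mail) as soon as it goes DEGRADED.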

> 
> But, given it's timing out on boot, I would personally bag the drive and
> replace it. You'll still need to run the same 2 commands above.

[mesh:/dev/mirror]# gmirror forget
Missing device(s).

[mesh:/dev/mirror]# gmirror status
      Name    Status  Components
mirror/gm0  DEGRADED  ad4

[mesh:/dev/mirror]# gmirror insert gm0 /dev/ad6
Not all disks connected.
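
A note on the two errors above: as far as I can tell from gmirror(8), `forget' takes the mirror name as an argument, and the `gmirror forget' above was run without one, so "Missing device(s)." may be about the missing argument rather than the disk. The insert would then fail with "Not all disks connected." because the mirror still remembers the absent component. The sketch below shows the sequence with the name filled in. The reinsert_component wrapper and the RUN variable are hypothetical, added only so the commands can be previewed before running them as root; none of this has been tested on this machine, and it does not change the advice to replace a drive that times out on boot, since the same two commands are needed after swapping the disk.

```shell
#!/bin/sh
# Hypothetical wrapper (not part of gmirror): prints the commands by
# default; set RUN=1 in the environment to actually execute them as root.
reinsert_component() {
    mirror=$1      # mirror name, e.g. gm0
    provider=$2    # disk to re-insert, e.g. /dev/ad6
    for cmd in \
        "gmirror forget $mirror" \
        "gmirror insert $mirror $provider" \
        "gmirror status $mirror"
    do
        if [ "${RUN:-0}" = 1 ]; then
            # Stop at the first failing command.
            $cmd || return 1
        else
            echo "$cmd"
        fi
    done
}

# Preview only:   reinsert_component gm0 /dev/ad6
# Really run it:  RUN=1 reinsert_component gm0 /dev/ad6
```
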

Looks like it is new disk time then after all.
Thanks for your advice.

Gary

> 
> -- 
> Chris Cowart
> Network Technical Lead
> Network & Infrastructure Services, RSSP-IT
> UC Berkeley
> 