Date:      Tue, 31 Oct 2006 15:52:41 -0600
From:      "Rick C. Petty" <rick-freebsd@kiwi-computer.com>
Cc:        freebsd-geom@freebsd.org
Subject:   Re: burnt again by gmirror
Message-ID:  <20061031215241.GA57997@keira.kiwi-computer.com>
In-Reply-To: <20061031205857.GA15861@garage.freebsd.pl>
References:  <20061031195442.GA55478@keira.kiwi-computer.com> <4547AD9B.5050503@centtech.com> <20061031204659.GA56766@keira.kiwi-computer.com> <20061031205857.GA15861@garage.freebsd.pl>

On Tue, Oct 31, 2006 at 09:58:57PM +0100, Pawel Jakub Dawidek wrote:
> On Tue, Oct 31, 2006 at 02:46:59PM -0600, Rick C. Petty wrote:
> > 
> > Still, I'm curious why/how ad8's metadata could have been clobbered.
> > gmirror is the only one who would write to it, the filesystem is mounted
> > from gm0* -- kinda scary.  I guess the lesson here is to use simple gmirror
> > configurations in case the metadata gets clobbered.
> 
> gmirror told you that it thinks ad8 is broken and skipped it.

If it's marked as broken/disconnected/whatever, why is it removed from the
list ("gmirror list")?  Surely it would be useful to state which pieces are
broken.  I'm thinking that printing "State: BROKEN" or something similar
would do the trick.  At least some of this data is available to gmirror, as
stored on the other providers.  Perhaps the provider name isn't, because
gmirror couldn't find the provider (for whatever reason).  Is that why it's
not listed?  If so, I still think it should be, perhaps with
"Name: UNKNOWN".  As it is, there is no way of knowing what's missing, or
even of getting clues to what could be missing.
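
To illustrate the kind of check that would help here, the following is a
minimal sketch, not anything gmirror itself provides.  It assumes the
listing reports the configured component count in a "Components: N" line
and numbers each attached consumer as "1. Name: ...".  The sample text and
the function name missing_components are hypothetical, made up for the
example; the real output should be checked before relying on it.

```python
import re

# Hypothetical sample, loosely modeled on "gmirror list" output.
# The exact field layout is an assumption, not copied from a real system.
SAMPLE = """\
Geom name: gm0
State: DEGRADED
Components: 2
Providers:
1. Name: mirror/gm0
Consumers:
1. Name: ad4
   State: ACTIVE
"""


def missing_components(listing):
    """Return {geom name: configured components not listed as consumers}."""
    result = {}
    geom = None
    expected = 0
    seen = 0
    in_consumers = False

    for raw in listing.splitlines():
        line = raw.strip()
        if line.startswith("Geom name:"):
            # A new geom starts: record the totals for the previous one.
            if geom is not None:
                result[geom] = max(expected - seen, 0)
            geom = line.split(":", 1)[1].strip()
            expected = 0
            seen = 0
            in_consumers = False
        elif line.startswith("Components:"):
            expected = int(line.split(":", 1)[1])
        elif line == "Consumers:":
            in_consumers = True
        elif line == "Providers:":
            in_consumers = False
        elif in_consumers and re.match(r"\d+\. Name:", line):
            # Count only numbered entries in the Consumers section.
            seen += 1

    if geom is not None:
        result[geom] = max(expected - seen, 0)
    return result


if __name__ == "__main__":
    for name, count in missing_components(SAMPLE).items():
        print(f"{name}: {count} component(s) not listed")
```

This only reports how many components are unaccounted for.  It cannot name
the missing provider, which is exactly the information the listing lacks.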

> If an error is discovered on a mirror's component, it is marked as broken
> and disconnected so it doesn't cause further problems. For example, one
> disk problem is that it doesn't complete I/O requests and gmirror needs to
> wait for ATA timeouts, which will make the whole system unresponsive.
> If a component was disconnected, it means something was wrong with it and
> it needs manual intervention and investigation.

Or did something just go horribly wrong with gmirror?  The disks are fine,
relatively new (5600 hours runtime), tested under load pretty thoroughly,
and smartctl is showing no errors or other anomalies.

> I'm sure your logs would tell you.

Perhaps you didn't read in my original post where I stated that
/var/run/dmesg.boot was empty:

# ls -la /var/run/dmesg.boot
-rw-r--r--  1 root  wheel  0 Oct 25 14:42 /var/run/dmesg.boot

and that the message buffer had overflowed (quite visible from
/var/log/messages), due to overwhelming fsck errors.  I would love to
know the wonderful error message which was printed, to give me an idea
why gmirror dropped the disk for seemingly no good reason.

I wish either fsck was less noisy or that the kernel would take a
snapshot of the msgbuf right before starting init, so only the kernel
messages would get copied into /var/run/ when the rc scripts do their
thing...

-- Rick C. Petty


