Date: Fri, 6 Jan 2012 14:43:30 -0800
From: Jeremy Chadwick <freebsd@jdc.parodius.com>
To: freebsd-stable@freebsd.org
Subject: Re: gmirror not synced
Message-ID: <20120106224330.GA26856@icarus.home.lan>
In-Reply-To: <4F0573B2.9070301@infracaninophile.co.uk>
References: <20120104194313.GA2558@lordcow.org> <4F0573B2.9070301@infracaninophile.co.uk>
On Thu, Jan 05, 2012 at 09:56:02AM +0000, Matthew Seaman wrote:
> On 04/01/2012 19:43, Gareth de Vaux wrote:
> > Hi all, I've noticed that the md5 hashes of a couple of files on
> > a gmirror change when I recalculate the hashes. The output usually
> > cycles between 2 hashes per file.
> >
> > I'm guessing this is because each calculation reads the file
> > randomly from 1 of 2 component drives, and the files in question
> > had a few bit flips during their original sync. I also assume
> > this's something you have to live with for gmirror? Is removing
> > and completely rebuilding the secondary drive the only thing you
> > can do (which might fix these bit flips but incur others elsewhere)?
>
> No, that's not something acceptable at all. Randomly flipping bits in
> files is a really nasty failure mode.
>
> What does 'gmirror list' tell you about the state of the gmirror? Is
> there any possibility that your hardware is failing? Check the SMART
> attributes of the disk in the first instance (it isn't brilliant for
> picking up impending failure, but it should be pretty accurate once the
> drive is actually generating errors.) Also try a few passes of
> memtest86 to try and spot problems with RAM. Cleaning dust out of air
> vents and heatsinks and generally making sure the machine is not
> overheating is a good idea too.

Another possibility is a disk with an intermittently faulty cache, or a
drive that has basically given up honouring ECC[1][2] when
reading/writing sectors (firmware bug, design flaw, etc.).

For the former, SMART statistics from the drives could help determine
whether this is the case, but I stress the word "could". Such errors are
usually recorded in Attribute 184 ("End-to-End_Error"), but not many
drives provide it.
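
As a rough sketch of what checking for that attribute looks like, the
snippet below scans the text of a SMART attribute table for ID 184. It
assumes the usual ten-column layout printed by "smartctl -A" (ID#,
ATTRIBUTE_NAME, FLAG, VALUE, WORST, THRESH, TYPE, UPDATED, WHEN_FAILED,
RAW_VALUE); find_attribute is a made-up helper name, not part of
smartmontools.

```python
def find_attribute(smartctl_output, attr_id=184):
    """Return (name, raw_value) for a SMART attribute, or None if absent.

    Assumes the ten-column attribute table layout of "smartctl -A".
    """
    for line in smartctl_output.splitlines():
        fields = line.split()
        # Attribute rows start with the numeric ID and have >= 10 columns;
        # the raw value may itself contain spaces, so join the remainder.
        if len(fields) >= 10 and fields[0] == str(attr_id):
            return fields[1], " ".join(fields[9:])
    return None
```

A result of None only means the drive does not report the attribute; it
says nothing either way about the health of the cache.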

Gareth, please install ports/sysutils/smartmontools (make sure it's
version 5.42 or newer) and provide the output of "smartctl -x /dev/disk"
for each disk in the mirror, and I'll review it for you.
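
In the meantime, to narrow down which files are affected, something
like the following could re-hash a file several times and report the
distinct digests that come back. This is only a sketch: distinct_md5s
is a made-up helper name, and repeated reads may be served from the
buffer cache rather than from the disks, so seeing a single digest does
not prove the two halves of the mirror agree.

```python
import hashlib


def distinct_md5s(path, passes=6, chunk_size=1 << 20):
    """Hash the file at `path` several times; return the set of digests seen."""
    digests = set()
    for _ in range(passes):
        h = hashlib.md5()
        with open(path, "rb") as f:
            # Read in chunks so large files do not need to fit in memory.
            for chunk in iter(lambda: f.read(chunk_size), b""):
                h.update(chunk)
        digests.add(h.hexdigest())
    return digests
```

More than one digest in the returned set means the file's contents
differed between reads, which matches the symptom described.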
[1]: http://www.storagereview.com/guide/error.html
(read all subsections too)
[2]: http://www.dewassoc.com/kbase/hard_drives/hard_disk_sector_structures.htm
--
| Jeremy Chadwick jdc at parodius.com |
| Parodius Networking http://www.parodius.com/ |
| UNIX Systems Administrator Mountain View, CA, US |
| Making life hard for others since 1977. PGP 4BD6C0CB |
