Date:      Wed, 31 Oct 2012 20:11:05 -0700
From:      Dennis Glatting <freebsd@penx.com>
To:        Zaphod Beeblebrox <zbeeble@gmail.com>
Cc:        freebsd-fs@freebsd.org, Ronald Klop <ronald-freebsd8@klop.yi.org>
Subject:   Re: ZFS RaidZ-2 problems
Message-ID:  <1351739465.25936.5.camel@btw.pki2.com>
In-Reply-To: <CACpH0MeJpSg3ti-QUgT=XwaC0jkEo5JeBAfRGPTFfUE6eLJFJg@mail.gmail.com>
References:  <508F98F9.3040604@fletchermoorland.co.uk> <1351598684.88435.19.camel@btw.pki2.com> <508FE643.4090107@fletchermoorland.co.uk> <op.wmz1vtrd8527sy@ronaldradial.versatec.local> <5090010A.4050109@fletchermoorland.co.uk> <op.wm1axoqv8527sy@ronaldradial.versatec.local> <CACpH0MeJpSg3ti-QUgT=XwaC0jkEo5JeBAfRGPTFfUE6eLJFJg@mail.gmail.com>

On Wed, 2012-10-31 at 13:58 -0400, Zaphod Beeblebrox wrote:
> I'd start off by saying "smart is your friend."  Install smartmontools
> and study the somewhat opaque "smartctl -a /dev/mydisk" output
> carefully.  Try running a short and/or long test, too.  Many times the
> disk can tell you what the problem is.  If too many blocks are being
> replaced, your drive is dying.  If the drive sees errors in commands
> it receives, the cable or the controller is at fault.   ZFS itself
> does _exceptionally_ well at trying to use what it has.
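To make that concrete, here is roughly what that workflow looks like. The device name /dev/ada0 and the sample attribute line below are illustrative, not taken from a real report:

```shell
#!/bin/sh
# Typical smartmontools workflow (device name is an example):
#   smartctl -a /dev/ada0           # full SMART report
#   smartctl -t short /dev/ada0     # quick self-test (~2 min)
#   smartctl -t long /dev/ada0      # full-surface self-test
#   smartctl -l selftest /dev/ada0  # read back self-test results

# The attribute to watch for a dying drive is Reallocated_Sector_Ct.
# A hypothetical sample line from "smartctl -a" output:
sample='  5 Reallocated_Sector_Ct   0x0033   100   100   036    Pre-fail  Always       -       12'
# Field 10 is the raw count of remapped sectors; a number that keeps
# climbing between reports means the drive is burning through spares.
echo "$sample" | awk '{ print $10 }'
```

A raw reallocated-sector count that rises from one report to the next is the "too many blocks being replaced" case described above.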
> 
> I'll also say that bad power supplies make for bad disks.  Replacing a
> power supply has often been the solution to bad disk problems I've
> had.  Disks are sensitive to under voltage problems.  Brown-outs can
> exacerbate this problem.  My parents live out where power is very
> flaky.  Cheap UPSs didn't help much ... but a good power supply can
> make all the difference.
> 

To be clear, I am unsure whether my problem was the power supply or the
wiring -- it could have been a flaky connector in the strand. I simply
replaced it all.

I had a 1,000W power supply drawing ~400W on the intake. Assuming 80%
efficiency, it was only delivering ~320W, so the supply should have had
plenty of oomph left. Regardless, the new power supply was cheap
compared to my frustration. :)
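Spelling out the arithmetic, with the numbers from above:

```shell
# A supply drawing ~400 W at the wall and running at ~80% efficiency
# delivers roughly 400 * 0.80 = 320 W of DC, less than a third of its
# 1000 W rating, so raw capacity was not the problem.
echo $(( 400 * 80 / 100 ))
```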


> But I've also had bad controllers of late, too.  My most recent
> problem had my 9-disk raidZ1 array lose a disk.  Smartctl said that
> it was losing blocks fast, so I RMA'd the disk.  When the new disk
> came, the array just wouldn't heal... it kept losing the disks
> attached to a certain controller.  Now it's possible the controller
> was bad before the disk had died ... or that it died during the first
> attempt at resilver ... or that FreeBSD drivers don't like it anymore
> ... I don't know.
> 
> My solution was to get two more 4 drive "pro box" SATA enclosures.
> They use a 1-to-4 SATA breakout and the 6 motherboard ports I have are
> a revision of the Intel ICH11 chipset that supports SATA port
> multipliers (I already had two of these boxes).  In this manner I
> could remove the defective controller and put all disks onto the
> motherboard ICH11 (it actually also allowed me to later expand the
> array... but that's not part of this story).
> 
> The upshot was that I now had all the disks present for a raidZ array,
> but tons of errors had occurred when there were not enough disks.
>  zpool status -v listed hundreds of thousands of files and directories
> that were "bad" or lost.  But I'd seen this before and started a
> scrub.  The result of the scrub was: perfect recovery.  Actually... it
> took a 2nd scrub --- I don't know why.  It was happy after the 1st
> scrub, but then some checksum errors were found --- and then fixed, so
> I scrubbed again ... and that fixed it.
> 
> How does it do it?  Unlike other RAID systems, ZFS checksums every
> block, so it can tell a bad block from a good one.  When it is asked
> to recover after really bad multiple failures, it can verify each
> candidate copy against its stored checksum.  This means it can choose
> among alternate or partially recovered versions and get the right
> one.  Certainly, my above experience would
> have been a dead array ... or an array with much loss if I had used
> any other RAID technology.
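The principle can be demonstrated with any checksum tool. A toy sketch, with cksum standing in for ZFS's block checksums and made-up file names:

```shell
#!/bin/sh
# Keep a checksum of the data separately (ZFS stores it in the parent
# block pointer), then use it to tell a good copy from a silently
# corrupted one -- the corruption itself raises no I/O error.
printf 'important data\n' > copy1
printf 'important dat4\n' > copy2          # bit-rotted replica
good_sum=$(printf 'important data\n' | cksum)
for f in copy1 copy2; do
    if [ "$(cksum < "$f")" = "$good_sum" ]; then
        echo "$f: checksum OK, use this copy"
    else
        echo "$f: checksum mismatch, discard"
    fi
done
rm -f copy1 copy2
```

A plain mirror or RAID controller has no such side channel: faced with two disagreeing copies, it can only guess.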
> 
> What does this mean?  Well... one thing it means is that for
> non-essential systems (say my home media array), using cheap
> technology is less risky.  None of this is enterprise-level
> technology, but none of it costs anywhere near what enterprise-level
> gear does, either.
> _______________________________________________
> freebsd-fs@freebsd.org mailing list
> http://lists.freebsd.org/mailman/listinfo/freebsd-fs
> To unsubscribe, send any mail to "freebsd-fs-unsubscribe@freebsd.org"





Want to link to this message? Use this URL: <https://mail-archive.FreeBSD.org/cgi/mid.cgi?1351739465.25936.5.camel>