Subject: Re: ZFS RaidZ-2 problems
From: Dennis Glatting <freebsd@penx.com>
To: Zaphod Beeblebrox
Cc: freebsd-fs@freebsd.org, Ronald Klop
Date: Wed, 31 Oct 2012 20:11:05 -0700
Message-ID: <1351739465.25936.5.camel@btw.pki2.com>

On Wed, 2012-10-31 at 13:58 -0400, Zaphod Beeblebrox wrote:
> I'd start off by saying "SMART is your friend." Install smartmontools
> and study the somewhat opaque "smartctl -a /dev/mydisk" output
> carefully. Try running a short and/or long self-test, too. Many times
> the disk can tell you what the problem is. If too many blocks are
> being reallocated, your drive is dying. If the drive sees errors in
> the commands it receives, the cable or the controller is at fault.
> ZFS itself does _exceptionally_ well at trying to use what it has.
>
> I'll also say that bad power supplies make for bad disks. Replacing a
> power supply has often been the solution to bad-disk problems I've
> had. Disks are sensitive to undervoltage, and brownouts can
> exacerbate the problem. My parents live out where power is very
> flaky. Cheap UPSes didn't help much ... but a good power supply can
> make all the difference.
>

To be clear, I am unsure whether my problem was the power supply or the
wiring -- it could have been a flaky connector in the strand. I simply
replaced it all. I had a 1,000 W power supply drawing ~400 W at the
intake. Assuming 80% efficiency, the power supply should have had
plenty of oomph left. Regardless, the new power supply was cheap
compared to my frustration. :)

> But I've also had bad controllers of late, too. My most recent
> problem had my 9-disk raidZ1 array lose a disk. Smartctl said that it
> was losing blocks fast, so I RMA'd the disk. When the new disk came,
> the array just wouldn't heal... it kept losing the disks attached to
> a certain controller. Now it's possible the controller was bad before
> the disk died ... or that it died during the first attempt at
> resilvering ... or that the FreeBSD drivers don't like it anymore ...
> I don't know.
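(An aside for anyone else chasing a similar problem: the SMART checks
mentioned above look roughly like the following on my boxes. The device
name is just an example; substitute your own disks.)

  smartctl -a /dev/ada0           # full report; watch Reallocated_Sector_Ct,
                                  # Current_Pending_Sector and UDMA_CRC_Error_Count
  smartctl -t short /dev/ada0     # quick self-test, a couple of minutes
  smartctl -t long /dev/ada0      # full surface scan, can take hours
  smartctl -l selftest /dev/ada0  # read the results once a test finishes

Rising reallocated/pending sector counts usually mean the drive itself
is on the way out, while a climbing CRC error count points more at the
cable or the controller, which matches the advice above.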
> My solution was to get two more 4-drive "pro box" SATA enclosures.
> They use a 1-to-4 SATA breakout, and the 6 motherboard ports I have
> are a revision of the Intel ICH11 chipset that supports SATA port
> multipliers (I already had two of these boxes). In this manner I
> could remove the defective controller and put all of the disks onto
> the motherboard ICH11 (it also allowed me to later expand the
> array... but that's not part of this story).
>
> The upshot was that I now had all the disks present for a raidZ
> array, but tons of errors had occurred while there were not enough
> disks. "zpool status -v" listed hundreds of thousands of files and
> directories that were "bad" or lost. But I'd seen this before and
> started a scrub. The result of the scrub was: perfect recovery.
> Actually... it took a 2nd scrub --- I don't know why. It was happy
> after the 1st scrub, but then some checksum errors were found --- and
> then fixed, so I scrubbed again ... and that fixed it.
>
> How does it do it? Unlike other RAID systems, ZFS can tell a bad
> block from a good one, because every block carries a checksum. When
> it is asked to recover after really bad multiple failures, it can
> tell whether a block is good or not. This means that it can choose
> among alternate or partially recovered versions and get the right
> one. Certainly, my above experience would have been a dead array ...
> or an array with a lot of loss ... had I used any other RAID
> technology.
>
> What does this mean? Well... one thing it means is that for
> non-essential systems (say, my home media array), using cheap
> technology is less risky. None of this is enterprise-level gear, but
> none of it costs anywhere near enterprise-level prices, either.
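For the archives, the check-and-scrub cycle described above boils down
to something like the following. The pool name is just an example:

  zpool status -v tank   # list the files/directories ZFS currently thinks are damaged
  zpool scrub tank       # walk every block, repairing from redundancy where checksums allow
  zpool status -v tank   # re-check once the scrub completes
  zpool clear tank       # reset the error counters when you are satisfied

As noted above, a second scrub sometimes finishes the job if the first
pass still turns up checksum errors.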