From: Duncan Young
To: Brooks Davis
Cc: freebsd-current@freebsd.org
Date: Sat, 12 Jul 2008 18:53:15 +1000
Subject: Re: Boot from ZFS
Message-Id: <200807121853.16083.duncan.young@pobox.com>
In-Reply-To: <20080712045508.GA28756@lor.one-eyed-alien.net>
References: <4877A343.2010602@ibctech.ca> <200807121043.10473.duncan.young@pobox.com> <20080712045508.GA28756@lor.one-eyed-alien.net>
Reply-To: duncan.young@pobox.com
List-Id: Discussions about the use of FreeBSD-current

On Sat, 12 Jul 2008 02:55:08 pm Brooks Davis wrote:
> On Sat, Jul 12, 2008 at 10:43:09AM +1000, Duncan Young wrote:
> > Be careful, I've just had a 6 disk raidz array die.  Complete failure which
> > required restore from backup (the controller card which had access to 4 of the
> > disks lost one disk, then a second, at which point the machine panicked.  Upon
> > reboot the raidz array was useless (metadata corrupted)).  I'm also getting
> > reasonably frequent machine lockups (panics) in the zfs code.  I'm going to
> > start collecting crash dumps to see if anyone can help in the next week or two.
>
> If you look at the research on disk corruption and failure modes both
> in recent proceedings of FAST and the latest issue of ;LOGIN: it's clear
> that any RAID-like scheme that does not tolerate double faults is likely
> to fail.  In theory, zfs should tolerate certain classes of faults
> better than some other technologies, but can't deal with full disk
> double faults unless you use raidz2.

In this case the problem, I believe, was the controller card/software.
Unfortunately, on a home system I can't spread my disks across different
controllers.  The disks themselves are fine (I'm using them right now).
My only issue is that the pool seems to think it's corrupted as soon as
an error occurs on the second disk, even though I would have thought
that the metadata (which I believe is spread across multiple disks)
should not have had a chance to be made irreparable, i.e.
from /var/log/messages:

Jul 10 18:39:27 triple0 root: ZFS: vdev I/O failure, zpool=big path=/dev/da0 offset=500028077056 size=1024 error=22
Jul 10 18:39:27 triple0 root: ZFS: vdev failure, zpool=big type=vdev.open_failed
Jul 10 18:40:14 triple0 kernel: hptrr: start channel [0,0]
Jul 10 18:40:25 triple0 kernel: hptrr: [0 0 ] failed to perform Soft Reset
Jul 10 18:40:25 triple0 kernel: hptrr: [0,0,0] device disconnected on channel
Jul 10 18:40:25 triple0 root: ZFS: vdev I/O failure, zpool=big path=/dev/da1 offset=56768110080 size=512 error=22
Jul 10 20:15:36 triple0 syslogd: kernel boot file is /boot/kernel/kernel

I have had quite a few corruptions from writing to non-parity
USB/firewire drives (which have an unfortunate tendency to "drop out"
partway through a send/receive), but that just requires a scrub and
destroying the corrupted snapshot.  I have never had a problem
importing the pool, until now.

regards
Duncan

>
> > I guess what I'm trying to say is that you can still lose everything on an
> > entire pool, so backups are still essential, and a couple of smaller pools is
> > probably preferable to one big pool (restore time is less).  zfs is not 100%
> > (yet?).  The lack of any type of fsck still causes me concern.
>
> Regardless of the technology, backups are essential.  If you actually value
> your data, off-site backups are essential.
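
The ZFS entries in the /var/log/messages excerpt above share a simple layout: a syslog prefix, an event description, then space-separated key=value fields. As a rough sketch only, the snippet below pulls those fields out so that failures can be grouped per device. The function name parse_zfs_line and the regular expression are illustrative assumptions based on the handful of lines shown here, not an official format description or part of any ZFS tool.

```python
import re

# Assumed layout, inferred only from the log lines quoted in this message:
#   "<Mon> <day> <hh:mm:ss> <host> root: ZFS: <event>, key=value key=value ..."
ZFS_LINE = re.compile(
    r"^(?P<stamp>\w{3}\s+\d+\s+[\d:]{8})\s+(?P<host>\S+)\s+root:\s+ZFS:\s+"
    r"(?P<event>[^,]+),\s*(?P<fields>.*)$"
)


def parse_zfs_line(line):
    """Return a dict for a ZFS syslog line, or None for any other line.

    Hypothetical helper for illustration; field names are taken verbatim
    from the key=value pairs found on the line.
    """
    match = ZFS_LINE.match(line.strip())
    if match is None:
        return None
    record = {
        "stamp": match.group("stamp"),
        "host": match.group("host"),
        "event": match.group("event").strip(),
    }
    for token in match.group("fields").split():
        key, sep, value = token.partition("=")
        if sep:  # keep only well-formed key=value tokens
            record[key] = value
    return record


if __name__ == "__main__":
    sample = [
        "Jul 10 18:39:27 triple0 root: ZFS: vdev I/O failure, zpool=big "
        "path=/dev/da0 offset=500028077056 size=1024 error=22",
        "Jul 10 18:40:25 triple0 kernel: hptrr: [0,0,0] device disconnected on channel",
        "Jul 10 18:40:25 triple0 root: ZFS: vdev I/O failure, zpool=big "
        "path=/dev/da1 offset=56768110080 size=512 error=22",
    ]
    for entry in sample:
        record = parse_zfs_line(entry)
        if record is not None and "path" in record:
            print(record["stamp"], record["path"], record["event"])
```

Lines from other sources, such as the hptrr kernel messages, are skipped rather than guessed at, since their layout differs.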