From: Chris BeHanna <chris@behanna.org>
Date: Wed, 11 Jul 2012 11:16:22 -0500
To: freebsd-fs@freebsd.org
Subject: Re: vdev/pool math with combined raidzX vdevs...

On Jul 11, 2012, at 10:32, Jason Usher wrote:

> Since (I think) a lot of raidz3 adoption is due to folks desiring "some overkill" as they attempt to overcome the "disks got really big but didn't get any faster (for rebuilds)"[1] ... but they are losing some of that by combining vdevs in a single pool.
>
> Not losing so much that they're back down to the failure rate of a single raidz*2* vdev, but they're not at the overkill level they thought they were at either.
>
> I think that's important, or at least worth noting...
>
> [1] http://storagegaga.com/4tb-disks-the-end-of-raid/

That, and unrecoverable read errors (UREs) during reconstruction, are indeed the problem. Gibson et al. have gone on to object storage to get around this--RAID is done over the individual stored objects, rather than over the volume itself. If you need to reconstruct, you can reconstruct both on demand and lazily in the background (i.e., you start reconstructing the objects in a volume, and if a user attempts to access an as-yet-unreconstructed object, that object gets inserted at the head of the queue).
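A minimal sketch of that lazy-background-plus-on-demand scheme, in Python, purely for illustration -- the class, the names, and the queue discipline here are my own invention, not any particular file system's implementation:

import heapq
import itertools

class ReconstructionQueue:
    PRIORITY_ON_DEMAND = 0    # a client is waiting on this object
    PRIORITY_BACKGROUND = 1   # ordinary lazy rebuild

    def __init__(self, object_ids):
        ids = list(object_ids)
        self._counter = itertools.count()   # tie-breaker keeps FIFO order
        self._pending = set(ids)            # objects still needing rebuild
        self._heap = [(self.PRIORITY_BACKGROUND, next(self._counter), oid)
                      for oid in ids]
        heapq.heapify(self._heap)

    def promote(self, object_id):
        # A read hit an object that isn't rebuilt yet: queue it again at
        # on-demand priority; the stale background entry is skipped later.
        if object_id in self._pending:
            heapq.heappush(self._heap, (self.PRIORITY_ON_DEMAND,
                                        next(self._counter), object_id))

    def next_object(self):
        # Next object to reconstruct, or None when the volume is done.
        while self._heap:
            _prio, _seq, oid = heapq.heappop(self._heap)
            if oid in self._pending:        # skip duplicates left by promote()
                self._pending.remove(oid)
                return oid
        return None

def rebuild_volume(queue, rebuild_object):
    # rebuild_object() stands in for the per-object RAID repair.
    while (oid := queue.next_object()) is not None:
        rebuild_object(oid)

In real life the read path would call promote() and then block the client until its object is repaired; the unique counter just keeps the background pass in its original order.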
There aren't, however, to my knowledge, any good-enough-to-use-at-work-without-hiring-a-pet-kernel-hacker object-based file systems available for free[1]. CMU PDL did RAIDframe, but that was a proof of concept and had not been bulletproofed and optimized (though many of the concepts there found their way into Panasas's PanFS).

In the absence of a ready-to-go (or at least ready-to-assemble) object-based solution, ZFS is the next best thing. You at least can get some warning from the parity scrub that objects are corrupted, and can have some duplicates lying around to recover (a couple of example commands follow the footnotes below). That said, you're going to want to keep your failure domains fairly small, if you can, owing to the time-to-reconstruct and the inevitability of UREs[2] when volumes get large enough.

--
Chris BeHanna
chris@behanna.org

[1] Because it's very, very hard. Panasas has been at it, full time, for more than ten years. Spinnaker was at it for a long time, too, prior to the NetApp acquisition. There's also Storage Tank and GFS, and there was Zambeel, and a few others.

[2] Garth Gibson talks about UREs on page 2: http://gcn.com/articles/2008/07/25/garth-gibson--faster-storage-systems-through-parallelism.aspx
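For the scrub-and-extra-copies point above, a couple of stock ZFS commands make it concrete (the pool and dataset names are invented):

    # Kick off a parity scrub and see what it turned up:
    zpool scrub tank
    zpool status -v tank

    # Keep a second copy of blocks in a dataset you really care about
    # (only affects data written after the property is set):
    zfs set copies=2 tank/important

None of that substitutes for small failure domains or real backups; it just gives you earlier warning and a little more to fall back on.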