Subject: Re: [zfs] [developer] Re: [smartos-discuss] an interesting survey --
 the zpool with most disks you have ever built
From: Richard Elling <richard.elling@gmail.com>
In-Reply-To: <CALi05Xxm9Sdx9dXCU4C8YhUTZOwPY+NQqzmMEn5d0iFeOES6gw@mail.gmail.com>
Date: Sun, 6 Mar 2016 22:04:09 -0800
Cc: developer@lists.open-zfs.org,
 "smartos-discuss@lists.smartos.org" <smartos-discuss@lists.smartos.org>,
 developer <developer@open-zfs.org>,
 illumos-developer <developer@lists.illumos.org>,
 omnios-discuss <omnios-discuss@lists.omniti.com>,
 Discussion list for OpenIndiana <openindiana-discuss@openindiana.org>,
 "zfs-discuss@list.zfsonlinux.org" <zfs-discuss@list.zfsonlinux.org>,
 "freebsd-fs@FreeBSD.org" <freebsd-fs@freebsd.org>,
 "zfs-devel@freebsd.org" <zfs-devel@freebsd.org>
Message-Id: <6E2B77D1-E0CA-4901-A6BD-6A22C07536B3@gmail.com>
References: <95563acb-d27b-4d4b-b8f3-afeb87a3d599@me.com>
 <CACTb9pxJqk__DPN_pDy4xPvd6ETZtbF9y=B8U7RaeGnn0tKAVQ@mail.gmail.com>
 <CAJjvXiH9Wh+YKngTvv0XG1HtikWggBDwjr_MCb8=Rf276DZO-Q@mail.gmail.com>
 <56D87784.4090103@broken.net>
 <A5A6EA4AE9DCC44F8E7FCB4D6317B1D203178F1DD392@SH-MAIL.ISSI.COM>
 <5158F354-9636-4031-9536-E99450F312B3@RichardElling.com>
 <CALi05Xxm9Sdx9dXCU4C8YhUTZOwPY+NQqzmMEn5d0iFeOES6gw@mail.gmail.com>
To: zfs@lists.illumos.org


> On Mar 6, 2016, at 9:06 PM, Fred Liu <fred.fliu@gmail.com> wrote:
>
> 2016-03-06 22:49 GMT+08:00 Richard Elling <richard.elling@richardelling.com>:
>
>> On Mar 3, 2016, at 8:35 PM, Fred Liu <Fred_Liu@issi.com> wrote:
>>
>> Hi,
>>
>> Today, while I was reading the introduction to Jeff's new nuclear weapon, DSSD D5's CUBIC RAID,
>> an interesting survey question popped into my head: what is the zpool with the most disks you have ever built?
>
> We test to 2,000 drives. Beyond 2,000 there are some scalability issues that impact failover times.
> We’ve identified these and know what to fix, but need a real customer at this scale to bump it to
> the top of the priority queue.
>
> [Fred]: Wow! 2000 drives would need almost 4 to 5 whole racks!
>>
>> Since ZFS doesn't support nested vdevs, the maximum fault tolerance should be three (from raidz3).
>
> Pedantically, it is N, because you can have N-way mirroring.
>
> [Fred]: Yeah. That is just pedantic. N-way mirroring of every disk works in theory but rarely happens in reality.
>
>> That is limiting if you want to build a very large pool.
>
> Scaling redundancy by increasing parity improves data loss protection by about 3 orders of
> magnitude. Adding capacity by striping reduces data loss protection by 1/N. This is why there is
> not much need to go beyond raidz3. However, if you do want to go there, adding raidz4+ is
> relatively easy.
>
> [Fred]: I assume you used striped raidz3 vdevs in your storage mesh of 2000 drives. If that is true, the probability of 4 out of 2000 drives failing will not be so low.
>            Plus, resilvering takes longer when each disk has a larger capacity. Furthermore, the cost of over-provisioning spare disks vs. raidz4+ becomes a trade-off worth considering
>            when the storage mesh is at the scale of 2000 drives.

Please don't assume, you'll just hurt yourself :-)
For example, do not assume the only option is striping across raidz3 vdevs. Clearly, there are many
different options.
 -- richard
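
The scaling argument quoted above (each added parity level buys roughly
three orders of magnitude, while striping across N vdevs costs a factor
of N) can be illustrated with the classic textbook MTTDL approximation.
This is only a rough sketch, not anything taken from the thread: the
function names, the 1,000,000-hour MTBF, the 24-hour repair time, and the
11-wide vdev layout are all assumptions chosen for illustration, and the
model ignores unrecoverable read errors and correlated failures.

```python
import math

HOURS_PER_YEAR = 24 * 365


def mttdl_vdev(width, parity, mtbf_hours, mttr_hours):
    """Approximate MTTDL (hours) of one vdev that tolerates `parity` failures.

    Textbook Markov approximation: data is lost only when parity + 1 drives
    fail within overlapping repair windows.
    """
    if not 0 <= parity < width:
        raise ValueError("parity must satisfy 0 <= parity < width")
    # width * (width - 1) * ... * (width - parity): drives that can fail next
    chain = math.prod(width - i for i in range(parity + 1))
    return mtbf_hours ** (parity + 1) / (chain * mttr_hours ** parity)


def mttdl_pool(n_vdevs, width, parity, mtbf_hours, mttr_hours):
    """Striping across n_vdevs identical vdevs divides MTTDL by n_vdevs."""
    return mttdl_vdev(width, parity, mtbf_hours, mttr_hours) / n_vdevs


if __name__ == "__main__":
    mtbf, mttr, width = 1_000_000, 24, 11  # assumed values, not from the thread
    for parity in (1, 2, 3):
        single = mttdl_vdev(width, parity, mtbf, mttr) / HOURS_PER_YEAR
        striped = mttdl_pool(180, width, parity, mtbf, mttr) / HOURS_PER_YEAR
        print(f"raidz{parity}: one vdev {single:.3g} years, "
              f"180 striped vdevs {striped:.3g} years")
```

Under these assumptions, going from raidz2 to raidz3 multiplies the
estimate by MTBF / ((width - 3) * MTTR), which is in the thousands, while
striping across 180 vdevs (about 2,000 drives) only divides it by 180.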