From: Liam Slusser <lslusser@gmail.com>
Date: Mon, 7 Mar 2016 12:55:46 -0800
Subject: Re: [zfs] [developer] Re: [smartos-discuss] an interesting survey -- the zpool with most disks you have ever built
To: zfs@lists.illumos.org
Cc: smartos-discuss@lists.smartos.org, developer@lists.open-zfs.org, developer, illumos-developer, omnios-discuss, Discussion list for OpenIndiana, zfs-discuss@list.zfsonlinux.org, freebsd-fs@FreeBSD.org, zfs-devel@freebsd.org

I don't have a 2000-drive array (that's amazing!) but I do have two 280-drive arrays which are in production.

Here are the generic stats:

server setup:
  OpenIndiana oi_151
  1 server rack
  Dell R720xd, 64GB RAM, with mirrored 250GB boot disks
  5 x LSI 9207-8e dual-port SAS PCIe host bus adapters
  Intel 10G fibre ethernet (dual port)
  2 x SSD for log
  2 x SSD for cache
  23 x Dell MD1200 with 3T, 4T, or 6T NL-SAS disks (a mix of Toshiba, Western Digital, and Seagate drives - basically whatever Dell sends)

zpool setup:
  23 x 12-disk raidz2 vdevs glued together - 276 disks in total. Basically, each new 12-disk MD1200 shelf becomes a new raidz2 vdev added to the pool.

Total size: ~797T
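To give a rough idea of how a pool like that grows over time - the pool name and cXtYdZ device names below are made up for illustration, not our real layout - each new shelf simply becomes another raidz2 top-level vdev:

  # Illustrative only: pool and device names are hypothetical.
  # Day one: a single 12-disk raidz2 vdev from the first MD1200 shelf.
  zpool create tank raidz2 \
      c1t0d0 c1t1d0 c1t2d0 c1t3d0 c1t4d0 c1t5d0 \
      c1t6d0 c1t7d0 c1t8d0 c1t9d0 c1t10d0 c1t11d0

  # SSDs for log and cache devices.
  zpool add tank log mirror c2t0d0 c2t1d0
  zpool add tank cache c2t2d0 c2t3d0

  # Each additional 12-disk shelf is added as one more raidz2 vdev,
  # which is how the pool eventually reaches 23 vdevs / 276 disks.
  zpool add tank raidz2 \
      c3t0d0 c3t1d0 c3t2d0 c3t3d0 c3t4d0 c3t5d0 \
      c3t6d0 c3t7d0 c3t8d0 c3t9d0 c3t10d0 c3t11d0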
We have an identical second server to which we replicate changes via zfs snapshots every few minutes. The whole setup has been up and running for a few years now with no issues.

As we run low on space we purchase two additional MD1200 shelves (one for each system) and add the new raidz2 vdevs to the pools on the fly.

The only real issue we've had is that sometimes a disk fails in such a way (think Monty Python and the Holy Grail: "I'm not dead yet") that it hasn't failed outright but keeps timing out, which slows the whole array to a standstill until we can manually find and remove the disk. The other problem is that once a disk has been replaced, the resilver can sometimes take an eternity. We have also found that the snapshot replication can interfere with the resilver - the resilver gets stuck at 99% and never ends - so we end up stopping replication, or doing only one replication a day, until the resilver is done.

The last helpful hint I have is lowering all the drive timeouts; see
http://everycity.co.uk/alasdair/2011/05/adjusting-drive-timeouts-with-mdb-on-solaris-or-openindiana/
for the details.
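For reference, the usual knob there on illumos is the sd driver's sd_io_time tunable (default 60 seconds), which can be changed live with mdb and made persistent in /etc/system. The value of 10 below is only an example - the post above walks through picking a sane number:

  # Example only - pick a value that suits your drives (see the link above).
  echo "sd_io_time/W 0t10" | mdb -kw            # change the running kernel
  echo "set sd:sd_io_time = 10" >> /etc/system  # make it stick across reboots

And for anyone curious, a replication job like that can be as simple as an incremental zfs send piped over ssh, with a guard that skips the run while a resilver is active so the two don't fight. A simplified sketch (pool, dataset, and host names are made up, and it assumes the initial full send/receive has already been done):

  #!/bin/sh
  # Simplified replication job - illustrative names, no error handling.
  DS=tank/data            # dataset to replicate (hypothetical)
  REMOTE=backup-host      # the identical second server (hypothetical)
  POOL=${DS%%/*}

  # Skip this run if a resilver is in progress.
  if zpool status "$POOL" | grep "resilver in progress" > /dev/null 2>&1; then
      exit 0
  fi

  PREV=$(zfs list -H -d 1 -t snapshot -o name -s creation "$DS" | tail -1)
  NEW="$DS@repl-$(date +%Y%m%d%H%M%S)"

  zfs snapshot -r "$NEW"
  # Send everything between the previous snapshot and the new one.
  zfs send -R -i "${PREV##*@}" "$NEW" | ssh "$REMOTE" zfs receive -F "$DS"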
thanks,
liam

On Sun, Mar 6, 2016 at 10:18 PM, Fred Liu wrote:

> 2016-03-07 14:04 GMT+08:00 Richard Elling:
>
>> On Mar 6, 2016, at 9:06 PM, Fred Liu wrote:
>>
>> 2016-03-06 22:49 GMT+08:00 Richard Elling <richard.elling@richardelling.com>:
>>
>>> On Mar 3, 2016, at 8:35 PM, Fred Liu wrote:
>>>
>>> Hi,
>>>
>>> Today when I was reading Jeff's new nuclear weapon -- DSSD D5's CUBIC
>>> RAID introduction, the interesting survey -- the zpool with the most
>>> disks you have ever built -- popped into my brain.
>>>
>>> We test to 2,000 drives. Beyond 2,000 there are some scalability issues
>>> that impact failover times. We've identified these and know what to fix,
>>> but need a real customer at this scale to bump it to the top of the
>>> priority queue.
>>>
>>> [Fred]: Wow! 2000 drives almost need 4~5 whole racks!
>>>
>>> Since zfs doesn't support nested vdevs, the maximum fault tolerance
>>> should be three (from raidz3).
>>>
>>> Pedantically, it is N, because you can have N-way mirroring.
>>
>> [Fred]: Yeah. That is just pedantic. N-way mirroring of every disk works
>> in theory and rarely happens in reality.
>>
>>> That leaves you stranded if you want to build a very huge pool.
>>>
>>> Scaling redundancy by increasing parity improves data loss protection
>>> by about 3 orders of magnitude. Adding capacity by striping reduces
>>> data loss protection by 1/N. This is why there is not much need to go
>>> beyond raidz3. However, if you do want to go there, adding raidz4+ is
>>> relatively easy.
>>
>> [Fred]: I assume you used striped raidz3 vdevs in your storage mesh of
>> 2000 drives. If that is true, the possibility of a 4-disk failure out of
>> 2000 drives will not be so low. Plus, resilvering takes longer if a
>> single disk has a bigger capacity. And further, the cost of
>> over-provisioning spare disks vs raidz4+ would be a worthwhile trade-off
>> when the storage mesh is at the scale of 2000 drives.
>>
>> Please don't assume, you'll just hurt yourself :-)
>> For example, do not assume the only option is striping across raidz3
>> vdevs. Clearly, there are many different options.
>
> [Fred]: Yeah. Assumptions always go far away from facts! ;-) Is designing
> a storage mesh with 2000 drives a business secret? Or is it just too
> complicated to elaborate? Never mind. ;-)
>
> Thanks.
>
> Fred