From: Liam Slusser <lslusser@gmail.com>
Date: Mon, 7 Mar 2016 12:55:46 -0800
Subject: Re: [zfs] [developer] Re: [smartos-discuss] an interesting survey -- the zpool with most disks you have ever built
To: zfs@lists.illumos.org
Cc: smartos-discuss@lists.smartos.org, developer@lists.open-zfs.org, developer, illumos-developer, omnios-discuss, Discussion list for OpenIndiana, zfs-discuss@list.zfsonlinux.org, freebsd-fs@FreeBSD.org, zfs-devel@freebsd.org

I don't have a 2000-drive array (that's amazing!) but I do have two 280-drive arrays which are in production.

Here are the generic stats:

server setup:
  OpenIndiana oi_151
  1 server rack
  Dell R720xd, 64GB RAM, with mirrored 250GB boot disks
  5 x LSI 9207-8e dual-port SAS PCIe host bus adapters
  Intel 10G fibre ethernet (dual port)
  2 x SSD for log
  2 x SSD for cache
  23 x Dell MD1200 with 3T, 4T, or 6T NL-SAS disks (a mix of Toshiba, Western Digital, and Seagate drives - basically whatever Dell sends)

zpool setup:
  23 x 12-disk raidz2 vdevs glued together - 276 disks in total. Basically, each new 12-disk MD1200 shelf becomes a new raidz2 vdev added to the pool.

Total size: ~797T
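To give a rough idea of how a pool like that grows over time - the pool name and cXtYdZ device names below are made up for illustration, not our real layout - each new shelf simply becomes another raidz2 top-level vdev:

  # Illustrative only: pool and device names are hypothetical.
  # Day one: a single 12-disk raidz2 vdev from the first MD1200 shelf.
  zpool create tank raidz2 \
      c1t0d0 c1t1d0 c1t2d0 c1t3d0 c1t4d0 c1t5d0 \
      c1t6d0 c1t7d0 c1t8d0 c1t9d0 c1t10d0 c1t11d0

  # SSDs for log and cache devices.
  zpool add tank log mirror c2t0d0 c2t1d0
  zpool add tank cache c2t2d0 c2t3d0

  # Each additional 12-disk shelf is added as one more raidz2 vdev,
  # which is how the pool eventually reaches 23 vdevs / 276 disks.
  zpool add tank raidz2 \
      c3t0d0 c3t1d0 c3t2d0 c3t3d0 c3t4d0 c3t5d0 \
      c3t6d0 c3t7d0 c3t8d0 c3t9d0 c3t10d0 c3t11d0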
We have an identical second server to which we replicate changes via zfs snapshots every few minutes. The whole setup has been up and running for a few years now with no issues.

As we run low on space we purchase two additional MD1200 shelves (one for each system) and add the new raidz2 vdevs to the pools on the fly.

The only real issue we've had is that sometimes a disk fails in such a way (think Monty Python and the Holy Grail: "I'm not dead yet") that it hasn't failed outright but keeps timing out, which slows the whole array to a standstill until we can manually find and remove the disk. The other problem is that once a disk has been replaced, the resilver can sometimes take an eternity. We have also found that the snapshot replication can interfere with the resilver - the resilver gets stuck at 99% and never ends - so we end up stopping replication, or doing only one replication a day, until the resilver is done.

The last helpful hint I have is lowering all the drive timeouts; see
http://everycity.co.uk/alasdair/2011/05/adjusting-drive-timeouts-with-mdb-on-solaris-or-openindiana/
for the details.
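For reference, the usual knob there on illumos is the sd driver's sd_io_time tunable (default 60 seconds), which can be changed live with mdb and made persistent in /etc/system. The value of 10 below is only an example - the post above walks through picking a sane number:

  # Example only - pick a value that suits your drives (see the link above).
  echo "sd_io_time/W 0t10" | mdb -kw            # change the running kernel
  echo "set sd:sd_io_time = 10" >> /etc/system  # make it stick across reboots

And for anyone curious, a replication job like that can be as simple as an incremental zfs send piped over ssh, with a guard that skips the run while a resilver is active so the two don't fight. A simplified sketch (pool, dataset, and host names are made up, and it assumes the initial full send/receive has already been done):

  #!/bin/sh
  # Simplified replication job - illustrative names, no error handling.
  DS=tank/data            # dataset to replicate (hypothetical)
  REMOTE=backup-host      # the identical second server (hypothetical)
  POOL=${DS%%/*}

  # Skip this run if a resilver is in progress.
  if zpool status "$POOL" | grep "resilver in progress" > /dev/null 2>&1; then
      exit 0
  fi

  PREV=$(zfs list -H -d 1 -t snapshot -o name -s creation "$DS" | tail -1)
  NEW="$DS@repl-$(date +%Y%m%d%H%M%S)"

  zfs snapshot -r "$NEW"
  # Send everything between the previous snapshot and the new one.
  zfs send -R -i "${PREV##*@}" "$NEW" | ssh "$REMOTE" zfs receive -F "$DS"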
thanks,
liam

On Sun, Mar 6, 2016 at 10:18 PM, Fred Liu wrote:

> 2016-03-07 14:04 GMT+08:00 Richard Elling:
>
>> On Mar 6, 2016, at 9:06 PM, Fred Liu wrote:
>>
>> 2016-03-06 22:49 GMT+08:00 Richard Elling <richard.elling@richardelling.com>:
>>
>>> On Mar 3, 2016, at 8:35 PM, Fred Liu wrote:
>>>
>>> Hi,
>>>
>>> Today when I was reading Jeff's new nuclear weapon -- DSSD D5's CUBIC
>>> RAID introduction, the interesting survey -- the zpool with the most
>>> disks you have ever built -- popped into my brain.
>>>
>>> We test to 2,000 drives. Beyond 2,000 there are some scalability issues
>>> that impact failover times. We've identified these and know what to fix,
>>> but need a real customer at this scale to bump it to the top of the
>>> priority queue.
>>>
>>> [Fred]: Wow! 2000 drives almost need 4~5 whole racks!
>>>
>>> Since zfs doesn't support nested vdevs, the maximum fault tolerance
>>> should be three (from raidz3).
>>>
>>> Pedantically, it is N, because you can have N-way mirroring.
>>
>> [Fred]: Yeah. That is just pedantic. N-way mirroring of every disk works
>> in theory and rarely happens in reality.
>>
>>> That leaves you stranded if you want to build a very huge pool.
>>>
>>> Scaling redundancy by increasing parity improves data loss protection
>>> by about 3 orders of magnitude. Adding capacity by striping reduces
>>> data loss protection by 1/N. This is why there is not much need to go
>>> beyond raidz3. However, if you do want to go there, adding raidz4+ is
>>> relatively easy.
>>
>> [Fred]: I assume you used striped raidz3 vdevs in your storage mesh of
>> 2000 drives. If that is true, the possibility of a 4-disk failure out of
>> 2000 drives will not be so low. Plus, resilvering takes longer if a
>> single disk has a bigger capacity. And further, the cost of
>> over-provisioning spare disks vs raidz4+ would be a worthwhile trade-off
>> when the storage mesh is at the scale of 2000 drives.
>>
>> Please don't assume, you'll just hurt yourself :-)
>> For example, do not assume the only option is striping across raidz3
>> vdevs. Clearly, there are many different options.
>
> [Fred]: Yeah. Assumptions always go far away from facts! ;-) Is designing
> a storage mesh with 2000 drives a business secret? Or is it just too
> complicated to elaborate? Never mind. ;-)
>
> Thanks.
>
> Fred