From: Dennis Glatting <dg@pki2.com>
To: Kevin Day
Cc: freebsd-fs@freebsd.org
Subject: Re: [Fwd: Re: Large ZFS arrays?]
Date: Tue, 17 Jun 2014 08:47:16 -0700
Message-ID: <1403020036.4722.445.camel@btw.pki2.com>
In-Reply-To:
References: <1402846984.4722.363.camel@btw.pki2.com>

On Sun, 2014-06-15 at 11:00 -0500, Kevin Day wrote:
> On Jun 15, 2014, at 10:43 AM, Dennis Glatting wrote:
> >
> > Total. I am looking at three pieces in total:
> >
> >   * Two 1PB storage "blocks" providing load sharing and
> >     mirroring for failover.
> >
> >   * One 5PB storage block for on-line archives (3-5 years).
> >
> > The 1PB nodes will be divided into something that makes sense, such
> > as multiple SuperMicro 847 chassis with 3TB disks providing some
> > number of volumes. Division is a function of application, such as
> > 100TB RAIDz2 volumes for bulk storage and smaller 8TB volumes for
> > active data such as iSCSI, databases, and home directories.
> >
> > Thanks.
>
> We're currently using multiples of the SuperMicro 847 chassis with
> 3TB and 4TB drives, and LSI 9207 controllers. Each 45-drive array is
> configured as four 11-drive raidz2 groups plus one hot spare.
>
> A few notes:
>
> 1) I'd highly recommend against grouping them together into one giant
> zpool unless you really, really have to. We just spent a lot of time
> redoing everything so that each 45-drive array is its own
> zpool/filesystem. You're otherwise putting all your eggs into one
> very big basket, and if something went wrong you'd lose everything
> rather than just a subset of your data. If you don't do this, you'll
> almost definitely have to run with sync=disabled, or the number of
> sync requests hitting every drive will kill write performance.
>
> 2) You definitely want a JBOD controller instead of a smart RAID
> controller. The LSI 9207 works pretty well, but when you exceed 192
> drives it complains on boot-up about running out of heap space and
> makes you press a key to continue, after which it works fine. There
> is a very recently released firmware update for the card that seems
> to fix this, but we haven't completed testing yet. You'll also want
> to increase hw.mps.max_chains. The driver warns you when you need to,
> but you need to reboot to change it, and you're probably only going
> to discover this under heavy load.
>
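For reference, one of those 45-drive chassis laid out the way you
describe (its own pool, four 11-drive raidz2 vdevs plus a hot spare)
comes out to something like the following. The pool name and the da*
numbering are placeholders; the real device mapping depends on the
HBA and enclosure:

  # One 45-drive chassis as its own pool: 4 x 11-disk raidz2 + 1 spare.
  zpool create tank1 \
      raidz2 da0  da1  da2  da3  da4  da5  da6  da7  da8  da9  da10 \
      raidz2 da11 da12 da13 da14 da15 da16 da17 da18 da19 da20 da21 \
      raidz2 da22 da23 da24 da25 da26 da27 da28 da29 da30 da31 da32 \
      raidz2 da33 da34 da35 da36 da37 da38 da39 da40 da41 da42 da43 \
      spare  da44

In practice you would want GPT labels (or diskid names) rather than
raw da numbers, so the pool survives devices being renumbered.
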
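For anyone who hits the chain warning: hw.mps.max_chains is a loader
tunable, which is why the change needs a reboot. Something along these
lines, where 4096 is only an example value; and if memory serves,
mps(4) also exposes per-adapter counters that show whether you are
running short:

  # /boot/loader.conf: read at boot, so changing it requires a reboot
  hw.mps.max_chains="4096"

  # Per-adapter counters worth watching under load (adapter 0 shown)
  sysctl dev.mps.0.chain_free
  sysctl dev.mps.0.chain_alloc_fail
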
I had discovered the chains problem on some of my systems. Like most
of the people on this list, I have a small data center in my home,
though the spouse had the noisy servers "relocated" to the garage. :)

> 3) We've played with L2ARC SSD devices, and aren't seeing much gain.
> It appears that our active data set is so large that it'd need a
> huge SSD to even hit a small percentage of our frequently used
> files. Setting "secondarycache=metadata" does seem to help a bit,
> but it's probably not worth the hassle for us. This will probably
> depend entirely on your workload, though.

I'm curious whether you have tried the TB or near-TB SSDs? I haven't
looked into whether they are at all reliable, or fast.

> 4) "zfs destroy" can be excruciatingly expensive on large datasets.
> http://blog.delphix.com/matt/2012/07/11/performance-of-zfs-destroy/
> It's a bit better now, but don't assume you can "zfs destroy"
> without killing performance for everything.

Is that still a problem? Both FreeBSD and ZFS-on-Linux had a
significant problem with destroy, but I am under the impression that
it is now backgrounded on FreeBSD (ZoL, however, destroyed the pool
with dedup data). It's been several months since I deleted TB-sized
files, but I seem to recall that non-dedup destroys are now fine
while dedup will forever suck.

> If you have specific questions, I'm happy to help, but I think most
> of the advice I can offer is going to be workload specific. If I had
> to do it all over again, I'd probably break things down into many
> smaller servers rather than trying to put as much onto one.

The replication is for on-line failover. HAST may be an option, but I
haven't looked into it.

-- 
Dennis Glatting

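On the L2ARC point above, the two knobs come down to the following;
the pool and device names here are only placeholders:

  # Attach an SSD as an L2ARC (cache) device to an existing pool
  zpool add tank1 cache ada4

  # Cache only metadata in the L2ARC, as suggested above
  zfs set secondarycache=metadata tank1

Whether any of that helps will, as noted above, depend on how large
the working set is relative to the SSD.
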
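On the destroy question, the backgrounding referred to above is the
async_destroy pool feature, and a large destroy that is still being
reclaimed shows up in the pool's "freeing" property. A quick way to
check, with the pool name again a placeholder:

  # The feature must be active for destroys to run in the background
  zpool get feature@async_destroy tank1

  # Space still waiting to be reclaimed by a backgrounded destroy
  zpool get freeing tank1

Deduplicated data is the exception: freeing it still has to walk the
dedup table, which is why it remains slow.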