Date: Sun, 15 Jun 2014 11:00:24 -0500
From: Kevin Day <toasty@dragondata.com>
To: Dennis Glatting <dg@pki2.com>
Cc: freebsd-fs@freebsd.org
Subject: Re: [Fwd: Re: Large ZFS arrays?]
Message-ID: <F071839B-ED6C-4515-B7C1-D7327CEF12B7@dragondata.com>
In-Reply-To: <1402846984.4722.363.camel@btw.pki2.com>
References: <1402846984.4722.363.camel@btw.pki2.com>
On Jun 15, 2014, at 10:43 AM, Dennis Glatting <dg@pki2.com> wrote:
>
> Total. I am looking at three pieces in total:
>
> * Two 1PB storage "blocks" providing load sharing and
>   mirroring for failover.
>
> * One 5PB storage block for on-line archives (3-5 years).
>
> The 1PB nodes will be divided into something that makes sense, such as
> multiple SuperMicro 847 chassis with 3TB disks providing some number of
> volumes. Division is a function of application, such as 100TB RAIDz2
> volumes for bulk storage and smaller 8TB volumes for active data,
> such as iSCSI, databases, and home directories.
>
> Thanks.

We're currently using multiples of the SuperMicro 847 chassis with 3TB and 4TB drives, and LSI 9207 controllers. Each 45-drive array is configured as four 11-drive raidz2 groups, plus one hot spare (a rough sketch of that layout is at the end of this message).

A few notes:

1) I'd highly recommend against grouping them together into one giant zpool unless you really, really have to. We just spent a lot of time redoing everything so that each 45-drive array is its own zpool/filesystem. You're otherwise putting all your eggs into one very big basket, and if something goes wrong you lose everything rather than just a subset of your data. If you don't do this, you'll almost certainly have to run with sync=disabled, or the number of sync requests hitting every drive will kill write performance.

2) You definitely want a JBOD controller instead of a smart RAID controller. The LSI 9207 works pretty well, but once you exceed 192 drives it complains on boot about running out of heap space and makes you press a key to continue, after which it works fine. A very recently released firmware update for the card seems to fix this, but we haven't completed testing yet. You'll also want to increase hw.mps.max_chains (example below). The driver warns you when you need to, but you must reboot to change it, and you're probably only going to discover this under heavy load.

3) We've played with L2ARC SSD devices and aren't seeing much gain. It appears our active data set is so large that we'd need a huge SSD to even hit a small percentage of our frequently used files. Setting "secondarycache=metadata" does seem to help a bit (example below), but it's probably not worth the hassle for us. This will depend entirely on your workload, though.

4) "zfs destroy" can be excruciatingly expensive on large datasets:
http://blog.delphix.com/matt/2012/07/11/performance-of-zfs-destroy/
It's a bit better now, but don't assume you can "zfs destroy" without killing performance for everything else (see the note below).

If you have specific questions, I'm happy to help, but I think most of the advice I can offer is going to be workload specific. If I had to do it all over again, I'd probably break things down into many smaller servers rather than trying to put as much as possible onto one.
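
A few rough sketches to make the above concrete. First, the per-JBOD layout from the intro and point (1): one pool per 45-bay SC847, built from four 11-disk raidz2 vdevs plus a hot spare. The pool name and the da0..da44 device names are placeholders, not our actual configuration:

    # One pool per 45-drive chassis: 4 x 11-disk raidz2 + 1 hot spare.
    # "jbod0" and da0..da44 are illustrative names only.
    zpool create jbod0 \
        raidz2 da0  da1  da2  da3  da4  da5  da6  da7  da8  da9  da10 \
        raidz2 da11 da12 da13 da14 da15 da16 da17 da18 da19 da20 da21 \
        raidz2 da22 da23 da24 da25 da26 da27 da28 da29 da30 da31 da32 \
        raidz2 da33 da34 da35 da36 da37 da38 da39 da40 da41 da42 da43 \
        spare  da44

If you do end up with one huge pool and have to give up on sync writes, that's a per-dataset property ("zfs set sync=disabled jbod0"), but separate pools per chassis let you avoid making that trade at all.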
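
For point (2), hw.mps.max_chains is a boot-time tunable set in /boot/loader.conf. The value below is only an example, not a recommendation; size it based on the warnings the mps(4) driver itself prints when it runs short:

    # /boot/loader.conf -- value is illustrative only
    hw.mps.max_chains="4096"

If memory serves, the driver also exports per-controller counters via sysctl (e.g. dev.mps.0.chain_free and dev.mps.0.chain_alloc_fail), which give you an idea of whether you're close to the limit before you commit to a reboot.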
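
For point (3), if you want to experiment with a metadata-only L2ARC anyway, it's cheap to try and to undo, since cache devices can be added and removed without touching the data. Pool and device names are again placeholders:

    # Attach an SSD as L2ARC and restrict it to metadata
    zpool add jbod0 cache gpt/l2arc0
    zfs set secondarycache=metadata jbod0
    # If it doesn't help, drop it again with no data loss
    zpool remove jbod0 cache gpt/l2arc0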
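
For point (4), newer pools free the space from a big destroy in the background (the async_destroy feature the Delphix post describes). Assuming a feature-flag pool, you can at least watch the backlog drain rather than be surprised by it:

    # Is background (async) destroy available on this pool?
    zpool get feature@async_destroy jbod0
    # After a large "zfs destroy", this counts down as space is reclaimed
    zpool get freeing jbod0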