Date: Tue, 17 Jun 2014 08:47:16 -0700
From: Dennis Glatting <dg@pki2.com>
To: Kevin Day <toasty@dragondata.com>
Cc: freebsd-fs@freebsd.org
Subject: Re: [Fwd: Re: Large ZFS arrays?]
Message-ID: <1403020036.4722.445.camel@btw.pki2.com>
In-Reply-To: <F071839B-ED6C-4515-B7C1-D7327CEF12B7@dragondata.com>
References: <1402846984.4722.363.camel@btw.pki2.com>
 <F071839B-ED6C-4515-B7C1-D7327CEF12B7@dragondata.com>
On Sun, 2014-06-15 at 11:00 -0500, Kevin Day wrote:
> On Jun 15, 2014, at 10:43 AM, Dennis Glatting <dg@pki2.com> wrote:
> >
> > Total. I am looking at three pieces in total:
> >
> >   * Two 1PB storage "blocks" providing load sharing and
> >     mirroring for failover.
> >
> >   * One 5PB storage block for on-line archives (3-5 years).
> >
> > The 1PB nodes will be divided into something that makes sense, such
> > as multiple SuperMicro 847 chassis with 3TB disks providing some
> > number of volumes. Division is a function of application, such as
> > 100TB RAIDz2 volumes for bulk storage versus smaller 8TB volumes
> > for active data, such as iSCSI, databases, and home directories.
> >
> > Thanks.
>
> We're currently using multiples of the SuperMicro 847 chassis with
> 3TB and 4TB drives, and LSI 9207 controllers. Each 45-drive array is
> configured as four 11-drive raidz2 groups, plus one hot spare.
>
> A few notes:
>
> 1) I'd highly recommend against grouping them together into one
> giant zpool unless you really, really have to. We just spent a lot
> of time redoing everything so that each 45-drive array is its own
> zpool/filesystem. You're otherwise putting all your eggs into one
> very big basket, and if something went wrong you'd lose everything
> rather than just a subset of your data. If you don't do this, you'll
> almost certainly have to run with sync=disabled, or the number of
> sync requests hitting every drive will kill write performance.
>
> 2) You definitely want a JBOD controller instead of a smart RAID
> controller. The LSI 9207 works pretty well, but when you exceed 192
> drives it complains on boot-up about running out of heap space and
> makes you press a key to continue, after which it works fine. There
> is a very recently released firmware update for the card that seems
> to fix this, but we haven't completed testing yet. You'll also want
> to increase hw.mps.max_chains. The driver warns you when you need
> to, but you need to reboot to change it, and you're probably only
> going to discover this under heavy load.
>

I had discovered the chains problem on some of my systems. Like most
of the people on this list, I have a small data center in my home,
though the spouse had the noisy servers "relocated" to the garage. :)

> 3) We've played with L2ARC SSD devices, and aren't seeing much gain.
> It appears that our active data set is so large that it'd need a
> huge SSD to even hit a small percentage of our frequently used
> files. Setting "secondarycache=metadata" does seem to help a bit,
> but it's probably not worth the hassle for us. This will probably
> depend entirely on your workload, though.
>

I'm curious whether you have tried the TB or near-TB SSDs? I haven't
looked to see whether they are at all reliable, or fast.

> 4) "zfs destroy" can be excruciatingly expensive on large datasets.
> http://blog.delphix.com/matt/2012/07/11/performance-of-zfs-destroy/
> It's a bit better now, but don't assume you can "zfs destroy"
> without killing performance to everything.
>

Is that still a problem? Both FreeBSD and ZFS-on-Linux had a
significant problem on destroy, but I am under the impression that it
is now backgrounded on FreeBSD (ZoL, however, destroyed the pool with
dedup data). It's been several months since I deleted terabytes of
files, but I seem to recall that non-dedup was now good, whereas dedup
will forever suck.
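For anyone reading this later in the archive, the layout Kevin
describes translates to roughly the following. This is a sketch only:
the pool name "shelf0", the dataset "bulk", and the daN device names
are made up for illustration, and the tunables are just the ones
discussed above, not a recommendation.

  # One 45-bay shelf as its own pool: 4 x 11-disk raidz2 plus a spare.
  zpool create shelf0 \
      raidz2 da0  da1  da2  da3  da4  da5  da6  da7  da8  da9  da10 \
      raidz2 da11 da12 da13 da14 da15 da16 da17 da18 da19 da20 da21 \
      raidz2 da22 da23 da24 da25 da26 da27 da28 da29 da30 da31 da32 \
      raidz2 da33 da34 da35 da36 da37 da38 da39 da40 da41 da42 da43 \
      spare  da44

  # Per-dataset knobs mentioned above; both are trade-offs.
  zfs set sync=disabled shelf0/bulk            # only if losing writes that
                                               # were acked as synced is OK
  zfs set secondarycache=metadata shelf0/bulk  # L2ARC holds metadata only

  # In /boot/loader.conf: raise the mps(4) chain frame count before
  # heavy load finds the default. The right value depends on the
  # release and workload; see mps(4).
  hw.mps.max_chains="4096"

On the destroy question, pools that have the async_destroy feature
reclaim the space in the background, and you can watch it drain:

  zpool get feature@async_destroy shelf0
  zpool get freeing shelf0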
> If you have specific questions, I'm happy to help, but I think most
> of the advice I can offer is going to be workload specific. If I had
> to do it all over again, I'd probably break things down into many
> smaller servers rather than trying to put as much onto one.
>

Replication is my plan for on-line failover. HAST may be an option,
but I haven't looked into it.
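The straightforward alternative to HAST would be periodic snapshot
shipping between the two 1PB blocks. A minimal sketch, with the host
name "standby" and the pool/snapshot names invented for illustration:

  # On the primary: snapshot, then send the increment since the
  # previous snapshot to the standby box.
  zfs snapshot tank/active@2014-06-17
  zfs send -i tank/active@2014-06-16 tank/active@2014-06-17 | \
      ssh standby zfs receive -F tank/active

How well that keeps up depends on churn in the active datasets; HAST
would instead give block-level mirroring underneath the pool.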
--
Dennis Glatting <dg@pki2.com>