Date: Sun, 15 Jun 2014 11:00:24 -0500
From: Kevin Day <toasty@dragondata.com>
To: Dennis Glatting <dg@pki2.com>
Cc: freebsd-fs@freebsd.org
Subject: Re: [Fwd: Re: Large ZFS arrays?]
Message-ID: <F071839B-ED6C-4515-B7C1-D7327CEF12B7@dragondata.com>
In-Reply-To: <1402846984.4722.363.camel@btw.pki2.com>
References: <1402846984.4722.363.camel@btw.pki2.com>
On Jun 15, 2014, at 10:43 AM, Dennis Glatting <dg@pki2.com> wrote:
>
> Total. I am looking at three pieces in total:
>
> * Two 1PB storage "blocks" providing load sharing and
>   mirroring for failover.
>
> * One 5PB storage block for on-line archives (3-5 years).
>
> The 1PB nodes will be divided into something that makes sense, such as
> multiple SuperMicro 847 chassis with 3TB disks providing some number of
> volumes. Division is a function of application, such as 100TB RAIDz2
> volumes for bulk storage and smaller 8TB volumes for active data,
> such as iSCSI, databases, and home directories.
>
> Thanks.

We're currently using multiples of the SuperMicro 847 chassis with 3TB and 4TB drives, and LSI 9207 controllers. Each 45-drive array is configured as four 11-drive raidz2 groups, plus one hot spare (a rough sketch of that layout is at the end of this message).

A few notes:

1) I'd highly recommend against grouping them together into one giant zpool unless you really, really have to. We just spent a lot of time redoing everything so that each 45-drive array is its own zpool/filesystem. You're otherwise putting all your eggs into one very big basket, and if something goes wrong you lose everything rather than just a subset of your data. If you don't do this, you'll almost certainly have to run with sync=disabled, or the number of sync requests hitting every drive will kill write performance.

2) You definitely want a JBOD controller instead of a smart RAID controller. The LSI 9207 works pretty well, but once you exceed 192 drives it complains on boot about running out of heap space and makes you press a key to continue, after which it works fine. A very recently released firmware update for the card seems to fix this, but we haven't completed testing yet. You'll also want to increase hw.mps.max_chains (example below). The driver warns you when you need to, but you must reboot to change it, and you're probably only going to discover this under heavy load.

3) We've played with L2ARC SSD devices and aren't seeing much gain. It appears our active data set is so large that we'd need a huge SSD to even hit a small percentage of our frequently used files. Setting "secondarycache=metadata" does seem to help a bit (example below), but it's probably not worth the hassle for us. This will depend entirely on your workload, though.

4) "zfs destroy" can be excruciatingly expensive on large datasets:
http://blog.delphix.com/matt/2012/07/11/performance-of-zfs-destroy/
It's a bit better now, but don't assume you can "zfs destroy" without killing performance for everything else (see the note below).

If you have specific questions, I'm happy to help, but I think most of the advice I can offer is going to be workload specific. If I had to do it all over again, I'd probably break things down into many smaller servers rather than trying to put as much as possible onto one.
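
A few rough sketches to make the above concrete. First, the per-JBOD layout from the intro and point (1): one pool per 45-bay SC847, built from four 11-disk raidz2 vdevs plus a hot spare. The pool name and the da0..da44 device names are placeholders, not our actual configuration:

    # One pool per 45-drive chassis: 4 x 11-disk raidz2 + 1 hot spare.
    # "jbod0" and da0..da44 are illustrative names only.
    zpool create jbod0 \
        raidz2 da0  da1  da2  da3  da4  da5  da6  da7  da8  da9  da10 \
        raidz2 da11 da12 da13 da14 da15 da16 da17 da18 da19 da20 da21 \
        raidz2 da22 da23 da24 da25 da26 da27 da28 da29 da30 da31 da32 \
        raidz2 da33 da34 da35 da36 da37 da38 da39 da40 da41 da42 da43 \
        spare  da44

If you do end up with one huge pool and have to give up on sync writes, that's a per-dataset property ("zfs set sync=disabled jbod0"), but separate pools per chassis let you avoid making that trade at all.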
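
For point (2), hw.mps.max_chains is a boot-time tunable set in /boot/loader.conf. The value below is only an example, not a recommendation; size it based on the warnings the mps(4) driver itself prints when it runs short:

    # /boot/loader.conf -- value is illustrative only
    hw.mps.max_chains="4096"

If memory serves, the driver also exports per-controller counters via sysctl (e.g. dev.mps.0.chain_free and dev.mps.0.chain_alloc_fail), which give you an idea of whether you're close to the limit before you commit to a reboot.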
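
For point (3), if you want to experiment with a metadata-only L2ARC anyway, it's cheap to try and to undo, since cache devices can be added and removed without touching the data. Pool and device names are again placeholders:

    # Attach an SSD as L2ARC and restrict it to metadata
    zpool add jbod0 cache gpt/l2arc0
    zfs set secondarycache=metadata jbod0
    # If it doesn't help, drop it again with no data loss
    zpool remove jbod0 cache gpt/l2arc0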
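
For point (4), newer pools free the space from a big destroy in the background (the async_destroy feature the Delphix post describes). Assuming a feature-flag pool, you can at least watch the backlog drain rather than be surprised by it:

    # Is background (async) destroy available on this pool?
    zpool get feature@async_destroy jbod0
    # After a large "zfs destroy", this counts down as space is reclaimed
    zpool get freeing jbod0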