Date: Tue, 17 Jun 2014 08:47:16 -0700
From: Dennis Glatting <dg@pki2.com>
To: Kevin Day <toasty@dragondata.com>
Cc: freebsd-fs@freebsd.org
Subject: Re: [Fwd: Re: Large ZFS arrays?]
Message-ID: <1403020036.4722.445.camel@btw.pki2.com>
In-Reply-To: <F071839B-ED6C-4515-B7C1-D7327CEF12B7@dragondata.com>
References: <1402846984.4722.363.camel@btw.pki2.com>
 <F071839B-ED6C-4515-B7C1-D7327CEF12B7@dragondata.com>
On Sun, 2014-06-15 at 11:00 -0500, Kevin Day wrote:
> On Jun 15, 2014, at 10:43 AM, Dennis Glatting <dg@pki2.com> wrote:
> >
> > Total. I am looking at three pieces in total:
> >
> >   * Two 1PB storage "blocks" providing load sharing and
> >     mirroring for failover.
> >
> >   * One 5PB storage block for on-line archives (3-5 years).
> >
> > The 1PB nodes will be divided into something that makes sense, such
> > as multiple SuperMicro 847 chassis with 3TB disks providing some
> > number of volumes. Division is a function of application, such as
> > 100TB RAIDz2 volumes for bulk storage versus smaller 8TB volumes
> > for active data, such as iSCSI, databases, and home directories.
> >
> > Thanks.
>
> We're currently using multiples of the SuperMicro 847 chassis with
> 3TB and 4TB drives, and LSI 9207 controllers. Each 45-drive array is
> configured as four 11-drive raidz2 groups, plus one hot spare.
>
> A few notes:
>
> 1) I'd highly recommend against grouping them together into one
> giant zpool unless you really, really have to. We just spent a lot
> of time redoing everything so that each 45-drive array is its own
> zpool/filesystem. You're otherwise putting all your eggs into one
> very big basket, and if something went wrong you'd lose everything
> rather than just a subset of your data. If you don't do this, you'll
> almost certainly have to run with sync=disabled, or the number of
> sync requests hitting every drive will kill write performance.
>
> 2) You definitely want a JBOD controller instead of a smart RAID
> controller. The LSI 9207 works pretty well, but when you exceed 192
> drives it complains on boot-up about running out of heap space and
> makes you press a key to continue, after which it works fine. There
> is a very recently released firmware update for the card that seems
> to fix this, but we haven't completed testing yet. You'll also want
> to increase hw.mps.max_chains. The driver warns you when you need
> to, but you need to reboot to change it, and you're probably only
> going to discover this under heavy load.
>

I had discovered the chains problem on some of my systems. Like most
of the people on this list, I have a small data center in my home,
though the spouse had the noisy servers "relocated" to the garage. :)

> 3) We've played with L2ARC SSD devices, and aren't seeing much gain.
> It appears that our active data set is so large that it'd need a
> huge SSD to even hit a small percentage of our frequently used
> files. Setting "secondarycache=metadata" does seem to help a bit,
> but it's probably not worth the hassle for us. This will probably
> depend entirely on your workload, though.
>

I'm curious whether you have tried the TB or near-TB SSDs? I haven't
looked to see whether they are at all reliable, or fast.

> 4) "zfs destroy" can be excruciatingly expensive on large datasets.
> http://blog.delphix.com/matt/2012/07/11/performance-of-zfs-destroy/
> It's a bit better now, but don't assume you can "zfs destroy"
> without killing performance to everything.
>

Is that still a problem? Both FreeBSD and ZFS-on-Linux had a
significant problem on destroy, but I am under the impression that it
is now backgrounded on FreeBSD (ZoL, however, destroyed the pool with
dedup data). It's been several months since I deleted terabytes of
files, but I seem to recall that non-dedup was now good, whereas dedup
will forever suck.
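For anyone reading this later in the archive, the layout Kevin
describes translates to roughly the following. This is a sketch only:
the pool name "shelf0", the dataset "bulk", and the daN device names
are made up for illustration, and the tunables are just the ones
discussed above, not a recommendation.

  # One 45-bay shelf as its own pool: 4 x 11-disk raidz2 plus a spare.
  zpool create shelf0 \
      raidz2 da0  da1  da2  da3  da4  da5  da6  da7  da8  da9  da10 \
      raidz2 da11 da12 da13 da14 da15 da16 da17 da18 da19 da20 da21 \
      raidz2 da22 da23 da24 da25 da26 da27 da28 da29 da30 da31 da32 \
      raidz2 da33 da34 da35 da36 da37 da38 da39 da40 da41 da42 da43 \
      spare  da44

  # Per-dataset knobs mentioned above; both are trade-offs.
  zfs set sync=disabled shelf0/bulk            # only if losing writes that
                                               # were acked as synced is OK
  zfs set secondarycache=metadata shelf0/bulk  # L2ARC holds metadata only

  # In /boot/loader.conf: raise the mps(4) chain frame count before
  # heavy load finds the default. The right value depends on the
  # release and workload; see mps(4).
  hw.mps.max_chains="4096"

On the destroy question, pools that have the async_destroy feature
reclaim the space in the background, and you can watch it drain:

  zpool get feature@async_destroy shelf0
  zpool get freeing shelf0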
> If you have specific questions, I'm happy to help, but I think most
> of the advice I can offer is going to be workload specific. If I had
> to do it all over again, I'd probably break things down into many
> smaller servers rather than trying to put as much onto one.
>

Replication is my plan for on-line failover. HAST may be an option,
but I haven't looked into it.
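The straightforward alternative to HAST would be periodic snapshot
shipping between the two 1PB blocks. A minimal sketch, with the host
name "standby" and the pool/snapshot names invented for illustration:

  # On the primary: snapshot, then send the increment since the
  # previous snapshot to the standby box.
  zfs snapshot tank/active@2014-06-17
  zfs send -i tank/active@2014-06-16 tank/active@2014-06-17 | \
      ssh standby zfs receive -F tank/active

How well that keeps up depends on churn in the active datasets; HAST
would instead give block-level mirroring underneath the pool.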
--
Dennis Glatting <dg@pki2.com>