From: Dennis Glatting <dg@pki2.com>
To: Kevin Day
Cc: freebsd-fs@freebsd.org
Subject: Re: [Fwd: Re: Large ZFS arrays?]
Date: Tue, 17 Jun 2014 08:47:16 -0700
Message-ID: <1403020036.4722.445.camel@btw.pki2.com>
In-Reply-To:
References: <1402846984.4722.363.camel@btw.pki2.com>

On Sun, 2014-06-15 at 11:00 -0500, Kevin Day wrote:
> On Jun 15, 2014, at 10:43 AM, Dennis Glatting wrote:
> >
> > Total. I am looking at three pieces in total:
> >
> >   * Two 1PB storage "blocks" providing load sharing and
> >     mirroring for failover.
> >
> >   * One 5PB storage block for on-line archives (3-5 years).
> >
> > The 1PB nodes will be divided into something that makes sense, such
> > as multiple SuperMicro 847 chassis with 3TB disks providing some
> > number of volumes. Division is a function of application, such as
> > 100TB RAIDz2 volumes for bulk storage and smaller 8TB volumes for
> > active data such as iSCSI, databases, and home directories.
> >
> > Thanks.
>
> We're currently using multiples of the SuperMicro 847 chassis with
> 3TB and 4TB drives, and LSI 9207 controllers. Each 45-drive array is
> configured as four 11-drive raidz2 groups plus one hot spare.
>
> A few notes:
>
> 1) I'd highly recommend against grouping them together into one giant
> zpool unless you really, really have to. We just spent a lot of time
> redoing everything so that each 45-drive array is its own
> zpool/filesystem. You're otherwise putting all your eggs into one
> very big basket, and if something went wrong you'd lose everything
> rather than just a subset of your data. If you don't do this, you'll
> almost definitely have to run with sync=disabled, or the number of
> sync requests hitting every drive will kill write performance.
>
> 2) You definitely want a JBOD controller instead of a smart RAID
> controller. The LSI 9207 works pretty well, but when you exceed 192
> drives it complains on boot-up about running out of heap space and
> makes you press a key to continue, after which it works fine. There
> is a very recently released firmware update for the card that seems
> to fix this, but we haven't completed testing yet. You'll also want
> to increase hw.mps.max_chains. The driver warns you when you need to,
> but you need to reboot to change it, and you're probably only going
> to discover this under heavy load.
>
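For reference, one of those 45-drive chassis laid out the way you
describe (its own pool, four 11-drive raidz2 vdevs plus a hot spare)
comes out to something like the following. The pool name and the da*
numbering are placeholders; the real device mapping depends on the
HBA and enclosure:

  # One 45-drive chassis as its own pool: 4 x 11-disk raidz2 + 1 spare.
  zpool create tank1 \
      raidz2 da0  da1  da2  da3  da4  da5  da6  da7  da8  da9  da10 \
      raidz2 da11 da12 da13 da14 da15 da16 da17 da18 da19 da20 da21 \
      raidz2 da22 da23 da24 da25 da26 da27 da28 da29 da30 da31 da32 \
      raidz2 da33 da34 da35 da36 da37 da38 da39 da40 da41 da42 da43 \
      spare  da44

In practice you would want GPT labels (or diskid names) rather than
raw da numbers, so the pool survives devices being renumbered.
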
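For anyone who hits the chain warning: hw.mps.max_chains is a loader
tunable, which is why the change needs a reboot. Something along these
lines, where 4096 is only an example value; and if memory serves,
mps(4) also exposes per-adapter counters that show whether you are
running short:

  # /boot/loader.conf: read at boot, so changing it requires a reboot
  hw.mps.max_chains="4096"

  # Per-adapter counters worth watching under load (adapter 0 shown)
  sysctl dev.mps.0.chain_free
  sysctl dev.mps.0.chain_alloc_fail
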
I had discovered the chains problem on some of my systems. Like most
of the people on this list, I have a small data center in my home,
though the spouse had the noisy servers "relocated" to the garage. :)

> 3) We've played with L2ARC SSD devices, and aren't seeing much gain.
> It appears that our active data set is so large that it'd need a
> huge SSD to even hit a small percentage of our frequently used
> files. Setting "secondarycache=metadata" does seem to help a bit,
> but it's probably not worth the hassle for us. This will probably
> depend entirely on your workload, though.

I'm curious whether you have tried the TB or near-TB SSDs? I haven't
looked into whether they are at all reliable, or fast.

> 4) "zfs destroy" can be excruciatingly expensive on large datasets.
> http://blog.delphix.com/matt/2012/07/11/performance-of-zfs-destroy/
> It's a bit better now, but don't assume you can "zfs destroy"
> without killing performance for everything.

Is that still a problem? Both FreeBSD and ZFS-on-Linux had a
significant problem with destroy, but I am under the impression that
it is now backgrounded on FreeBSD (ZoL, however, destroyed the pool
with dedup data). It's been several months since I deleted TB-sized
files, but I seem to recall that non-dedup destroys are now fine
while dedup will forever suck.

> If you have specific questions, I'm happy to help, but I think most
> of the advice I can offer is going to be workload specific. If I had
> to do it all over again, I'd probably break things down into many
> smaller servers rather than trying to put as much onto one.

The replication is for on-line failover. HAST may be an option, but I
haven't looked into it.

-- 
Dennis Glatting

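On the L2ARC point above, the two knobs come down to the following;
the pool and device names here are only placeholders:

  # Attach an SSD as an L2ARC (cache) device to an existing pool
  zpool add tank1 cache ada4

  # Cache only metadata in the L2ARC, as suggested above
  zfs set secondarycache=metadata tank1

Whether any of that helps will, as noted above, depend on how large
the working set is relative to the SSD.
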
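On the destroy question, the backgrounding referred to above is the
async_destroy pool feature, and a large destroy that is still being
reclaimed shows up in the pool's "freeing" property. A quick way to
check, with the pool name again a placeholder:

  # The feature must be active for destroys to run in the background
  zpool get feature@async_destroy tank1

  # Space still waiting to be reclaimed by a backgrounded destroy
  zpool get freeing tank1

Deduplicated data is the exception: freeing it still has to walk the
dedup table, which is why it remains slow.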