From: Freddie Cash <fjwcash@gmail.com>
To: freebsd-current@freebsd.org
Date: Tue, 27 Jan 2009 11:42:41 -0800
Message-Id: <200901271142.41296.fjwcash@gmail.com>
Subject: Re: Help me select hardware....Some real world data that might help

On January 27, 2009 10:41 am Paul Tice wrote:
> Excuse my rambling, perhaps something in this mess will be useful.
>
> I'm currently using 8 cores (2x Xeon E5405), 16G FB-DIMM, and 8 x 750GB
> drives on a backup system (I plan to add the others in the chassis one by
> one, testing the speed along the way), 8-current amd64, ZFS, Marvell
> 88SX6081 PCI-X card (8-port SATA) + LSI1068E (8-port SAS/SATA) for the
> main array, and the Intel onboard SATA for the boot drive(s). Data is
> sucked down through 3 gigabit ports, with another available but not yet
> activated. Array drives all live on the LSI right now. Drives are
> ST3750640AS.
>
> ZFS is stable _IF_ you disable the prefetch and ZIL, otherwise the
> classic ZFS wedge rears its ugly head. I haven't had a chance to test
> just one yet, but I'd guess it's the prefetch that's the quick killer.

You probably don't want to disable the ZIL. That's the intent log, an important part of the data-integrity setup for ZFS. Prefetch has been shown to cause issues on a lot of systems and can be a bottleneck depending on the workload, but the ZIL should stay enabled.

> I've seen references to 8-Current having a kernel memory limit of 8G
> (compared to 2G for pre 8 from what I understand so far) and ZFS ARC

FreeBSD 8.x kmem_max has been bumped to 512 GB.

> Using rsync over several machines with this setup, I'm getting a little
> over 1GB/min to the disks. 'zpool iostat 60' is a wonderful tool.

gstat is even nicer, as it shows the throughput to the individual drives instead of the aggregate that zpool iostat reports. It works at the GEOM level, so it's quite nice for seeing how the I/O is balanced (or not) across the drives in the raidz datasets and across the pool as a whole.

> CPU usage during all this is surprisingly low. rsync is running with -z,

If you are doing rsync over SSH, don't use -z as part of the rsync command. Instead, use -C with ssh. That way, rsync runs in one process and the compression is done by ssh in another, so the work is spread across two CPUs/cores instead of just one. You'll get better throughput, because the rsync process no longer has to do the compression as well as the reading and writing. We got about a 25% boost in throughput by moving the compression out of rsync, and CPU usage balanced across CPUs instead of hogging just one.
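For example, a minimal sketch of the difference (the paths and hostname here are just placeholders, not from the setup above):

  # compression done inside rsync, all in one process
  rsync -az /data/ backup@backuphost:/backups/data/

  # compression handed off to ssh in a separate process
  rsync -a -e "ssh -C" /data/ backup@backuphost:/backups/data/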
> Random ZFS thoughts:
> You cannot shrink/grow a raidz or raidz2.

You can't add devices to a raidz/raidz2 dataset. But you can replace the drives with larger ones, one at a time, resilvering after each swap, and once they have all been replaced the extra space becomes available. Just pull the small drive, insert the large drive, and do a "zpool replace".

And you can add extra raidz/raidz2 datasets to a pool, and ZFS will stripe the data across them. Basically, the pool becomes a RAID 5+0 or RAID 6+0 instead of just a RAID 5/RAID 6. If you have lots of drives, the recommendation from the Solaris folks is to use a bunch of raidz datasets of <=9 disks each, instead of one giant raidz dataset across all the drives. i.e.:

  zpool create pool raidz2 da0 da1 da2 da3 da4 da5
  zpool add pool raidz2 da6 da7 da8 da9 da10 da11
  zpool add pool raidz2 da12 da13 da14 da15 da16 da17

will give you a single pool made up of three raidz2 datasets, with the data striped across the three. And you can add more raidz datasets to the pool as needed.

> You can grow a stripe array,
> I don't know if you can shrink it successfully. You cannot promote a
> stripe array to raidz/z2, nor demote in the other direction. You can have
> hot spares, haven't seen a provision for warm/cold spares.

ZFS in FreeBSD 7.x doesn't support hot spares, in the sense that a faulted drive won't automatically start a rebuild onto a spare drive; you have to run "zpool replace" manually to swap in the spare. ZFS in FreeBSD 8.x does support auto-rebuild onto spare drives (true hot spares).

> /etc/defaults/rc.conf already has cron ZFS status/scrub checks, but not
> enabled.

periodic(8) does ZFS checks as part of the daily run; see /etc/defaults/periodic.conf. However, you can whip up a very simple shell script that does the same thing and run it via cron at whatever interval you want. We use the following, which runs every 15 minutes:

#!/bin/sh

status=$( zpool status -x )

if [ "${status}" != "all pools are healthy" ]; then
        echo "Problems with ZFS: ${status}" | mail -s "ZFS Issues on " \

fi

exit 0

-- 
Freddie
fjwcash@gmail.com