Date:        Mon, 10 Jun 2013 04:12:35 -0700
From:        Jeremy Chadwick <jdc@koitsu.org>
To:          Pierre Lemazurier <pierre@lemazurier.fr>
Cc:          freebsd-fs@freebsd.org
Subject:     Re: [ZFS] Raid 10 performance issues
Message-ID:  <20130610111235.GB61858@icarus.home.lan>
In-Reply-To: <51B59257.3070500@lemazurier.fr>
References:  <51B1EBD1.9010207@gmail.com> <51B1F726.7090402@lemazurier.fr> <51B59257.3070500@lemazurier.fr>
On Mon, Jun 10, 2013 at 10:46:15AM +0200, Pierre Lemazurier wrote:
> I add my /boot/loader.conf for more information:
>
> zfs_load="YES"
> vm.kmem_size="22528M"
> vfs.zfs.arc_min="20480M"
> vfs.zfs.arc_max="20480M"
> vfs.zfs.prefetch_disable="0"
> vfs.zfs.txg.timeout="5"
> vfs.zfs.vdev.max_pending="10"
> vfs.zfs.vdev.min_pending="4"
> vfs.zfs.write_limit_override="0"
> vfs.zfs.no_write_throttle="0"

Please remove these variables:

vm.kmem_size="22528M"
vfs.zfs.arc_min="20480M"

You no longer need to set vm.kmem_size (that was addressed long ago,
during the mid-days of stable/8), and you should let the ARC shrink if
need be.  My concern is that limiting the lower end of the ARC size may
be triggering some other portion of FreeBSD's VM or ZFS to behave oddly
-- no proof/evidence, just guesswork on my part.  At bare minimum,
*definitely* remove the vm.kmem_size setting.

Next, please remove the following variables, as they serve no purpose
(they are the defaults in 9.1-RELEASE):

vfs.zfs.prefetch_disable="0"
vfs.zfs.txg.timeout="5"
vfs.zfs.vdev.max_pending="10"
vfs.zfs.vdev.min_pending="4"
vfs.zfs.write_limit_override="0"
vfs.zfs.no_write_throttle="0"

So in short, all you should have in your loader.conf is:

zfs_load="yes"
vfs.zfs.arc_max="20480M"
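After rebooting, it is worth confirming what the kernel actually picked
up.  A minimal sketch, assuming the stock FreeBSD 9.x sysctl names (all
values are reported in bytes):

  # loader tunables as the running kernel sees them
  sysctl vm.kmem_size vfs.zfs.arc_min vfs.zfs.arc_max

  # current ARC size, handy to watch it grow and shrink under load
  sysctl kstat.zfs.misc.arcstats.size

Comparing vfs.zfs.arc_max against arcstats.size across a few test runs
will tell you whether the ARC cap you kept is actually the ceiling you
think it is.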
> On 07/06/2013 17:07, Pierre Lemazurier wrote:
> >Hi, I think I suffer from write and read performance issues on my zpool.
> >
> >About my system and hardware:
> >
> >uname -a
> >FreeBSD bsdnas 9.1-RELEASE FreeBSD 9.1-RELEASE #0 r243825: Tue Dec 4
> >09:23:10 UTC 2012
> >root@farrell.cse.buffalo.edu:/usr/obj/usr/src/sys/GENERIC amd64
> >
> >sysinfo -a : http://www.privatepaste.com/b32f34c938

Going forward, I would recommend also providing "dmesg".  It is a lot
easier for most of us to read.  All I can work out is that your storage
controller uses mps(4), but I can't see any of the important details
about it.  dmesg would give that, not this weird "sysinfo" thing.

I would also like to request "pciconf -lvbc" output.

> >- 24 GB (4GB x 6) DDR3 ECC :
> >http://www.ec.kingston.com/ecom/configurator_new/partsinfo.asp?ktcpartno=KVR16R11D8/4HC
> >
> >- 14x this drive :
> >http://www.wdc.com/global/products/specs/?driveID=1086&language=1

Worth pointing out for readers: these are 4096-byte sector 2TB WD Red
drives.

> >- server :
> >http://www.supermicro.com/products/system/1u/5017/sys-5017r-wrf.cfm?parts=show
> >
> >- CPU :
> >http://ark.intel.com/fr/products/64594/Intel-Xeon-Processor-E5-2620-15M-Cache-2_00-GHz-7_20-GTs-Intel-QPI
> >
> >- chassis :
> >http://www.supermicro.com/products/chassis/4u/847/sc847e16-rjbod1.cfm
> >
> >- HBA SAS controller :
> >http://www.lsi.com/products/storagecomponents/Pages/LSISAS9200-8e.aspx
> >
> >- cable between chassis and server :
> >http://www.provantage.com/supermicro-cbl-0166l~7SUPA01R.htm
> >
> >I use this command to test write speed : dd if=/dev/zero of=test.dd bs=2M count=10000
> >I use this command to test read speed : dd if=test.dd of=/dev/null bs=2M count=10000
> >
> >Of course there is no compression on the ZFS dataset.
> >
> >Test on one of these disks, formatted with UFS:
> >
> >Write :
> >some gstat readings : http://www.privatepaste.com/dd31fafaa6
> >speed around 140 MB/s and something like 1100 IOPS
> >dd result : 20971520000 bytes transferred in 146.722126 secs (142933589 bytes/sec)
> >
> >Read :
> >I think I am reading from RAM (20971520000 bytes transferred in
> >8.813298 secs (2379531480 bytes/sec)).
> >Then I ran the test on the raw device (dd if=/dev/gpt/disk14.nop
> >of=/dev/null bs=2M count=10000)
> >some gstat readings : http://www.privatepaste.com/d022b7c480
> >speed around 140 MB/s again and near 1100+ IOPS
> >dd result : 20971520000 bytes transferred in 142.895212 secs (146761530 bytes/sec)

Looks about right for a single WD Red 2TB drive.

Important: THIS IS A SINGLE DRIVE.

> >ZFS - I created my zpool this way : http://www.privatepaste.com/e74d9cc3b9

Looks good to me.  This is effectively RAID-10 as you said (a stripe of
mirrors).

> >zpool status : http://www.privatepaste.com/0276801ef6
> >zpool get all : http://www.privatepaste.com/74b37a2429
> >zfs get all : http://www.privatepaste.com/e56f4a33f8
> >zfs-stats -a : http://www.privatepaste.com/f017890aa1
> >zdb : http://www.privatepaste.com/7d723c5556
> >
> >With this setup I hope to get nearly 7x more write speed and nearly
> >14x more read speed than the UFS device alone -- realistically,
> >something like 850 MB/s for writes and 1700 MB/s for reads.

Your hopes may be shattered by the reality of how controllers behave
and operate (performance-wise), as well as many other things, including
some ZFS tunables.  We shall see.

> >ZFS - test :
> >
> >Write :
> >gstat readings : http://www.privatepaste.com/7cefb9393a
> >zpool iostat -v 1 of the fastest try : http://www.privatepaste.com/8ade4defbe
> >dd result : 20971520000 bytes transferred in 54.326509 secs (386027381 bytes/sec)
> >
> >386 MB/s, less than half of what I expected.

One thing to be aware of: while the dd took 54 seconds, the I/O to the
pool probably continued for long after that.  Your average speed to
each disk at that time was (just estimating here) ~55 MBytes/second.

I would assume what you're seeing above is mostly the speed between
/dev/zero and the ZFS ARC, with (of course) the controller and driver
in the way.  We know that your disks can do about 110-140 MBytes/second
each, so the performance hit has to be in one of the following places:

1. ZFS itself,
2. Controller, controller driver (mps(4)), or controller firmware,
3. On-die MCH (memory controller),
4. PCIe bus speed limitations or other whatnots.

The place to start is with #1, ZFS.  See the bottom of my mail for
advice.

> >Read :
> >I exported and imported the pool to limit the ARC effect.  I don't
> >know how to do better; I hope that is sufficient.

You could have checked using "top -b" (before and after export); look
for the "ARC" line.  I tend to just reboot the system, but export
should result in a full flush of pending I/O (from the ARC, etc.) to
all the devices.  I would do this, wait about 15 seconds, and check
with gstat before doing more performance tests.
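To make that concrete, the sequence I have in mind looks roughly like
this -- a sketch only, with "tank" and the test file path standing in
for your actual pool name and dataset:

  zpool export tank                     # flush pending I/O; drops this pool's data from the ARC
  sleep 15                              # let the disks settle
  top -b | grep ARC                     # note how far the ARC size dropped
  sysctl kstat.zfs.misc.arcstats.size   # same number, in bytes
  zpool import tank
  # re-run the read test, watching gstat in another terminal:
  dd if=/tank/test.dd of=/dev/null bs=2M count=10000

If the ARC figure barely moves between export and import, the read
numbers you get afterwards are still largely a cache benchmark.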
> >gstat readings : http://www.privatepaste.com/130ce43af1
> >zpool iostat -v 1 : http://privatepaste.com/eb5f9d3432
> >dd result : 20971520000 bytes transferred in 30.347214 secs (691052563 bytes/sec)
> >
> >690 MB/s, 2.5x less than I expected.
> >
> >It does not appear to be a hardware issue: when I run a dd test on
> >each whole disk at the same time with the command dd if=/dev/gpt/diskX
> >of=/dev/null bs=1M count=10000, I get this gstat reading :
> >http://privatepaste.com/df9f63fd4d
> >
> >Near 130 MB/s for each device, about what I expect.

You're thinking of hardware in too simple a fashion -- if only it were
that simple.

> >In your opinion, where does the problem come from?

Not enough information at this time to narrow down where the issue is.
Things to try:

1. Start with the initial loader.conf modifications I stated.  The
   vm.kmem_size removal may help.

2. Possibly try vfs.zfs.no_write_throttle="1" in loader.conf, reboot,
   and re-do this test.  What that tunable does:

   https://blogs.oracle.com/roch/entry/the_new_zfs_write_throttle

   You can also Google "vfs.zfs.no_write_throttle" and see that it has
   been discussed quite a bit, including some folks saying performance
   increases tremendously when they set this to 1.

3. Given the massive size of your disk array and how much memory you
   have, you may also want to consider adjusting some of these
   (possibly increasing vfs.zfs.txg.timeout so that I/O flushing to
   your disks happens *less* often; I haven't tinkered with the other
   two):

   vfs.zfs.txg.timeout="5"
   vfs.zfs.vdev.max_pending="10"
   vfs.zfs.vdev.min_pending="4"

   These also come to mind (these are the defaults):

   vfs.zfs.write_limit_max="1069071872"
   vfs.zfs.write_limit_min="33554432"

   sysctl -d will give you descriptions of these.  I have never had to
   tune any of them, but that is also because the pools I have built
   have consisted of much smaller numbers of disks (3 or 4 at most).
   I am also used to ahci(4) and have avoided all other controllers
   for a multitude of reasons (not saying that's the cause of your
   problem here, just saying that's the stance I've chosen to take).

   You might also try limiting your ARC maximum (vfs.zfs.arc_max) to
   something smaller -- say, 8 GBytes -- and see if that has an
   effect.

4. "sysctl -a | grep zfs" is a very useful piece of information that
   you should provide along with "gstat" and "zpool iostat -v".  The
   counters shown there are very, very helpful a lot of the time;
   there are particular ones that indicate certain
   performance-hindering scenarios.

5. Your "UFS test" only exercised a single disk, while your ZFS tests
   exercised 14 disks in a RAID-10-like fashion.  You could try
   reproducing the RAID-10 setup using gvinum(8) with UFS on top and
   see what sort of performance you get there.

6. Try re-doing the tests with fewer drives involved -- say, 6 instead
   of 14 -- and see if the throughput to each drive increases compared
   to 14 drives.  (See the sketch below for one way to set that up.)
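For item 6, something along these lines would do it -- a sketch only,
with hypothetical gpt labels and pool name; substitute six of your
actual disks (obviously only disks that are not part of your existing
pool, or after destroying it) and tear the test pool down when done:

  # three 2-way mirrors striped together = a 6-disk RAID-10-style pool
  zpool create testpool \
      mirror gpt/disk01 gpt/disk02 \
      mirror gpt/disk03 gpt/disk04 \
      mirror gpt/disk05 gpt/disk06

  zfs set compression=off testpool

  # same dd test as before, watching the per-disk numbers in gstat
  dd if=/dev/zero of=/testpool/test.dd bs=2M count=10000

  zpool destroy testpool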
In general, "profiling" ZFS like this is tricky and requires folks who
are very much in the know and understand how to go about the task.
Others more familiar with how to do this may need to step up to the
plate, but no support/response is guaranteed (if you need that, try
Solaris).

-- 
| Jeremy Chadwick                                   jdc@koitsu.org |
| UNIX Systems Administrator                http://jdc.koitsu.org/ |
| Making life hard for others since 1977.             PGP 4BD6C0CB |