Date: Mon, 07 Jun 2010 15:42:56 -0700
From: "Bradley W. Dutton" <brad@duttonbros.com>
To: freebsd-fs@freebsd.org
Subject: ZFS performance of various vdevs (long post)
Message-ID: <20100607154256.941428ovaq2hha0g@duttonbros.com>
Hi,

I just upgraded a 5x500GB raidz array (no NCQ) to an 8x2TB raidz2 array (NCQ). I was expecting the new setup to absolutely tear through data, given the faster and more numerous drives. While the new setup is considerably faster than the old one, some of the throughput rates weren't as high as I was expecting. I'm hoping to get some help understanding how ZFS is working, or possibly to identify some bottlenecks. My goal is to have ZFS on FreeBSD be the best it can be.

Below are benchmarks of the old 5-drive array (normal/raidz1/raidz2) and of raidz2 on the new 8-drive array. As the new array is already in use, I can't reformat it to test the other vdev types. Sorry in advance if this format is hard to read; let me know if I omitted any key information. I did several runs of each of these commands, and the results were close enough across runs that I don't believe any numbers are skewed by caching.

The PC I'm using to test:

FreeBSD backup 8.1-PRERELEASE FreeBSD 8.1-PRERELEASE #0: Mon May 24 18:45:38 PDT 2010 root@backup:/usr/obj/usr/src/sys/BACKUP amd64
AMD Athlon X2 5600
4 GB of RAM

5 SATA drives, Western Digital RE2 (7200rpm), on the onboard controller (Nvidia nForce 570 SLI MCP, no NCQ):
  WD5001ABYS (3 of these)
  WD5000YS (2 of these)

Supermicro AOC-USAS-L8i PCI Express x8 controller (with NCQ):
  8 Hitachi 2TB 7200rpm drives

Relevant /boot/loader.conf settings:

vm.kmem_size="3G"
vfs.zfs.arc_max="2100M"
vfs.zfs.arc_meta_limit="700M"
vfs.zfs.prefetch_disable="0"

My CPU metrics aren't anything official, just me watching top while these commands run. I mostly tracked CPU to see whether any process was CPU bound. The figures are a percentage of total CPU time on the box, so 50% would be one core maxed out. Changing the dd blocksize didn't seem to affect anything, so I left it at 1M.

Also, if the machine had been running for a while and had various items cached in the ARC, the speeds could be much slower, by as much as half. The first ZFS benchmark ran at half the speed of the numbers below on a warm box (up for several days), so I rebooted to get max speed. The faster numbers weren't due to the data being cached: watching gstat, I saw correspondingly higher per-disk throughput, 60 or 70 Mbytes/sec instead of 30.
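For anyone wanting to sanity-check the same things on their own box, the tunables and the live ARC size are all visible via sysctl (standard FreeBSD ZFS sysctl names, nothing specific to my setup):

#!/bin/sh
# confirm the loader.conf tunables actually took effect
sysctl vm.kmem_size vfs.zfs.arc_max vfs.zfs.arc_meta_limit vfs.zfs.prefetch_disable
# current ARC size in bytes, handy to watch while the dd runs are going
sysctl kstat.zfs.misc.arcstats.size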
The RE2 drives do 70-80 Mbytes/sec sequential reading/writing:

#!/bin/sh
for disk in "ad4" "ad6" "ad10" "ad12" "ad14"
do
        dd if=/dev/${disk} of=/dev/null bs=1m count=4000 &
done

4194304000 bytes transferred in 49.603534 secs (84556556 bytes/sec)
4194304000 bytes transferred in 51.679365 secs (81160130 bytes/sec)
4194304000 bytes transferred in 52.642995 secs (79674494 bytes/sec)
4194304000 bytes transferred in 57.742892 secs (72637581 bytes/sec)
4194304000 bytes transferred in 58.189738 secs (72079789 bytes/sec)

CPU usage is low when doing these 5 reads, <10%.

The Hitachi drives do 120-130 Mbytes/sec sequential read/write:

#!/bin/sh
for disk in "da0" "da1" "da2" "da3" "da4" "da5" "da6" "da7"
do
        dd if=/dev/${disk} of=/dev/null bs=1m count=4000 &
done

4194304000 bytes transferred in 31.980469 secs (131152048 bytes/sec)
4194304000 bytes transferred in 32.349440 secs (129656155 bytes/sec)
4194304000 bytes transferred in 32.776024 secs (127968664 bytes/sec)
4194304000 bytes transferred in 32.951440 secs (127287427 bytes/sec)
4194304000 bytes transferred in 33.048651 secs (126913017 bytes/sec)
4194304000 bytes transferred in 33.057686 secs (126878331 bytes/sec)
4194304000 bytes transferred in 33.374149 secs (125675234 bytes/sec)
4194304000 bytes transferred in 35.226584 secs (119066441 bytes/sec)

CPU usage is around 25-30%.

Now on to the ZFS benchmarks:

#
# a regular ZFS pool for the 5 drive array
#
zpool create bench /dev/ad4 /dev/ad6 /dev/ad10 /dev/ad12 /dev/ad14

dd if=/dev/zero of=/bench/test.file bs=1m count=12000
12582912000 bytes transferred in 39.687730 secs (317047913 bytes/sec)
30-35% CPU

All 5 drives are written to, so we have 317/5 = ~63 Mbytes/sec. That's close to the 70 Mbytes/sec raw speed, so I'm OK with these numbers. I'm not sure how much overhead the checksumming adds; could that account for the remaining gap?

dd if=/bench/test.file of=/dev/null bs=1m
12582912000 bytes transferred in 34.668165 secs (362952928 bytes/sec)
around 30% CPU

All 5 drives are read from, so we have 362/5 = ~72 Mbytes/sec. This looks like max speed, given that the slowest drives in the pool run at this rate.

#
# a ZFS raidz pool for the 5 drive array
#
zpool destroy bench
zpool create bench raidz /dev/ad4 /dev/ad6 /dev/ad10 /dev/ad12 /dev/ad14

dd if=/dev/zero of=/bench/test.file bs=1m count=12000
12582912000 bytes transferred in 54.357053 secs (231486281 bytes/sec)
CPU varied widely, between 30 and 70%, the kernel process using the most, then dd

Only 4 of the 5 drives hold actual data, correct? So we have 231/4 = ~58 Mbytes/sec (similar to what gstat shows). That's a bit slower than our 70 Mbytes/sec reference, and slower than the 63 from the regular vdev.

dd if=/bench/test.file of=/dev/null bs=1m
12582912000 bytes transferred in 45.825533 secs (274582993 bytes/sec)
around 40% CPU, kernel then dd using the most

Again, only 4 of the 5 drives hold data, so the throughput is 274/4 = ~68 Mbytes/sec (again similar to gstat). This is good, close to max speed.

#
# a ZFS raidz2 pool for the 5 drive array
#
zpool destroy bench
zpool create bench raidz2 /dev/ad4 /dev/ad6 /dev/ad10 /dev/ad12 /dev/ad14

dd if=/dev/zero of=/bench/test.file bs=1m count=12000
12582912000 bytes transferred in 97.491160 secs (129067210 bytes/sec)
CPU varied a lot, 15-50%, with a burst or two to 75%

Only 3 of the 5 drives hold actual data, correct? So we have 129/3 = ~43 Mbytes/sec (gstat varied quite a bit here, as low as 5, as high as 60). These speeds are now quite a bit lower than I would expect. Is the parity calculation overhead causing the discrepancy here? Is the CPU too slow?

dd if=/bench/test.file of=/dev/null bs=1m
12582912000 bytes transferred in 58.947959 secs (213457976 bytes/sec)
around 30% CPU

Only 3 of the 5 drives hold data, and I'm not sure how to calculate throughput here. I'm guessing round-robin reads help boost these numbers (read 3 data disks + 1 parity, so only 4 of 5 drives are in use for any given read?). gstat shows rates around 40 Mbytes/sec even though I would expect closer to 60-70. 213/3 = ~71 Mbytes/sec (although I don't think the calculation works this way).
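As an aside, since I cite gstat numbers throughout: this is roughly the invocation I leave running in another terminal during the dd tests (the filter regex just matches my device names; adjust for yours):

#!/bin/sh
# one refresh per second; if your gstat wants the interval in
# microseconds, use -I 1000000 instead
gstat -I 1s -f '^(ad|da)[0-9]+$'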
#
# ZFS raidz2 pool on the 8 drive array
# this pool is about 15% used, so the read/write tests aren't necessarily
# on the fastest part of the disks.
#
zpool create tank raidz2 /dev/da0 /dev/da1 /dev/da2 /dev/da3 /dev/da4 /dev/da5 /dev/da6 /dev/da7

dd if=/dev/zero of=/tank/test.file bs=1m count=12000
12582912000 bytes transferred in 40.878876 secs (307809638 bytes/sec)
varying 40-70% CPU (a few bursts into the 90s), kernel then dd using most of it

307/6 = ~51 Mbytes/sec (gstat varied quite a bit, 20-80, but seemed to average in the 50s, matching what dd reported). Per disk this isn't much faster than the old array: 51 versus 43. With the bursts to 95% CPU, it looks like some of this could be CPU bound.

dd if=/tank/test.file of=/dev/null bs=1m
12582912000 bytes transferred in 32.911291 secs (382328118 bytes/sec)
around 55% CPU, mostly kernel then dd

As with the raidz2 read test above, I don't think we can calculate throughput this way; in any case it's actually slower per disk than the old array. 382/6 = ~64 Mbytes/sec (gstat seemed to be around 50, so I'm guessing the round-robin reading is producing the extra throughput).

#
# wrap up
#

So the normal vdev performs closest to raw drive speeds. Raidz1 is slower, and raidz2 slower still. This is observable both in the dd tests and in gstat. Any ideas why the raid numbers are slower? I've tried to account for the fact that the raid vdevs have fewer data disks. Would a faster CPU help here?

Unfortunately I've already migrated all of my data to the new array, so I can't run the full set of tests on it. It would have been nice to see whether a normal (non-raid) pool on these disks came close to their max speed of 120-130 Mbytes/sec (which would put total pool throughput close to 1 Gbyte/sec), the way the smaller array did relative to its max speed.

I noticed that scrubbing the big array is CPU bound: the kernel process sits at 99% while the scrub runs (total CPU is 50%, since the scrub isn't multithreaded). The disks run at around 45-50 Mbytes/sec in gstat. Scrubbing the smaller/slower array isn't CPU bound, and those disks run close to their max speed.

Thanks for your time,
Brad
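P.S. For anyone wanting to reproduce the scrub observation, it came from nothing more exotic than the stock tools, roughly:

zpool scrub tank    # start a scrub on the big array
zpool status tank   # reports scrub progress
top -S              # -S includes kernel processes; that's where the 99% shows up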