Date: Thu, 04 Jul 2013 13:41:02 +1000
From: Lawrence Stewart <lstewart@freebsd.org>
To: Outback Dingo <outbackdingo@gmail.com>
Cc: Kevin Oberman <rkoberman@gmail.com>, Steven Hartland <killing@multiplay.co.uk>, net@freebsd.org
Subject: Re: Terrible ix performance
Message-ID: <51D4EECE.4010808@freebsd.org>
In-Reply-To: <CAKYr3zyzj=AFcGu62Je3gkZy%2BQP1aDZanYTQp%2BjsMgoWWjrnWA@mail.gmail.com>
References: <CAKYr3zyV74DPLsJRuDoRiYsYdAXs=EoqJ6%2B_k4hJiSnwq5zhUQ@mail.gmail.com>
 <51D3E5BC.1000604@freebsd.org>
 <CAKYr3zyWzQsFOrQ-MrGTdTzJzhP1kXNac%2BHu8NXfC_J6YJcOsg@mail.gmail.com>
 <51D42976.9020206@freebsd.org>
 <CAKYr3zyFF%2BA-OHsEL7t6rdv6Jc4c2ByvvRhV-Fv%2BPXt9Y-sXwg@mail.gmail.com>
 <E97FF575ED6A405FB13872854191BF3B@multiplay.co.uk>
 <CAN6yY1vg=KAAaJhG0p8pO6vRwL%2BypHXUfV2Uth70DYNNy04-Uw@mail.gmail.com>
 <51D4D77B.60804@freebsd.org>
 <CAKYr3zyzj=AFcGu62Je3gkZy%2BQP1aDZanYTQp%2BjsMgoWWjrnWA@mail.gmail.com>
On 07/04/13 13:06, Outback Dingo wrote:
> On Wed, Jul 3, 2013 at 10:01 PM, Lawrence Stewart <lstewart@freebsd.org
> <mailto:lstewart@freebsd.org>> wrote:
>
> On 07/04/13 10:18, Kevin Oberman wrote:
> > On Wed, Jul 3, 2013 at 4:21 PM, Steven Hartland
> > <killing@multiplay.co.uk <mailto:killing@multiplay.co.uk>> wrote:
> [snip]
> >>
> >> Out of interest have you tried limiting the number of queues?
> >>
> >> If not give it a try see if it helps, add the following to
> >> /boot/loader.conf:
> >> hw.ixgbe.num_queues=1
> >>
> >> If nothing else will give you another data point.
>
> As noted in my first post to this thread, if iperf is able to push a
> single flow at 8Gbps, then the NIC is unlikely to be the source of the
> problem and trying to tune it is a waste of time (at least at this
> stage).
>
> iperf tests memory-network-memory transfer speed without any disk
> involvement, so the fact that it can get 8Gbps and ftp is getting around
> 4Gbps implies that either the iperf TCP tuning is better (only likely to
> be relevant if the RTT is very large - Outback Dingo you still haven't
> provided us with the RTT) or the disk subsystem at one or both ends is
> slowing things down.
>
> Outback Dingo: can you please run another iperf test without the -w
> switch on both client and server to see if your send/receive window
> autotuning on both ends is working. If all is well, you should see the
> same results of ~8Gbps.
>
> >> You might also try SIFTR to analyze the behavior and perhaps even figure
> > out what the limiting factor might be.
> >
> > kldload siftr
> > See "Run-time Configuration" in the siftr(4) man page for details.
> >
> > I'm a little surprised Lawrence didn't already suggest this as he is one of
> > the authors. (The "Bugs" section is rather long and he might know that it
> > won't be useful in this case, but it has greatly helped me look at
> > performance issues.)
>
> siftr is useful if you suspect a TCP/netstack tuning issue. Given that
> iperf gets good results and the OP's tuning settings should be adequate
> to achieve good performance if the RTT is low (4MB
> sendbuf_max/recvbuf_max), I suspect the disk subsystem and/or VM is more
> likely to be the issue i.e. siftr data is probably irrelevant.
>
> Outback Dingo: Can you confirm you have appropriate tuning on both sides
> of the connection? You didn't specify if the loader.conf/sysctl.conf
> parameters you provided in the reply to Jack are only on one side of the
> connection or both.
>
>
> Yeah i concur, im starting to think the bottleneck is the zpool
>
>
> iperf -i 10 -t 20 -c 10.10.1.11 -l 2.5M
> ------------------------------------------------------------
> Client connecting to 10.10.1.11, TCP port 5001
> TCP window size: 257 KByte (default)
> ------------------------------------------------------------
> [ 3] local 10.10.1.178 port 47360 connected with 10.10.1.11 port 5001
> [ ID] Interval Transfer Bandwidth
> [ 3] 0.0-10.0 sec 9.61 GBytes 8.26 Gbits/sec
> [ 3] 10.0-20.0 sec 8.83 GBytes 7.58 Gbits/sec
> [ 3] 0.0-20.0 sec 18.4 GBytes 7.92 Gbits/sec
> nas4free: /testing # iperf -i 10 -t 20 -c 10.10.1.11 -l 2.5M
> ------------------------------------------------------------
> Client connecting to 10.10.1.11, TCP port 5001
> TCP window size: 257 KByte (default)
> ------------------------------------------------------------
> [ 3] local 10.10.1.178 port 37691 connected with 10.10.1.11 port 5001
> [ ID] Interval Transfer Bandwidth
> [ 3] 0.0-10.0 sec 5.29 GBytes 4.54 Gbits/sec
> [ 3] 10.0-20.0 sec 8.06 GBytes 6.93 Gbits/sec
> [ 3] 0.0-20.0 sec 13.4 GBytes 5.73 Gbits/sec
> nas4free: /testing # iperf -i 10 -t 20 -c 10.10.1.11 -l 2.5M
> ------------------------------------------------------------
> Client connecting to 10.10.1.11, TCP port 5001
> TCP window size: 257 KByte (default)
> ------------------------------------------------------------
> [ 3] local 10.10.1.178 port 17560 connected with 10.10.1.11 port 5001
> [ ID] Interval Transfer Bandwidth
> [ 3] 0.0-10.0 sec 9.48 GBytes 8.14 Gbits/sec
> [ 3] 10.0-20.0 sec 8.68 GBytes 7.46 Gbits/sec
> [ 3] 0.0-20.0 sec 18.2 GBytes 7.80 Gbits/sec
> nas4free: /testing # iperf -i 10 -t 20 -c 10.10.1.11 -l 2.5M
> ------------------------------------------------------------
> Client connecting to 10.10.1.11, TCP port 5001
> TCP window size: 257 KByte (default)
> ------------------------------------------------------------
> [ 3] local 10.10.1.178 port 14729 connected with 10.10.1.11 port 5001
> [ ID] Interval Transfer Bandwidth
> [ 3] 0.0-10.0 sec 7.81 GBytes 6.71 Gbits/sec
> [ 3] 10.0-20.0 sec 9.11 GBytes 7.82 Gbits/sec
> [ 3] 0.0-20.0 sec 16.9 GBytes 7.27 Gbits/sec

Ok. It does seem like your issue is VM/disk related rather than
network/protocol related in that case. Going forward, I suggest that you
test with FTP as you make tweaks in order to keep things as close to raw
TCP bulk transfer as possible but including the disks/VM i.e. don't use
NFS/SSH/CIFS to evaluate effectiveness of tuning tweaks.

> The current configuration on both boxes is
> kernel="kernel"
> bootfile="kernel"
> kernel_options=""
> kern.hz="20000"

Why such a high hz setting? I'd suggest lowering to 2000 on both
machines unless you have good reason for it to be so high.

> hw.est.msr_info="0"
> hw.hptrr.attach_generic="0"
> kern.maxfiles="65536"
> kern.maxfilesperproc="50000"
> kern.cam.boot_delay="8000"
> autoboot_delay="5"
> isboot_load="YES"
> zfs_load="YES"
> hw.ixgbe.enable_aim=0
>
> and
> cat /etc/sysctl.conf
> # Disable core dump
> kern.coredump=0
> # System tuning
> net.inet.tcp.delayed_ack=0
> # System tuning
> net.inet.tcp.rfc1323=1
> # System tuning
> net.inet.tcp.sendspace=262144
> # System tuning
> net.inet.tcp.recvspace=262144
> # System tuning
> net.inet.tcp.sendbuf_max=4194304
> # System tuning
> net.inet.tcp.sendbuf_inc=262144
> # System tuning
> net.inet.tcp.sendbuf_auto=1
> # System tuning
> net.inet.tcp.recvbuf_max=4194304
> # System tuning
> net.inet.tcp.recvbuf_inc=262144
> # System tuning
> net.inet.tcp.recvbuf_auto=1
> # System tuning
> net.inet.udp.recvspace=65536
> # System tuning
> net.inet.udp.maxdgram=57344
> # System tuning
> net.local.stream.recvspace=65536
> # System tuning
> net.local.stream.sendspace=65536
> # System tuning
> kern.ipc.maxsockbuf=16777216
> # System tuning
> kern.ipc.somaxconn=8192
> # System tuning
> kern.ipc.nmbclusters=262144
> # System tuning
> kern.ipc.nmbjumbop=262144
> # System tuning
> kern.ipc.nmbjumbo9=131072
> # System tuning
> kern.ipc.nmbjumbo16=65536
> # System tuning
> kern.maxfiles=65536
> # System tuning
> kern.maxfilesperproc=50000
> # System tuning
> net.inet.icmp.icmplim=300
> # System tuning
> net.inet.icmp.icmplim_output=1
> # System tuning
> net.inet.tcp.path_mtu_discovery=0
> # System tuning
> hw.intr_storm_threshold=9000

Your network-related tuning looks good to me.

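As a rough cross-check of the buffer limits quoted above, the
bandwidth-delay product gives the largest RTT at which a single TCP flow
can still keep the link full with a given socket buffer. A minimal
sketch in Python, assuming a 10 Gbit/s link; the function names are
illustrative only, and the actual RTT between the two boxes has not been
reported in this thread:

```python
def max_rtt_ms(buffer_bytes, link_bits_per_s):
    """Largest RTT (in ms) at which buffer_bytes can keep the link full."""
    return buffer_bytes * 8 / link_bits_per_s * 1000.0


def required_buffer_bytes(rtt_ms, link_bits_per_s):
    """Socket buffer (in bytes) needed to fill the link at the given RTT."""
    return link_bits_per_s * (rtt_ms / 1000.0) / 8


if __name__ == "__main__":
    link = 10e9            # assumed 10 Gbit/s link
    buf_max = 4194304      # net.inet.tcp.sendbuf_max / recvbuf_max above
    buf_default = 262144   # net.inet.tcp.sendspace / recvspace above

    print("max RTT with 4 MB buffer:   %.2f ms" % max_rtt_ms(buf_max, link))
    print("max RTT with 256 KB buffer: %.2f ms" % max_rtt_ms(buf_default, link))
```

Under that assumption, the 4 MB maximum covers RTTs up to roughly 3.4 ms
at full line rate, which fits the earlier point that these settings
should be adequate only if the RTT is low.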
> Box A is
> zpool status
> pool: testing
> state: ONLINE
> scan: none requested
> config:
>
> NAME       STATE   READ WRITE CKSUM
> testing    ONLINE  0 0 0
> da0.nop    ONLINE  0 0 0
> da1.nop    ONLINE  0 0 0
> da2.nop    ONLINE  0 0 0
> da3.nop    ONLINE  0 0 0
> da4.nop    ONLINE  0 0 0
> da5.nop    ONLINE  0 0 0
> da6.nop    ONLINE  0 0 0
> da7.nop    ONLINE  0 0 0
> da8.nop    ONLINE  0 0 0
> da9.nop    ONLINE  0 0 0
> da10.nop   ONLINE  0 0 0
> da11.nop   ONLINE  0 0 0
> da12.nop   ONLINE  0 0 0
> da13.nop   ONLINE  0 0 0
> da14.nop   ONLINE  0 0 0
> da15.nop   ONLINE  0 0 0
>
> fio --direct=1 --rw=randwrite --bs=4k --size=2G --numjobs=1 --runtime=60
> --group_reporting --name=randwrite
> fio: this platform does not support process shared mutexes, forcing use
> of threads. Use the 'thread' option to get rid of this warning.
> randwrite: (g=0): rw=randwrite, bs=4K-4K/4K-4K/4K-4K, ioengine=sync,
> iodepth=1
> fio-2.0.15
> Starting 1 process
> Jobs: 1 (f=1): [w] [100.0% done] [0K/150.9M/0K /s] [0 /38.7K/0 iops]
> [eta 00m:00s]
> randwrite: (groupid=0, jobs=1): err= 0: pid=101192: Wed Jul 3 23:01:09 2013
> write: io=2048.0MB, bw=147916KB/s, iops=36978 , runt= 14178msec
> clat (usec): min=9 , max=122101 , avg=24.17, stdev=229.23
> lat (usec): min=10 , max=122101 , avg=24.42, stdev=229.23
> clat percentiles (usec):
> | 1.00th=[ 11], 5.00th=[ 12], 10.00th=[ 14], 20.00th=[ 21],
> | 30.00th=[ 21], 40.00th=[ 22], 50.00th=[ 22], 60.00th=[ 23],
> | 70.00th=[ 23], 80.00th=[ 24], 90.00th=[ 29], 95.00th=[ 35],
> | 99.00th=[ 99], 99.50th=[ 114], 99.90th=[ 131], 99.95th=[ 137],
> | 99.99th=[ 181]
> bw (KB/s) : min=58200, max=223112, per=99.93%, avg=147815.61,
> stdev=31976.97
> lat (usec) : 10=0.01%, 20=15.49%, 50=82.15%, 100=1.39%, 250=0.96%
> lat (usec) : 500=0.01%, 750=0.01%, 1000=0.01%
> lat (msec) : 2=0.01%, 20=0.01%, 250=0.01%
> cpu : usr=11.05%, sys=87.08%, ctx=563, majf=0, minf=0
> IO depths : 1=100.0%, 2=0.0%, 4=0.0%, 8=0.0%, 16=0.0%, 32=0.0%,
>>=64=0.0%
> submit : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%,
>>=64=0.0%
> complete : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%,
>>=64=0.0%
> issued : total=r=0/w=524288/d=0, short=r=0/w=0/d=0
>
> Run status group 0 (all jobs):
> WRITE: io=2048.0MB, aggrb=147915KB/s, minb=147915KB/s,
> maxb=147915KB/s, mint=14178msec, maxt=14178msec
> fio --direct=1 --rw=randread --bs=4k --size=2G --numjobs=1 --runtime=60
> --group_reporting --name=randread
> fio: this platform does not support process shared mutexes, forcing use
> of threads. Use the 'thread' option to get rid of this warning.
> randread: (g=0): rw=randread, bs=4K-4K/4K-4K/4K-4K, ioengine=sync, iodepth=1
> fio-2.0.15
> Starting 1 process
> randread: Laying out IO file(s) (1 file(s) / 2048MB)
> Jobs: 1 (f=1): [r] [100.0% done] [292.9M/0K/0K /s] [74.1K/0 /0 iops]
> [eta 00m:00s]
> randread: (groupid=0, jobs=1): err= 0: pid=101304: Wed Jul 3 23:02:08 2013
> read : io=2048.0MB, bw=327578KB/s, iops=81894 , runt= 6402msec
> clat (usec): min=4 , max=20418 , avg=10.15, stdev=28.54
> lat (usec): min=4 , max=20418 , avg=10.27, stdev=28.54
> clat percentiles (usec):
> | 1.00th=[ 5], 5.00th=[ 6], 10.00th=[ 6], 20.00th=[ 8],
> | 30.00th=[ 10], 40.00th=[ 10], 50.00th=[ 10], 60.00th=[ 11],
> | 70.00th=[ 11], 80.00th=[ 11], 90.00th=[ 12], 95.00th=[ 13],
> | 99.00th=[ 22], 99.50th=[ 31], 99.90th=[ 77], 99.95th=[ 95],
> | 99.99th=[ 145]
> bw (KB/s) : min=290024, max=520016, per=100.00%, avg=328490.00,
> stdev=63941.66
> lat (usec) : 10=28.85%, 20=69.83%, 50=1.19%, 100=0.09%, 250=0.05%
> lat (msec) : 50=0.01%
> cpu : usr=18.08%, sys=81.57%, ctx=144, majf=0, minf=1
> IO depths : 1=100.0%, 2=0.0%, 4=0.0%, 8=0.0%, 16=0.0%, 32=0.0%,
>>=64=0.0%
> submit : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%,
>>=64=0.0%
> complete : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%,
>>=64=0.0%
> issued : total=r=524288/w=0/d=0, short=r=0/w=0/d=0
>
> Run status group 0 (all jobs):
> READ: io=2048.0MB, aggrb=327577KB/s, minb=327577KB/s,
> maxb=327577KB/s, mint=6402msec, maxt=6402msec
>
>
> Box B
> zpool status
> pool: backup
> state: ONLINE
> scan: none requested
> config:
>
> NAME        STATE   READ WRITE CKSUM
> backup      ONLINE  0 0 0
> mfid0.nop   ONLINE  0 0 0
> mfid1.nop   ONLINE  0 0 0
> mfid2.nop   ONLINE  0 0 0
> mfid3.nop   ONLINE  0 0 0
> mfid4.nop   ONLINE  0 0 0
> mfid5.nop   ONLINE  0 0 0
> mfid6.nop   ONLINE  0 0 0
> mfid7.nop   ONLINE  0 0 0
> mfid8.nop   ONLINE  0 0 0
> mfid9.nop   ONLINE  0 0 0
> mfid10.nop  ONLINE  0 0 0
> mfid11.nop  ONLINE  0 0 0
> mfid12.nop  ONLINE  0 0 0
> mfid13.nop  ONLINE  0 0 0
> mfid14.nop  ONLINE  0 0 0
> mfid15.nop  ONLINE  0 0 0
> mfid16.nop  ONLINE  0 0 0
> mfid17.nop  ONLINE  0 0 0
> mfid18.nop  ONLINE  0 0 0
> mfid19.nop  ONLINE  0 0 0
> mfid20.nop  ONLINE  0 0 0
> mfid21.nop  ONLINE  0 0 0
> mfid22.nop  ONLINE  0 0 0
> mfid23.nop  ONLINE  0 0 0
>
>
>
> fio --direct=1 --rw=randwrite --bs=4k --size=2G --numjobs=1 --runtime=60
> --group_reporting --name=randwrite
> fio: this platform does not support process shared mutexes, forcing use
> of threads. Use the 'thread' option to get rid of this warning.
> randwrite: (g=0): rw=randwrite, bs=4K-4K/4K-4K/4K-4K, ioengine=sync,
> iodepth=1
> fio-2.0.15
> Starting 1 process
> Jobs: 1 (f=1): [w] [100.0% done] [0K/1948K/0K /s] [0 /487 /0 iops] [eta
> 00m:00s]
> randwrite: (groupid=0, jobs=1): err= 0: pid=101023: Thu Jul 4 03:03:05 2013
> write: io=65592KB, bw=1093.2KB/s, iops=273 , runt= 60002msec
> clat (usec): min=9 , max=157723 , avg=3654.65, stdev=5733.27
> lat (usec): min=9 , max=157724 , avg=3654.98, stdev=5733.29
> clat percentiles (usec):
> | 1.00th=[ 12], 5.00th=[ 13], 10.00th=[ 18], 20.00th=[ 23],
> | 30.00th=[ 25], 40.00th=[ 740], 50.00th=[ 756], 60.00th=[ 4048],
> | 70.00th=[ 5856], 80.00th=[ 7648], 90.00th=[ 9408], 95.00th=[10304],
> | 99.00th=[11584], 99.50th=[19072], 99.90th=[96768], 99.95th=[117248],
> | 99.99th=[140288]
> bw (KB/s) : min= 174, max= 2184, per=99.67%, avg=1089.37, stdev=392.80
> lat (usec) : 10=0.21%, 20=11.34%, 50=25.24%, 100=0.04%, 750=9.51%
> lat (usec) : 1000=5.17%
> lat (msec) : 2=0.30%, 4=7.89%, 10=33.89%, 20=5.99%, 50=0.28%
> lat (msec) : 100=0.05%, 250=0.10%
> cpu : usr=0.16%, sys=1.01%, ctx=10488, majf=0, minf=0
> IO depths : 1=100.0%, 2=0.0%, 4=0.0%, 8=0.0%, 16=0.0%, 32=0.0%,
>>=64=0.0%
> submit : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%,
>>=64=0.0%
> complete : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%,
>>=64=0.0%
> issued : total=r=0/w=16398/d=0, short=r=0/w=0/d=0
>
> Run status group 0 (all jobs):
> WRITE: io=65592KB, aggrb=1093KB/s, minb=1093KB/s, maxb=1093KB/s,
> mint=60002msec, maxt=60002msec
>
> fio --direct=1 --rw=randread --bs=4k --size=2G --numjobs=1 --runtime=60
> --group_reporting --name=randread
> fio: this platform does not support process shared mutexes, forcing use
> of threads. Use the 'thread' option to get rid of this warning.
> randread: (g=0): rw=randread, bs=4K-4K/4K-4K/4K-4K, ioengine=sync, iodepth=1
> fio-2.0.15
> Starting 1 process
> randread: Laying out IO file(s) (1 file(s) / 2048MB)
> Jobs: 1 (f=1): [r] [-.-% done] [608.5M/0K/0K /s] [156K/0 /0 iops] [eta
> 00m:00s]
> randread: (groupid=0, jobs=1): err= 0: pid=101025: Thu Jul 4 03:04:35 2013
> read : io=2048.0MB, bw=637045KB/s, iops=159261 , runt= 3292msec
> clat (usec): min=3 , max=83 , avg= 5.25, stdev= 1.39
> lat (usec): min=3 , max=83 , avg= 5.32, stdev= 1.39
> clat percentiles (usec):
> | 1.00th=[ 4], 5.00th=[ 4], 10.00th=[ 5], 20.00th=[ 5],
> | 30.00th=[ 5], 40.00th=[ 5], 50.00th=[ 5], 60.00th=[ 5],
> | 70.00th=[ 5], 80.00th=[ 6], 90.00th=[ 6], 95.00th=[ 6],
> | 99.00th=[ 10], 99.50th=[ 14], 99.90th=[ 22], 99.95th=[ 25],
> | 99.99th=[ 45]
> bw (KB/s) : min=621928, max=644736, per=99.72%, avg=635281.33,
> stdev=10139.68
> lat (usec) : 4=0.05%, 10=98.94%, 20=0.86%, 50=0.14%, 100=0.01%
> cpu : usr=14.83%, sys=85.14%, ctx=60, majf=0, minf=1
> IO depths : 1=100.0%, 2=0.0%, 4=0.0%, 8=0.0%, 16=0.0%, 32=0.0%,
>>=64=0.0%
> submit : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%,
>>=64=0.0%
> complete : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%,
>>=64=0.0%
> issued : total=r=524288/w=0/d=0, short=r=0/w=0/d=0
>
> Run status group 0 (all jobs):
> READ: io=2048.0MB, aggrb=637044KB/s, minb=637044KB/s,
> maxb=637044KB/s, mint=3292msec, maxt=3292msec

So if I interpret the above correctly, Box A can crank ~140MB/s random
write and ~300MB/s random read and Box B cranks ~1MB/s random write and
630MB/s random read?

A few thoughts:

- What's up with Box B's 1MB/s write bandwidth? I'm guessing something
fired up at the same time as your IO test and killed your random write
throughput.

- Random read/write is not really a useful test here as ftp is
effectively a sequential streaming read/write workload. The random
read/write throughput is irrelevant.

- I recall some advice that zpools should not have more than about 8 or
10 disks in them, and you should instead create multiple zpools if you
have more disks. Perhaps investigate the source of that rumour and if
it's true, try creating 2 x 8 disk zpools in Box A and 3 x 8 disk zpools
in Box B and see if that changes things at all.

Cheers,
Lawrence
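
To measure the sequential streaming case mentioned above rather than
random I/O, something along the following lines could be run against a
file on each pool. This is only a sketch under stated assumptions, not a
replacement for fio or dd: the function name, block size and default
file size are illustrative, the data is random so that any pool
compression does not inflate the write figure, and the read-back figure
will likely be served largely from cache unless the file is much larger
than RAM.

```python
import os
import sys
import time


def sequential_write_read(path, total_bytes=256 * 1024 * 1024,
                          block_size=1024 * 1024):
    """Write then read a file sequentially; return (write_MBps, read_MBps)."""
    block = os.urandom(block_size)        # incompressible payload
    blocks = total_bytes // block_size    # whole blocks only

    start = time.perf_counter()
    with open(path, "wb") as f:
        for _ in range(blocks):
            f.write(block)
        f.flush()
        os.fsync(f.fileno())              # include time to reach stable storage
    write_s = max(time.perf_counter() - start, 1e-9)

    start = time.perf_counter()
    with open(path, "rb") as f:
        while f.read(block_size):         # stream through the whole file
            pass
    read_s = max(time.perf_counter() - start, 1e-9)

    written = blocks * block_size
    return written / write_s / 1e6, written / read_s / 1e6


if __name__ == "__main__":
    # Usage: python seqtest.py /path/on/pool/testfile
    target = sys.argv[1]
    w, r = sequential_write_read(target)
    print("sequential write: %.1f MB/s, sequential read: %.1f MB/s" % (w, r))
    os.remove(target)
```

Comparing the sequential write figure on the receiving box with the
~4Gbps (roughly 500MB/s) seen over ftp would give a first indication of
whether the pool itself is the limit.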