Date: Fri, 24 Jan 2014 06:18:20 -0500
From: J David <j.david.lists@gmail.com>
To: Rick Macklem <rmacklem@uoguelph.ca>
Cc: freebsd-net@freebsd.org
Subject: Re: Terrible NFS performance under 9.2-RELEASE?
Message-ID: <CABXB=RT8rPLHuSVkMpne9UirNc2NouT3ki02CKmYXd+F15i-qA@mail.gmail.com>
In-Reply-To: <CABXB=RSp2qkaNubuqmLFRiB2gaZK6oizmJy+mdCSh5xJ25UhTQ@mail.gmail.com>
References: <CABXB=RR5NgRpqXA8XQDks3HhxC6QQW3aG+sH4BUxFL7kbFuJ_g@mail.gmail.com>
 <58591523.15519962.1390533482068.JavaMail.root@uoguelph.ca>
 <CABXB=RRP=N5_sOQrhov_3WVHd3awRjT4-oqd-BUrKgmPPk6fpg@mail.gmail.com>
 <CABXB=RSp2qkaNubuqmLFRiB2gaZK6oizmJy+mdCSh5xJ25UhTQ@mail.gmail.com>
INTRODUCTION

While researching NFS performance problems in our environment, I ran a
series of four iozone tests twice, once with rsize=4096,wsize=4096 and
again with rsize=8192,wsize=8192.  That produced additional avenues of
inquiry leading to the identification of two separate potential problems
with NFS performance.  This message discusses the first problem:
apparently extraneous fixed-size reads before writes.

In order to maximize network throughput (>16Gbit/sec) and minimize
network latency (<0.1ms), these tests were performed using two FreeBSD
guests on the same KVM host node with bridged virtio adapters having TSO
and LRO enabled.  To remove local or network-backed virtual disks as a
bottleneck, a 2GiB memory-backed-UFS-over-NFS filesystem served as the
NFS export from the server to the client.

BENCHMARK DATA

Here is the iozone run against the 4k NFS mount:

$ iozone -e -I -s 1g -r 4k -i 0 -i 2
        Iozone: Performance Test of File I/O
                Version $Revision: 3.420 $
[...]
        Include fsync in write timing
        File size set to 1048576 KB
        Record Size 4 KB
        Command line used: iozone -e -s 1g -r 4k -i 0 -i 2
        Output is in Kbytes/sec
        Time Resolution = 0.000005 seconds.
        Processor cache size set to 1024 Kbytes.
        Processor cache line size set to 32 bytes.
        File stride size set to 17 * record size.
                                                            random  random
              KB  reclen   write rewrite    read    reread    read   write
         1048576       4   71167   73714                     27525   46349

iozone test complete.
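Before digging into the per-test numbers, it helps to fix the expected
operation counts in mind.  A quick sketch of the arithmetic (variable
names are mine, mirroring the iozone flags and mount options; nothing
here is measured):

```python
# Expected operation counts for the 1 GiB / 4 KiB iozone runs below.
file_size = 1 << 30            # -s 1g
record = 4096                  # -r 4k
app_ops = file_size // record  # write()/read() calls iozone issues per test

# With wsize=4096 each 4 KiB write maps to exactly one NFS WRITE;
# with wsize=8192 two sequential 4 KiB writes can coalesce into one.
nfs_writes_4k = app_ops
nfs_writes_8k = app_ops // 2

print(app_ops, nfs_writes_4k, nfs_writes_8k)  # 262144 262144 131072
```

These are the baselines the nfsstat counts are compared against in the
sections that follow.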
Now, here are the read and write columns from nfsstat -s -w 1 over the
entire run:

 Read   Write
    0       0
    0       0
    0       0
    0   11791   <-- sequential write test begins
    0   17215
    0   18260
    0   17968
    0   17810
    0   17839
    0   17959
    0   17912
    0   18180
    0   18285
    0   18636
    0   18554
    0   18178
    0   17361
    0   17803   <-- sequential rewrite test begins
    0   18358
    0   18188
    0   18817
    0   17757
    0   18153
    0   18924
    0   19444
    0   18775
    0   18995
    0   18198
    0   18949
    0   17978
    0   18879
    0   19055
 7904      67   <-- random read test begins
 7502       0
 7194       0
 7432       0
 6995       0
 6844       0
 6730       0
 6761       0
 7011       0
 7058       0
 7477       0
 7139       0
 6793       0
 7047       0
 6402       0
 6621       0
 7111       0
 6911       0
 7413       0
 7431       0
 7047       0
 7002       0
 7104       0
 6987       0
 6849       0
 6580       0
 6268       0
 6868       0
 6775       0
 6335       0
 6588       0
 6595       0
 6587       0
 6512       0
 6861       0
 6953       0
 7273       0
 5184    1688   <-- random write test begins
    0   11795
    0   11915
    0   11916
    0   11838
    0   12035
    0   11408
    0   11780
    0   11488
    0   11836
    0   11787
    0   11824
    0   12099
    0   11863
    0   12154
    0   11127
    0   11434
    0   11815
    0   11960
    0   11510
    0   11623
    0   11714
    0   11896
    0    1637   <-- test finished
    0       0
    0       0
    0       0

This looks exactly like you would expect.

Now, we re-run the exact same test with rsize/wsize set to 8k:

$ iozone -e -I -s 1g -r 4k -i 0 -i 2
        Iozone: Performance Test of File I/O
                Version $Revision: 3.420 $
                Compiled for 64 bit mode.
                Build: freebsd

        Contributors:William Norcott, Don Capps, Isom Crawford, Kirby Collins
                     Al Slater, Scott Rhine, Mike Wisner, Ken Goss
                     Steve Landherr, Brad Smith, Mark Kelly, Dr. Alain CYR,
                     Randy Dunlap, Mark Montague, Dan Million, Gavin Brebner,
                     Jean-Marc Zucconi, Jeff Blomberg, Benny Halevy, Dave Boone,
                     Erik Habbinga, Kris Strecker, Walter Wong, Joshua Root,
                     Fabrice Bacchella, Zhenghua Xue, Qin Li, Darren Sawyer,
                     Vangel Bojaxhi, Ben England, Vikentsi Lapa.

        Run began: Fri Jan 24 07:45:06 2014

        Include fsync in write timing
        File size set to 1048576 KB
        Record Size 4 KB
        Command line used: iozone -e -s 1g -r 4k -i 0 -i 2
        Output is in Kbytes/sec
        Time Resolution = 0.000005 seconds.
        Processor cache size set to 1024 Kbytes.
        Processor cache line size set to 32 bytes.
        File stride size set to 17 * record size.
                                                            random  random
              KB  reclen   write rewrite    read    reread    read   write
         1048576       4  124820   34850                     35996   16258

iozone test complete.

And here are the second-by-second counts of corresponding read/write
operations from nfsstat:

 Read   Write
    0       0
    0       0
    0       0
   11   10200   <-- sequential write test begins
   13   16109
   10   15360
   12   16055
    4   15623
    8   15533
    9   16241
   11   16045
 1891   11874   <-- sequential rewrite test begins
 4508    4508
 4476    4476
 4353    4353
 4350    4351
 4230    4229
 4422    4422
 4731    4731
 4600    4600
 4550    4550
 4314    4314
 4273    4273
 4422    4422
 4284    4284
 4609    4609
 4600    4600
 4362    4361
 4386    4387
 4451    4451
 4467    4467
 4370    4370
 4394    4394
 4252    4252
 4155    4155
 4427    4428
 4436    4435
 4300    4300
 3989    3989
 4438    4438
 4632    4632
 6440    1836   <-- random read test begins
 6681       0
 6416       0
 6290       0
 6560       0
 6569       0
 6237       0
 6331       0
 6403       0
 6310       0
 6010       0
 6023       0
 6271       0
 6068       0
 6044       0
 6003       0
 5957       0
 5683       0
 5833       0
 5774       0
 5708       0
 5810       0
 5829       0
 5787       0
 5791       0
 5838       0
 5922       0
 6002       0
 6191       0
 4162    2624   <-- random write test begins
 3998    4068
 4056    4100
 3929    3969
 4164    4128
 4188    4173
 4256    4164
 3899    4068
 4018    3933
 4077    4114
 4210    4222
 4050    4116
 4157    3983
 4066    4122
 4129    4162
 4303    4199
 4235    4412
 4100    3932
 4056    4254
 4192    4186
 4204    4120
 4085    4138
 4103    4146
 4272    4193
 3943    4028
 4161    4163
 4153    4131
 4218    4017
 4132    4233
 4195    4101
 4041    4145
 3994    4086
 4297    4193
 4205    4341
 4101    4103
 4134    3962
 4297    4408
 4282    4242
 4180    4175
 4286    4216
 4217    4397
 4364    4253
 3673    3720
 3843    3807
 4147    4183
 4171    4181
 4280    4224
 4126    4158
 3977    4074
 4146    3919
 4147    4362
 4079    4060
 3755    3760
 4157    4130
 4087    4109
 4006    3873
 3860    3967
 3982    4048
 4146    3963
 4188    4203
 4040    4063
 3976    4046
 3859    3815
 4114    4193
  393     447   <-- test finished
    0       0
    0       0
    0       0

To go through the tests individually...
SEQUENTIAL WRITE TEST

The first test is sequential write; the ktrace for this test looks like
this:

  3648 iozone  CALL  write(0x3,0x801500000,0x1000)
  3648 iozone  GIO   fd 3 wrote 4096 bytes
  3648 iozone  RET   write 4096/0x1000
  3648 iozone  CALL  write(0x3,0x801500000,0x1000)
  3648 iozone  GIO   fd 3 wrote 4096 bytes
  3648 iozone  RET   write 4096/0x1000

It's just writing 4k in a loop 262,144 times.

This test is substantially faster with wsize=8192 than wsize=4096
(124,820 KiB/sec at 8k vs 71,167 KiB/sec at 4k).  We see from the
nfsstat output that this is because writes are being coalesced: there
are half as many writes in the 8k test as in the 4k test (8k = 131,072,
4k = 262,144).  So far so good.

SEQUENTIAL REWRITE TEST

But when we move on to the rewrite test, there's trouble in River City.
Here's what it looks like in ktrace:

  3648 iozone  CALL  write(0x3,0x801500000,0x1000)
  3648 iozone  GIO   fd 3 wrote 4096 bytes
  3648 iozone  RET   write 4096/0x1000
  3648 iozone  CALL  write(0x3,0x801500000,0x1000)
  3648 iozone  GIO   fd 3 wrote 4096 bytes
  3648 iozone  RET   write 4096/0x1000

As with the write test, it just writes 4k in a loop 262,144 times.  Yet
despite being exactly the same sequence of calls, this test is less than
half as fast with 8k NFS as with 4k NFS (34,850 KiB/sec at 8k vs 73,714
KiB/sec at 4k).

The only difference between this test and the previous one is that now
the file exists.  And although the 4k NFS client writes it the same way
as in the first test and therefore gets almost exactly the same speed,
the 8k NFS client is reading and writing in roughly equal proportion,
and drops by a factor of 3.5x from its own performance on the first
write test.

Note that the total number of writes/sec on the 4k NFS rewrite test
hovers around 18,500.  On the 8k NFS test, the reads and writes are
roughly balanced at around 4,242 each.  So the interleaved reads and
writes performed by the NFS client cut the total IOPS available by more
than 50%, twice, leaving less than a quarter available for the
application.
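For anyone who wants to poke at this without iozone, the rewrite case is
trivial to reproduce; here is a minimal sketch (in Python rather than C,
with the path and loop count as placeholders) of the loop iozone is
effectively running:

```python
import os

# Minimal reproducer sketch for iozone's rewrite pattern: overwrite an
# existing file with sequential 4 KiB writes.  Run against a file on an
# NFS mount with wsize=8192 to see the extra reads in nfsstat.
BLOCK = 4096

def rewrite(path, count):
    buf = b"\x00" * BLOCK
    fd = os.open(path, os.O_WRONLY)   # file already exists: the rewrite case
    try:
        for _ in range(count):
            os.write(fd, buf)          # same 4 KiB write, repeated
        os.fsync(fd)                   # iozone -e folds fsync into the timing
    finally:
        os.close(fd)
```

For the full-size test, count would be 262,144 (1 GiB of 4 KiB writes).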
And these numbers roughly correlate to the observed speeds: 4,242 * 8 KiB
= 33,936 KiB, 18,500 * 4 KiB = 74,000 KiB.

A total of 131,072 NFS writes are issued during the 8k NFS test, for a
maximum possible write of 1 GiB.  Iozone is writing a total of 1 GiB
(262,144 x 4 KiB).  So none of the read data is being used.

So those extraneous reads appear to be the source of the performance
drop.

RANDOM READ TEST

Considering the random read test, things look better again.  Here's what
it looks like in ktrace:

  3947 iozone  CALL  lseek(0x3,0x34caa000,SEEK_SET)
  3947 iozone  RET   lseek 885694464/0x34caa000
  3947 iozone  CALL  read(0x3,0x801500000,0x1000)
  3947 iozone  GIO   fd 3 read 4096 bytes
  3947 iozone  RET   read 4096/0x1000
  3947 iozone  CALL  lseek(0x3,0x3aa53000,SEEK_SET)
  3947 iozone  RET   lseek 983904256/0x3aa53000
  3947 iozone  CALL  read(0x3,0x801500000,0x1000)
  3947 iozone  GIO   fd 3 read 4096 bytes
  3947 iozone  RET   read 4096/0x1000

This test lseeks to a random location (on a 4k boundary), then reads 4k,
in a loop, 262,144 times.

On this test, the 8k NFS moderately outperforms the 4k NFS (35,996
KiB/sec at 8k vs 27,525 KiB/sec at 4k).  This is probably attributable
to lucky adjacent requests, which are visible in the total number of
reads.  4k NFS produces the expected 262,144 NFS read ops, but 8k NFS
performs only about 176,000, an average of about 6k of useful data per
read, suggesting it was doing 8k reads of which the whole 8k was useful
about 50% of the time.

RANDOM WRITE TEST

Finally, the random write test.
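The "about 6k per read" and "about 50%" figures fall out of the op
counts; here is the back-of-the-envelope arithmetic (the 176,000 is the
approximate observed count, so these are estimates, not exact values):

```python
# Rough arithmetic behind the random-read estimates above.
app_reads = 262144                      # 4 KiB reads issued by iozone
app_bytes = app_reads * 4096            # 1 GiB of application data
nfs_reads_8k = 176000                   # approx. wire reads at rsize=8192

avg_useful = app_bytes / nfs_reads_8k   # useful bytes per 8 KiB wire read
coalesced = app_reads - nfs_reads_8k    # app reads served by an earlier fetch
frac_fully_used = coalesced / nfs_reads_8k

print(round(avg_useful), round(frac_fully_used, 2))
```

So roughly 6,100 of every 8,192 bytes fetched were useful, and close to
half the wire reads had both halves consumed.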
Here's the ktrace output from this one:

  3262 iozone  CALL  lseek(0x3,0x29665000,SEEK_SET)
  3262 iozone  RET   lseek 694571008/0x29665000
  3262 iozone  CALL  write(0x3,0x801500000,0x1000)
  3262 iozone  GIO   fd 3 wrote 4096 bytes
  3262 iozone  RET   write 4096/0x1000
  3262 iozone  CALL  lseek(0x3,0x486c000,SEEK_SET)
  3262 iozone  RET   lseek 75939840/0x486c000
  3262 iozone  CALL  write(0x3,0x801500000,0x1000)
  3262 iozone  GIO   fd 3 wrote 4096 bytes
  3262 iozone  RET   write 4096/0x1000
  3262 iozone  CALL  lseek(0x3,0x33310000,SEEK_SET)
  3262 iozone  RET   lseek 858849280/0x33310000
  3262 iozone  CALL  write(0x3,0x801500000,0x1000)
  3262 iozone  GIO   fd 3 wrote 4096 bytes
  3262 iozone  RET   write 4096/0x1000
  3262 iozone  CALL  lseek(0x3,0x2ae3000,SEEK_SET)

That's all it is: lseek() then write(), repeated 262,144 times.  No
reads, and again, all the lseeks are on 4k boundaries.

This test is dramatically slower with the 8k NFS, falling from 46,349
KiB/sec at 4k to 16,258 KiB/sec at 8k.  Like the rewrite test, the
nfsstat output shows an even mix of reads and writes during the 8k NFS
run of this test, even though no reads are ever issued by the
application.

Unlike the 8k NFS rewrite test, which performs 131,072 NFS write
operations, the 8k NFS random write test performs 262,025 write
operations.  This strongly suggests it was converting each 4 KiB write
into an 8 KiB read-modify-write cycle, with the commensurate performance
penalty in almost every case, with a handful of writes (about 119) lucky
enough to be adjacent and coalesce.

Since the NFS server is supposed to be able to handle arbitrary-length
writes at arbitrary file offsets, this appears to be unnecessary.  To
confirm that this is the case, and to determine whether it is doing
4 KiB or 8 KiB writes, we will have to go to the packets.
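The behaviour the op counts suggest can be written down as a tiny model.
To be clear, this is my reconstruction of what the client appears to be
doing, not actual kernel code, and the function name is made up:

```python
RSIZE = WSIZE = 8192  # the 8k mount options

def client_write_ops(off, length):
    """Model of the NFS ops the 8k client appears to generate for one
    application write(); a hypothesis, not the real client logic."""
    ops = []
    if length < WSIZE:                      # partial buffer-cache block:
        block = off - (off % RSIZE)         # fetch the enclosing rsize block
        ops.append(("read", block, RSIZE))  # ...even though the data is unused
    ops.append(("write", off, length))      # then write just the 4 KiB anyway
    return ops

# Reproduces the pairing seen in the packet sample that follows:
# read 8192 bytes @ 606494720, then write 4096 bytes @ 606498816.
print(client_write_ops(606498816, 4096))
```

If the model is right, every misaligned-or-partial 4 KiB write costs one
full rsize read on the wire, which is exactly what nfsstat shows.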
Here is a sample from the random write test:

09:31:52.338749 IP 172.20.20.169.1877003014 > 172.20.20.162.2049: 128 read fh 1326,127488/4 8192 bytes @ 606494720
09:31:52.338877 IP 172.20.20.162.2049 > 172.20.20.169.1877003014: reply ok 8320 read
09:31:52.338944 IP 172.20.20.169.1877003015 > 172.20.20.162.2049: 4232 write fh 1326,127488/4 4096 (4096) bytes @ 606498816
09:31:52.338964 IP 172.20.20.169.1877003016 > 172.20.20.162.2049: 128 read fh 1326,127488/4 8192 bytes @ 85237760
09:31:52.339037 IP 172.20.20.162.2049 > 172.20.20.169.719: Flags [.], ack 4008446365, win 29118, options [nop,nop,TS val 4241448191 ecr 17929512], length 0
09:31:52.339076 IP 172.20.20.162.2049 > 172.20.20.169.1877003015: reply ok 160 write [|nfs]
09:31:52.339117 IP 172.20.20.162.2049 > 172.20.20.169.1877003016: reply ok 8320 read
09:31:52.339142 IP 172.20.20.169.719 > 172.20.20.162.2049: Flags [.], ack 8488, win 515, options [nop,nop,TS val 17929512 ecr 4241448191], length 0
09:31:52.339183 IP 172.20.20.169.1877003017 > 172.20.20.162.2049: 4232 write fh 1326,127488/4 4096 (4096) bytes @ 85241856
09:31:52.339201 IP 172.20.20.169.1877003018 > 172.20.20.162.2049: 128 read fh 1326,127488/4 8192 bytes @ 100843520
09:31:52.339271 IP 172.20.20.162.2049 > 172.20.20.169.719: Flags [.], ack 4369, win 29118, options [nop,nop,TS val 4241448191 ecr 17929512], length 0
09:31:52.339310 IP 172.20.20.162.2049 > 172.20.20.169.1877003017: reply ok 160 write [|nfs]
09:31:52.339332 IP 172.20.20.162.2049 > 172.20.20.169.1877003018: reply ok 8320 read
09:31:52.339355 IP 172.20.20.169.719 > 172.20.20.162.2049: Flags [.], ack 16976, win 515, options [nop,nop,TS val 17929512 ecr 4241448191], length 0
09:31:52.339408 IP 172.20.20.169.1877003019 > 172.20.20.162.2049: 128 read fh 1326,127488/4 8192 bytes @ 330153984
09:31:52.339514 IP 172.20.20.162.2049 > 172.20.20.169.1877003019: reply ok 8320 read
09:31:52.339562 IP 172.20.20.169.1877003020 > 172.20.20.162.2049: 128 read fh 1326,127488/4 8192 bytes @ 500056064
09:31:52.339669 IP 172.20.20.162.2049 > 172.20.20.169.1877003020: reply ok 8320 read
09:31:52.339728 IP 172.20.20.169.1877003022 > 172.20.20.162.2049: 128 read fh 1326,127488/4 8192 bytes @ 778158080
09:31:52.339758 IP 172.20.20.169.1877003021 > 172.20.20.162.2049: 4232 write fh 1326,127488/4 4096 (4096) bytes @ 500060160
09:31:52.339834 IP 172.20.20.162.2049 > 172.20.20.169.719: Flags [.], ack 9001, win 29118, options [nop,nop,TS val 4241448191 ecr 17929512], length 0

This confirms that the NFS client is indeed performing a
read-before-write cycle.  However, the packets also show that the writes
are 4 KiB, not 8 KiB, so no modify is occurring.  E.g. it reads 8 KiB at
offset 606494720 and then immediately writes the 4 KiB from the
application at offset 606498816 (which is 606494720+4096).  It's not
clear what the purpose of the read is in this scenario.

FOLLOW-UP TESTS

To help determine which direction to go next, I conducted two follow-up
tests.

First, I re-ran the test with the default 64k rsize/wsize values, in an
effort to confirm that it would do 64k reads followed by 4k writes.  It
was necessary to do this with a much smaller file size due to the poor
performance, and with net.inet.tcp.delacktime=10 on both client and
server to work around a separate issue with TCP writes >64 KiB, but the
results are sufficient for this purpose:

$ iozone -e -s 32m -r 4k -i 0 -i 2
        Iozone: Performance Test of File I/O
                Version $Revision: 3.420 $
[...]
        Run began: Fri Jan 24 10:35:14 2014

        Include fsync in write timing
        File size set to 32768 KB
        Record Size 4 KB
        Command line used: iozone -e -s 32m -r 4k -i 0 -i 2
        Output is in Kbytes/sec
        Time Resolution = 0.000005 seconds.
        Processor cache size set to 1024 Kbytes.
        Processor cache line size set to 32 bytes.
        File stride size set to 17 * record size.
                                                            random  random    bkwd   record   stride
              KB  reclen   write rewrite    read    reread    read   write    read  rewrite     read   fwrite frewrite   fread  freread
           32768       4   12678    5292                      5355     741

iozone test complete.

Although the use of TCP writes >64 KiB largely mangles tcpdump's ability
to interpret the NFS messages due to fragmentation, it was possible to
grep out enough to see that the suspected 64 KiB read / 4 KiB write
pairs are indeed occurring:

10:36:11.356519 IP 172.20.20.169.1874848588 > 172.20.20.162.2049: 128 read fh 1326,127488/5 65536 bytes @ 18087936
10:36:11.496905 IP 172.20.20.169.1874848606 > 172.20.20.162.2049: 4232 write fh 1326,127488/5 4096 (4096) bytes @ 18096128

As before, the 64 KiB read appears not to be used.  The result is that
roughly 94% of the network I/O spent on a 4 KiB write is wasted (65,536
of the 69,632 payload bytes moved), with commensurate impact on
performance.

Second, in an attempt to verify the theory that the data being read is
not necessary, the 8k NFS test was retried with a Debian Linux guest.
The hypothesis was that if the Debian Linux client did not perform the
reads, that would indicate that they were not necessary.

The results appeared to support that hypothesis.  Not only did the Linux
client handily outperform the FreeBSD client on all four tests:

# iozone -e -s 1g -r 4k -i 0 -i 2
        Iozone: Performance Test of File I/O
                Version $Revision: 3.397 $
                Compiled for 64 bit mode.
                Build: linux-AMD64
[...]
        Include fsync in write timing
        File size set to 1048576 KB
        Record Size 4 KB
        Command line used: iozone -e -s 1g -r 4k -i 0 -i 2
        Output is in Kbytes/sec
        Time Resolution = 0.000001 seconds.
        Processor cache size set to 1024 Kbytes.
        Processor cache line size set to 32 bytes.
        File stride size set to 17 * record size.
                                                            random  random
              KB  reclen   write rewrite    read    reread    read   write
         1048576       4  142393  158101                     79471  133974

iozone test complete.

But it also performed no reads at all during the rewrite and random
write tests, as shown by the server's nfsstat output:

 Read   Write
    0       0
    0       0
    0       0
    0    1905   <-- sequential write test begins
    0   19571
    0   20066
    0   20214
    0   20036
    0   20051
    0   17221
    0   13311   <-- rewrite test begins
    0   20721
    0   22237
    0   21710
    0   21619
    0   21104
    0   18339
 6363    4040   <-- random read test begins
 8906       0
 9885       0
10921       0
12487       0
13311       0
14141       0
16598       0
16178       0
17138       0
17796       0
18765       0
16760       0
 2816    6787   <-- random write test begins
    0   20892
    0   19528
    0   20879
    0   20758
    0   17040
    0   21327
    0   19713   <-- tests finished
    0       0
    0       0
    0       0

CONCLUSION

The rewrite and random write iozone tests appear to demonstrate that in
some cases the FreeBSD NFS client treats the rsize/wsize settings as a
fixed block size rather than a not-to-exceed size.  The data suggests
that it requests a full rsize-sized block read from the server before
performing a 4 KiB write issued by an application, and that the result
of the read is not used.  Linux does not perform the same
read-before-write, and consequently achieves much higher throughput.

The next step would probably be to examine the NFS client implementation
to determine the source of the apparently unnecessary reads.  If they
could be eliminated, it could lead to performance improvements of up to
20x on some workloads.

Thanks!
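To make the "not-to-exceed" suggestion concrete, here is the per-write
decision sketched as a predicate.  This is hypothetical logic of my own,
not the existing FreeBSD client code; NFS WRITE accepts any offset and
count, so correctness never requires a read, and only filling in a
partially-written cached block could:

```python
def needs_read_before_write(offset, length, bsize, caching_block):
    """Whether a client must READ before a WRITE of (offset, length).
    Hypothetical 'not-to-exceed' logic: read only when the client wants
    to cache the surrounding block and the write covers it partially."""
    covers_whole_blocks = (offset % bsize == 0) and (length % bsize == 0)
    return caching_block and not covers_whole_blocks

# iozone's 4 KiB write at an offset seen in the trace, with no need to
# cache the enclosing 8 KiB block:
print(needs_read_before_write(606498816, 4096, 8192, caching_block=False))  # False
```

Under this rule, none of the reads observed in the rewrite and random
write tests would be issued, which is the behaviour the Linux client
exhibits.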