Date:      Fri, 24 Jan 2014 06:18:20 -0500
From:      J David <j.david.lists@gmail.com>
To:        Rick Macklem <rmacklem@uoguelph.ca>
Cc:        freebsd-net@freebsd.org
Subject:   Re: Terrible NFS performance under 9.2-RELEASE?
Message-ID:  <CABXB=RT8rPLHuSVkMpne9UirNc2NouT3ki02CKmYXd+F15i-qA@mail.gmail.com>
In-Reply-To: <CABXB=RSp2qkaNubuqmLFRiB2gaZK6oizmJy+mdCSh5xJ25UhTQ@mail.gmail.com>
References:  <CABXB=RR5NgRpqXA8XQDks3HhxC6QQW3aG+sH4BUxFL7kbFuJ_g@mail.gmail.com> <58591523.15519962.1390533482068.JavaMail.root@uoguelph.ca> <CABXB=RRP=N5_sOQrhov_3WVHd3awRjT4-oqd-BUrKgmPPk6fpg@mail.gmail.com> <CABXB=RSp2qkaNubuqmLFRiB2gaZK6oizmJy+mdCSh5xJ25UhTQ@mail.gmail.com>

INTRODUCTION

While researching NFS performance problems in our environment, I ran a
series of four iozone tests twice, once with rsize=4096,wsize=4096 and
again with rsize=8192,wsize=8192.  That produced additional avenues of
inquiry leading to the identification of two separate potential
problems with NFS performance.  This message discusses the first
problem, apparently extraneous fixed-sized reads before writes.

In order to maximize network throughput (>16Gbit/sec) and minimize
network latency (<0.1ms), these tests were performed using two FreeBSD
guests on the same KVM host node with bridged virtio adapters having
TSO and LRO enabled.  To remove local or network-backed virtual disks
as a bottleneck, a 2GiB memory-backed-UFS-over-NFS filesystem served
as the NFS export from the server to the client.

BENCHMARK DATA

Here is the iozone run against the 4k NFS mount:

$ iozone -e -I -s 1g -r 4k -i 0 -i 2
Iozone: Performance Test of File I/O
       Version $Revision: 3.420 $
[...]
Include fsync in write timing
File size set to 1048576 KB
Record Size 4 KB
Command line used: iozone -e -s 1g -r 4k -i 0 -i 2
Output is in Kbytes/sec
Time Resolution = 0.000005 seconds.
Processor cache size set to 1024 Kbytes.
Processor cache line size set to 32 bytes.
File stride size set to 17 * record size.
                                                            random  random
              KB  reclen   write rewrite    read    reread    read   write
         1048576       4   71167   73714                     27525   46349

iozone test complete.

Now, here are the read and write columns from nfsstat -s -w 1 over the
entire run:

Read Write
0 0
0 0
0 0
0 11791  <-- sequential write test begins
0 17215
0 18260
0 17968
0 17810
0 17839
0 17959
0 17912
0 18180
0 18285
0 18636
0 18554
0 18178
0 17361
0 17803 <-- sequential rewrite test begins
0 18358
0 18188
0 18817
0 17757
0 18153
0 18924
0 19444
0 18775
0 18995
0 18198
0 18949
0 17978
0 18879
0 19055
7904 67 <-- random read test begins
7502 0
7194 0
7432 0
6995 0
6844 0
6730 0
6761 0
7011 0
7058 0
7477 0
7139 0
6793 0
7047 0
6402 0
6621 0
7111 0
6911 0
7413 0
7431 0
7047 0
7002 0
7104 0
6987 0
6849 0
6580 0
6268 0
6868 0
6775 0
6335 0
6588 0
6595 0
6587 0
6512 0
6861 0
6953 0
7273 0
5184 1688 <--  random write test begins
0 11795
0 11915
0 11916
0 11838
0 12035
0 11408
0 11780
0 11488
0 11836
0 11787
0 11824
0 12099
0 11863
0 12154
0 11127
0 11434
0 11815
0 11960
0 11510
0 11623
0 11714
0 11896
0 1637 <--  test finished
0 0
0 0
0 0

This looks exactly like you would expect.

Now, we re-run the exact same test with rsize/wsize set to 8k:

$ iozone -e -I -s 1g -r 4k -i 0 -i 2
Iozone: Performance Test of File I/O
       Version $Revision: 3.420 $
Compiled for 64 bit mode.
Build: freebsd

Contributors:William Norcott, Don Capps, Isom Crawford, Kirby Collins
            Al Slater, Scott Rhine, Mike Wisner, Ken Goss
            Steve Landherr, Brad Smith, Mark Kelly, Dr. Alain CYR,
            Randy Dunlap, Mark Montague, Dan Million, Gavin Brebner,
            Jean-Marc Zucconi, Jeff Blomberg, Benny Halevy, Dave Boone,
            Erik Habbinga, Kris Strecker, Walter Wong, Joshua Root,
            Fabrice Bacchella, Zhenghua Xue, Qin Li, Darren Sawyer,
            Vangel Bojaxhi, Ben England, Vikentsi Lapa.

Run began: Fri Jan 24 07:45:06 2014

Include fsync in write timing
File size set to 1048576 KB
Record Size 4 KB
Command line used: iozone -e -s 1g -r 4k -i 0 -i 2
Output is in Kbytes/sec
Time Resolution = 0.000005 seconds.
Processor cache size set to 1024 Kbytes.
Processor cache line size set to 32 bytes.
File stride size set to 17 * record size.
                                                            random  random
              KB  reclen   write rewrite    read    reread    read   write
         1048576       4  124820   34850                     35996   16258

iozone test complete.

And here are the second-by-second counts of corresponding read/write
operations from nfsstat:

Read Write
0 0
0 0
0 0
11 10200 <--  sequential write test begins
13 16109
10 15360
12 16055
4 15623
8 15533
9 16241
11 16045
1891 11874 <--  sequential rewrite test begins
4508 4508
4476 4476
4353 4353
4350 4351
4230 4229
4422 4422
4731 4731
4600 4600
4550 4550
4314 4314
4273 4273
4422 4422
4284 4284
4609 4609
4600 4600
4362 4361
4386 4387
4451 4451
4467 4467
4370 4370
4394 4394
4252 4252
4155 4155
4427 4428
4436 4435
4300 4300
3989 3989
4438 4438
4632 4632
6440 1836 <--  random read test begins
6681 0
6416 0
6290 0
6560 0
6569 0
6237 0
6331 0
6403 0
6310 0
6010 0
6023 0
6271 0
6068 0
6044 0
6003 0
5957 0
5683 0
5833 0
5774 0
5708 0
5810 0
5829 0
5787 0
5791 0
5838 0
5922 0
6002 0
6191 0
4162 2624 <--  random write test begins
3998 4068
4056 4100
3929 3969
4164 4128
4188 4173
4256 4164
3899 4068
4018 3933
4077 4114
4210 4222
4050 4116
4157 3983
4066 4122
4129 4162
4303 4199
4235 4412
4100 3932
4056 4254
4192 4186
4204 4120
4085 4138
4103 4146
4272 4193
3943 4028
4161 4163
4153 4131
4218 4017
4132 4233
4195 4101
4041 4145
3994 4086
4297 4193
4205 4341
4101 4103
4134 3962
4297 4408
4282 4242
4180 4175
4286 4216
4217 4397
4364 4253
3673 3720
3843 3807
4147 4183
4171 4181
4280 4224
4126 4158
3977 4074
4146 3919
4147 4362
4079 4060
3755 3760
4157 4130
4087 4109
4006 3873
3860 3967
3982 4048
4146 3963
4188 4203
4040 4063
3976 4046
3859 3815
4114 4193
393 447 <--  test finished
0 0
0 0
0 0

To go through the tests individually...

SEQUENTIAL WRITE TEST

The first test is the sequential write; the ktrace for this test looks
like this:

  3648 iozone   CALL  write(0x3,0x801500000,0x1000)
  3648 iozone   GIO   fd 3 wrote 4096 bytes
  3648 iozone   RET   write 4096/0x1000
  3648 iozone   CALL  write(0x3,0x801500000,0x1000)
  3648 iozone   GIO   fd 3 wrote 4096 bytes
  3648 iozone   RET   write 4096/0x1000

It's just writing 4k in a loop, 262,144 times.

This test is substantially faster with wsize=8192 than wsize=4096
(124,820 KiB/sec at 8k vs 71,167 KiB/sec at 4k).  The nfsstat output
shows that this is because writes are being coalesced: there are half
as many writes in the 8k test as in the 4k test (8k = 131,072,
4k = 262,144).
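
The coalescing arithmetic is easy to check; a quick sketch (plain
Python, using only the figures quoted above):

```python
# Sanity-check the write-coalescing arithmetic from the two runs above.
FILE_SIZE = 1 * 1024 ** 3          # the 1 GiB iozone test file

writes_4k = FILE_SIZE // 4096      # one NFS write per 4 KiB record
writes_8k = FILE_SIZE // 8192      # two records coalesced per wsize block

print(writes_4k)  # 262144
print(writes_8k)  # 131072
```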

So far so good.

SEQUENTIAL REWRITE TEST

But when we move on to the rewrite test, there's trouble in River
City.  Here's what it looks like in ktrace:

  3648 iozone   CALL  write(0x3,0x801500000,0x1000)
  3648 iozone   GIO   fd 3 wrote 4096 bytes
  3648 iozone   RET   write 4096/0x1000
  3648 iozone   CALL  write(0x3,0x801500000,0x1000)
  3648 iozone   GIO   fd 3 wrote 4096 bytes
  3648 iozone   RET   write 4096/0x1000

As with the write test, it just writes 4k in a loop 262,144 times.

Yet despite being exactly the same sequence of calls, this test is
less than half as fast with 8k NFS as with 4k NFS (34,850 KiB/sec at
8k vs 73,714 KiB/sec at 4k).  The only difference between this test
and the previous one is that now the file exists.  And although the 4k
NFS client writes it the same way as in the first test and therefore
gets almost exactly the same speed, the 8k NFS client is reading and
writing in roughly equal proportion, and drops by a factor of 3.5x
from its own performance on the first write test.  Note that the total
number of writes/sec on the 4k NFS rewrite test hovers around 18,500.
On the 8k NFS test, the reads and writes are roughly balanced at
around 4,242 each.  So the interleaved reads performed by the NFS
client cut the total IOPs available by more than 50%, twice, leaving
less than a quarter for the application.  And these numbers roughly
correlate to the observed speeds: 4,242 * 8 KiB = 33,936 KiB, 18,500 *
4 KiB = 74,000 KiB.  A total of 131,072 NFS writes are issued during
the 8k NFS test, for a maximum possible write of 1 GiB.  Iozone is
writing a total of 1 GiB (262,144 x 4 KiB).  So none of the read data
is being used, and those extraneous reads appear to be the source of
the performance drop.
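
The correlation between op rates and throughput can be verified with a
couple of lines of arithmetic (Python, numbers taken from the nfsstat
output above):

```python
# Per-second NFS write rates observed during the two rewrite runs.
ops_4k = 18500   # writes/sec, 4k NFS (no reads observed)
ops_8k = 4242    # writes/sec at 8k NFS (matched 1:1 by reads)

# Each NFS write carries one wsize block of application data.
throughput_4k = ops_4k * 4   # KiB/sec; measured 73,714
throughput_8k = ops_8k * 8   # KiB/sec; measured 34,850

print(throughput_4k)  # 74000
print(throughput_8k)  # 33936
```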

RANDOM READ TEST

Considering the random read test, things look better again.  Here's
what it looks like in ktrace:

  3947 iozone   CALL  lseek(0x3,0x34caa000,SEEK_SET)
  3947 iozone   RET   lseek 885694464/0x34caa000
  3947 iozone   CALL  read(0x3,0x801500000,0x1000)
  3947 iozone   GIO   fd 3 read 4096 bytes
  3947 iozone   RET   read 4096/0x1000
  3947 iozone   CALL  lseek(0x3,0x3aa53000,SEEK_SET)
  3947 iozone   RET   lseek 983904256/0x3aa53000
  3947 iozone   CALL  read(0x3,0x801500000,0x1000)
  3947 iozone   GIO   fd 3 read 4096 bytes
  3947 iozone   RET   read 4096/0x1000

This test lseeks to a random location (on a 4k boundary) then reads
4k, in a loop, 262,144 times.

On this test, 8k NFS moderately outperforms 4k NFS (35,996 KiB/sec at
8k vs 27,525 KiB/sec at 4k).  This is probably attributable to lucky
adjacent requests, which are visible in the total number of reads: 4k
NFS produces the expected 262,144 NFS read ops, but 8k NFS performs
about 176,000, an average of about 6 KiB of useful data per read,
suggesting it was doing 8 KiB reads of which the whole 8 KiB was
useful about 50% of the time.
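
That "useful about 50% of the time" estimate comes out of a simple
model (a sketch, assuming each 8 KiB read returns either one or two of
the application's 4 KiB records):

```python
# Each 8 KiB NFS read is either fully useful (covers two records the
# application will read) or half useful (covers one).  Solve for the
# fully-useful fraction implied by ~176,000 reads for a 1 GiB file.
FILE_SIZE = 1 * 1024 ** 3
nfs_reads = 176_000

useful_kib_per_read = FILE_SIZE / nfs_reads / 1024
# avg useful = 8*x + 4*(1 - x)  =>  x = (avg - 4) / 4
fully_useful_fraction = (useful_kib_per_read - 4) / 4

print(round(useful_kib_per_read, 1))       # ~6.0 KiB per read
print(round(fully_useful_fraction, 2))     # ~0.49, i.e. about half
```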

RANDOM WRITE TEST

Finally, the random write test.  Here's the ktrace output from this one:

  3262 iozone   CALL  lseek(0x3,0x29665000,SEEK_SET)
  3262 iozone   RET   lseek 694571008/0x29665000
  3262 iozone   CALL  write(0x3,0x801500000,0x1000)
  3262 iozone   GIO   fd 3 wrote 4096 bytes
  3262 iozone   RET   write 4096/0x1000
  3262 iozone   CALL  lseek(0x3,0x486c000,SEEK_SET)
  3262 iozone   RET   lseek 75939840/0x486c000
  3262 iozone   CALL  write(0x3,0x801500000,0x1000)
  3262 iozone   GIO   fd 3 wrote 4096 bytes
  3262 iozone   RET   write 4096/0x1000
  3262 iozone   CALL  lseek(0x3,0x33310000,SEEK_SET)
  3262 iozone   RET   lseek 858849280/0x33310000
  3262 iozone   CALL  write(0x3,0x801500000,0x1000)
  3262 iozone   GIO   fd 3 wrote 4096 bytes
  3262 iozone   RET   write 4096/0x1000
  3262 iozone   CALL  lseek(0x3,0x2ae3000,SEEK_SET)

That's all it is: lseek() then write(), repeated 262,144 times.  No
reads, and again, all the lseeks are on 4k boundaries.

This test is dramatically slower with the 8k NFS, falling from 46,349
kiB/sec for 4k NFS to 16,258 kiB/sec for 8k NFS.

Like the rewrite test, the nfsstat output shows an even mix of reads
and writes during the 8k NFS run of this test, even though the
application never issues any reads.  Unlike the 8k NFS rewrite test,
which performs 131,072 NFS write operations, the 8k NFS random write
test performs 262,025 write operations.  This strongly suggests it was
converting each 4 KiB write into an 8 KiB read-modify-write cycle,
with the commensurate performance penalty in almost every case, with a
handful of writes (about 119) lucky enough to be adjacent and get
coalesced.  Since the NFS server is supposed to be able to handle
arbitrary-length writes at arbitrary file offsets, this appears to be
unnecessary.
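
The op counts and the waste they imply check out arithmetically
(Python, figures from the run above):

```python
# Op-count arithmetic for the 8k NFS random write test.
app_writes = 262_144     # 4 KiB writes issued by iozone
nfs_writes = 262_025     # NFS write ops observed at the server

coalesced = app_writes - nfs_writes
print(coalesced)         # 119 "lucky adjacent" writes

# For every surviving write, an 8 KiB read precedes a 4 KiB write, so
# two thirds of the data on the wire is the apparently unused read.
wasted_fraction = 8 / (8 + 4)
print(round(wasted_fraction, 2))   # 0.67
```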

To confirm that this is the case, and to determine whether it is doing
4kiB or 8kiB writes, we will have to go to the packets.  Here is a
sample from the random write test:

09:31:52.338749 IP 172.20.20.169.1877003014 > 172.20.20.162.2049: 128
read fh 1326,127488/4 8192 bytes @ 606494720
09:31:52.338877 IP 172.20.20.162.2049 > 172.20.20.169.1877003014:
reply ok 8320 read
09:31:52.338944 IP 172.20.20.169.1877003015 > 172.20.20.162.2049: 4232
write fh 1326,127488/4 4096 (4096) bytes @ 606498816
09:31:52.338964 IP 172.20.20.169.1877003016 > 172.20.20.162.2049: 128
read fh 1326,127488/4 8192 bytes @ 85237760
09:31:52.339037 IP 172.20.20.162.2049 > 172.20.20.169.719: Flags [.],
ack 4008446365, win 29118, options [nop,nop,TS val 4241448191 ecr
17929512], length 0
09:31:52.339076 IP 172.20.20.162.2049 > 172.20.20.169.1877003015:
reply ok 160 write [|nfs]
09:31:52.339117 IP 172.20.20.162.2049 > 172.20.20.169.1877003016:
reply ok 8320 read
09:31:52.339142 IP 172.20.20.169.719 > 172.20.20.162.2049: Flags [.],
ack 8488, win 515, options [nop,nop,TS val 17929512 ecr 4241448191],
length 0
09:31:52.339183 IP 172.20.20.169.1877003017 > 172.20.20.162.2049: 4232
write fh 1326,127488/4 4096 (4096) bytes @ 85241856
09:31:52.339201 IP 172.20.20.169.1877003018 > 172.20.20.162.2049: 128
read fh 1326,127488/4 8192 bytes @ 100843520
09:31:52.339271 IP 172.20.20.162.2049 > 172.20.20.169.719: Flags [.],
ack 4369, win 29118, options [nop,nop,TS val 4241448191 ecr 17929512],
length 0
09:31:52.339310 IP 172.20.20.162.2049 > 172.20.20.169.1877003017:
reply ok 160 write [|nfs]
09:31:52.339332 IP 172.20.20.162.2049 > 172.20.20.169.1877003018:
reply ok 8320 read
09:31:52.339355 IP 172.20.20.169.719 > 172.20.20.162.2049: Flags [.],
ack 16976, win 515, options [nop,nop,TS val 17929512 ecr 4241448191],
length 0
09:31:52.339408 IP 172.20.20.169.1877003019 > 172.20.20.162.2049: 128
read fh 1326,127488/4 8192 bytes @ 330153984
09:31:52.339514 IP 172.20.20.162.2049 > 172.20.20.169.1877003019:
reply ok 8320 read
09:31:52.339562 IP 172.20.20.169.1877003020 > 172.20.20.162.2049: 128
read fh 1326,127488/4 8192 bytes @ 500056064
09:31:52.339669 IP 172.20.20.162.2049 > 172.20.20.169.1877003020:
reply ok 8320 read
09:31:52.339728 IP 172.20.20.169.1877003022 > 172.20.20.162.2049: 128
read fh 1326,127488/4 8192 bytes @ 778158080
09:31:52.339758 IP 172.20.20.169.1877003021 > 172.20.20.162.2049: 4232
write fh 1326,127488/4 4096 (4096) bytes @ 500060160
09:31:52.339834 IP 172.20.20.162.2049 > 172.20.20.169.719: Flags [.],
ack 9001, win 29118, options [nop,nop,TS val 4241448191 ecr 17929512],
length 0

This confirms that the NFS client is indeed performing a read-before-write
cycle.  However, the packets also show that the writes are 4 KiB, not
8 KiB, so no modify is occurring.  E.g. it reads 8 KiB at offset
606494720 and then immediately writes the 4 KiB from the application
at offset 606498816 (which is 606494720 + 4096).  It's not clear what
purpose the read serves in this scenario.

FOLLOW-UP TESTS

To help determine which direction to go next, I conducted two
follow-up tests.  First, I re-ran the test with the default 64k
rsize/wsize values, in an effort to confirm that it would do 64k reads
followed by 4k writes.  It was necessary to do this with a much
smaller file size due to the poor performance, and with
net.inet.tcp.delacktime=10 on both client and server to work around a
separate issue with TCP writes >64kiB, but the results are sufficient
for this purpose:

$ iozone -e -s 32m -r 4k -i 0 -i 2
Iozone: Performance Test of File I/O
       Version $Revision: 3.420 $
Compiled for 64 bit mode.
Build: freebsd

Contributors:William Norcott, Don Capps, Isom Crawford, Kirby Collins
            Al Slater, Scott Rhine, Mike Wisner, Ken Goss
            Steve Landherr, Brad Smith, Mark Kelly, Dr. Alain CYR,
            Randy Dunlap, Mark Montague, Dan Million, Gavin Brebner,
            Jean-Marc Zucconi, Jeff Blomberg, Benny Halevy, Dave Boone,
            Erik Habbinga, Kris Strecker, Walter Wong, Joshua Root,
            Fabrice Bacchella, Zhenghua Xue, Qin Li, Darren Sawyer,
            Vangel Bojaxhi, Ben England, Vikentsi Lapa.

Run began: Fri Jan 24 10:35:14 2014

Include fsync in write timing
File size set to 32768 KB
Record Size 4 KB
Command line used: iozone -e -s 32m -r 4k -i 0 -i 2
Output is in Kbytes/sec
Time Resolution = 0.000005 seconds.
Processor cache size set to 1024 Kbytes.
Processor cache line size set to 32 bytes.
File stride size set to 17 * record size.
                                                            random  random
              KB  reclen   write rewrite    read    reread    read   write
           32768       4   12678    5292                      5355     741

iozone test complete.

Although the use of TCP writes >64kiB largely mangles tcpdump's
ability to interpret the NFS messages due to fragmentation, it was
possible to grep out enough to see that the suspected 64kiB read /
4kiB write pairs are indeed occurring:

10:36:11.356519 IP 172.20.20.169.1874848588 > 172.20.20.162.2049: 128
read fh 1326,127488/5 65536 bytes @ 18087936
10:36:11.496905 IP 172.20.20.169.1874848606 > 172.20.20.162.2049: 4232
write fh 1326,127488/5 4096 (4096) bytes @ 18096128

As before, the 64kiB read appears not to be used.  The result is that
roughly 94% of the network I/O spent on a 4kiB write is wasted, with
commensurate impact on performance.
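
The "roughly 94%" figure follows from the same waste arithmetic as
before, now with a 64 KiB read per 4 KiB write:

```python
# Waste estimate at the default 64 KiB rsize: each 4 KiB application
# write is preceded by a 64 KiB read, per the packet capture above.
read_kib, write_kib = 64, 4
wasted = read_kib / (read_kib + write_kib)
print(round(wasted * 100, 1))   # 94.1
```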

In an attempt to verify the theory that the data being read is not
necessary, the 8k NFS test was retried with a Debian Linux guest.  The
hypothesis was that if the Debian Linux client did not perform the
reads, that would indicate that they were not necessary.  The results
appeared to support that hypothesis.  Not only did the Linux client
handily outperform the FreeBSD client on all four tests:

# iozone -e -s 1g -r 4k -i 0 -i 2
Iozone: Performance Test of File I/O
       Version $Revision: 3.397 $
Compiled for 64 bit mode.
Build: linux-AMD64
[...]
Include fsync in write timing
File size set to 1048576 KB
Record Size 4 KB
Command line used: iozone -e -s 1g -r 4k -i 0 -i 2
Output is in Kbytes/sec
Time Resolution = 0.000001 seconds.
Processor cache size set to 1024 Kbytes.
Processor cache line size set to 32 bytes.
File stride size set to 17 * record size.
                                                            random  random
              KB  reclen   write rewrite    read    reread    read   write
         1048576       4  142393  158101                     79471  133974

iozone test complete.

But it also did not perform any reads during the rewrite and random
write tests, as shown by the server's nfsstat output:

Read Write
0 0
0 0
0 0
0 1905 <--  sequential write test begins
0 19571
0 20066
0 20214
0 20036
0 20051
0 17221
0 13311 <--  rewrite test begins
0 20721
0 22237
0 21710
0 21619
0 21104
0 18339
6363 4040 <--  random read test begins
8906 0
9885 0
10921 0
12487 0
13311 0
14141 0
16598 0
16178 0
17138 0
17796 0
18765 0
16760 0
2816 6787 <--  random write test begins
0 20892
0 19528
0 20879
0 20758
0 17040
0 21327
0 19713 <--  tests finished
0 0
0 0
0 0


CONCLUSION

The rewrite and random write iozone tests appear to demonstrate that
in some cases the FreeBSD NFS client treats the rsize/wsize settings
as a fixed block size rather than a not-to-exceed size.  The data
suggests that it requests a full rsize-sized block read from the
server before performing a 4kiB write issued by an application, and
that the result of the read is not used.  Linux does not perform the
same read-before-write strategy, and consequently achieves much higher
throughput.
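
To illustrate what "not-to-exceed" would mean, here is a toy model
(plain Python, emphatically not the FreeBSD buffer-cache code; the
function name and flags are hypothetical) of when a read-before-write
is actually required.  The protocol always permits sending the partial
write unmodified; a read is only needed if the client insists on
keeping a fully valid cache block it does not already have:

```python
WSIZE = 8192   # assumed client cache block size (= wsize)

def read_before_write_needed(offset, length, cache_block_valid,
                             block=WSIZE):
    """Toy model: return True only when filling out the cache block
    requires data the client does not already hold."""
    covers_whole_block = offset % block == 0 and length >= block
    if covers_whole_block or cache_block_valid:
        return False
    # Partial write into a cold block: the read is needed only if the
    # client insists on caching the whole block.  It could instead send
    # the partial write as-is, which is what the Linux client does.
    return True

# The 4 KiB write at 606498816 from the capture hits a cold 8 KiB
# block and triggers the read in this model; a full-block write never does.
print(read_before_write_needed(606498816, 4096, cache_block_valid=False))  # True
print(read_before_write_needed(0, 8192, cache_block_valid=False))          # False
```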

The next step would probably be to examine the NFS client
implementation to determine the source of the apparently unnecessary
reads.  If they can be eliminated, that may yield performance
improvements of up to 20x on some workloads.

Thanks!
