Date:      Wed, 24 Oct 2012 19:06:29 +1100 (EST)
From:      Bruce Evans <brde@optusnet.com.au>
To:        Rick Macklem <rmacklem@uoguelph.ca>
Cc:        freebsd-fs@FreeBSD.org, Ronald Klop <ronald-freebsd8@klop.yi.org>
Subject:   Re: Poor throughput using new NFS client (9.0) vs. old (8.2/9.0)
Message-ID:  <20121024180148.L978@besplex.bde.org>
In-Reply-To: <86699361.2739800.1351035439228.JavaMail.root@erie.cs.uoguelph.ca>
References:  <86699361.2739800.1351035439228.JavaMail.root@erie.cs.uoguelph.ca>

On Tue, 23 Oct 2012, Rick Macklem wrote:

> Thomas Johnson wrote:
>> I built a test image based on 9.1-rc2, per your suggestion Rick. The
>> results are below. I was not able to exactly reproduce the workload in
>> my original message, so I have also included results for the new (very
>> similar) workload on my 9.0 client image as well.
>> ...
>> root@test:/-> mount | grep test
>> server:/array/test on /test (nfs)
>> root@test:/test-> zip BIGGER_PILE.zip BIG_PILE_53*
>> adding: BIG_PILE_5306.zip (stored 0%)
>> adding: BIG_PILE_5378.zip (stored 0%)
>> adding: BIG_PILE_5386.zip (stored 0%)
>> root@test:/test-> ll -h BIGGER_PILE.zip
>> -rw-rw-r-- 1 root claimlynx 5.5M Oct 23 14:05 BIGGER_PILE.zip
>> root@test:/test-> time zip BIGGER_PILE.zip 53*.zip > /dev/null
>> 0.664u 1.693s 0:30.21 7.7% 296+3084k 0+2926io 0pf+0w
>> 0.726u 0.989s 0:08.04 21.1% 230+2667k 0+2956io 0pf+0w
>> 0.829u 1.268s 0:11.89 17.4% 304+3037k 0+2961io 0pf+0w
>> 0.807u 0.902s 0:08.02 21.1% 233+2676k 0+2947io 0pf+0w
>> 0.753u 1.354s 0:12.73 16.4% 279+2879k 0+2947io 0pf+0w
>> root@test:/test-> ll -h BIGGER_PILE.zip
>> -rw-rw-r-- 1 root claimlynx 89M Oct 23 14:03 BIGGER_PILE.zip
>> 
>> [context moved]:
>> root@test:/test-> mount | grep test
>> server:/array/test on /test (oldnfs)
>> root@test:/test-> time zip BIGGER_PILE.zip 53*.zip > /dev/null
>> 0.645u 1.435s 0:08.05 25.7% 295+3044k 0+5299io 0pf+0w
>> 0.783u 0.993s 0:06.48 27.3% 225+2499k 0+5320io 0pf+0w
>> 0.787u 1.000s 0:06.28 28.3% 246+2884k 0+5317io 0pf+0w
>> 0.707u 1.392s 0:07.94 26.3% 266+2743k 0+5313io 0pf+0w
>> 0.709u 1.056s 0:06.08 28.7% 246+2814k 0+5318io 0pf+0w
>>
> Although the runs take much longer (I have no idea why and hopefully
> I can spot something in the packet traces), it shows about half the
> I/O ops.

The variance is also much larger.  oldnfs takes about 27% of the CPU[s]
in all cases, while newnfs takes between 7.7% and 21.1%, with the
difference being mainly due to the extra time taken by newnfs.  It looks
like newnfs is stalling and doing nothing much in the extra time.

> This suggests that it is running at the 64K rsize, wsize
> instead of the 32K used by the old client.

Even 32K is too large for me, but newfs for ffs now defaults to the
same broken 32K.  The comment in sys/param.h still says that the
normal size is 8K, and the buffer cache is still tuned for this size.
Sizes larger than 8K are supported up to 16K, at a cost of wasting
up to half of the buffer cache for sizes of 8K (or 31/32 of the buffer
cache for the minimum size of 512 bytes).  Ones larger than 16K
can cause severe buffer cache kva fragmentation.  However, I haven't
seen the expected large performance losses from the fragmentation for
more than 10 years, except possibly with block sizes of 64K and with
mixtures of file systems with very different block sizes.
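
For ffs the old sizes can still be asked for explicitly at newfs time.
A minimal example (the device name here is just a placeholder; use
whatever your partition actually is):

    # 16K blocks / 2K frags instead of the current 32K default
    newfs -U -b 16384 -f 2048 /dev/ada0s1d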

Even with oldnfs, I saw mysterious dependencies on the block size, 
and almost understood them at one point.  Smaller block sizes tend
to reduce stalls, but when they are too small there are larger
sources of lack of performance.  When stalls occurred, I was able
to see them easily for large files (~1GB) by watching network
throughput using netstat -I <interface> 1.  On my low-end hardware,
nfs could drive the link at not quite the disk i/o speed of 45-55 MB/s
when sending a single large file.  It got within 5% of that when it
didn't stall.  When a stall occurred, the network traffic dropped to
almost none for a second or more, and the worst results were when it
stalled for several seconds instead of only 1.  Some
stalls were caused by the server's caches filling up.  Then the
sender must stall since there is nowhere to put its data.  Any
stall reduces throughput, so nfs should try not to write so fast
that stalls occur.  But stalls should only reduce the throughput by
a small percentage, with low variance.  To get the above large
variance from stalls, there must be a problem restarting promptly
after a stall.
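
For reference, the watching is nothing fancier than something like the
following (em0 is just an example interface name; substitute yours):

    # per-second traffic counts for one interface; stalls show up as
    # seconds with almost no output bytes
    netstat -I em0 -w 1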

> Just to confirm. Did you run a test using the new nfs client
> with rsize=32768,wsize=32768 mount options, so the I/O size is
> the same as with the old client?

I also tested with udp.  udp tends to be faster iff there are no
lost packets, and my LAN rarely loses packets.  With very old nfs
clients, there are different bugs affecting udp and tcp that
give very confusing differences for different packet sizes.
Another detail that I never understood is that rsize != wsize
generally works worse than rsize == wsize, even in the direction
that you think you are optimizing for by increasing or decreasing
the size.
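
I haven't rerun your exact case, but the sort of mounts I compare look
like the following (the server path is taken from your example; adjust
the options to match what you actually use):

    # new client, forced down to the old client's i/o size
    mount -t nfs -o rsize=32768,wsize=32768 server:/array/test /test
    # the same over udp instead of tcp
    mount -t nfs -o udp,rsize=32768,wsize=32768 server:/array/test /test
    # old client, for comparison
    mount -t oldnfs server:/array/test /test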

>>>> We tend to deal with a huge number of tiny files (several KB in
>>>> size). The NFS server has been running 9.0 for some time (prior to
>>>> the client upgrade) without any issue. NFS is served from a zpool,
>>>> backed by a Dell MD3000, populated with 15k SAS disks. Clients and
>>>> server are connected with Gig-E links. The general hardware
>>>> configuration has not changed in nearly 3 years.

I mainly tested throughput for large files.  For small files, I
optimize for latency instead of throughput by reducing interrupt
moderation as much as possible.  My LAN mostly has (ping) latencies
of 100 usec when unloaded, but I can get this down to 50-60 usec by
tuning.
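
The details are driver-dependent.  For em(4) hardware the knobs are
loader tunables along these lines (the values only illustrate turning
moderation right down, not a recommendation, and other drivers spell
their knobs differently):

    # /boot/loader.conf
    hw.em.rx_int_delay=0
    hw.em.rx_abs_int_delay=0
    hw.em.tx_int_delay=0
    hw.em.tx_abs_int_delay=0

then remeasure with something like "ping -q -c 100 otherhost" to see
whether the latency actually moved.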

>>>> As an example of the performance difference, here is some of the
>>>> testing I did while troubleshooting. Given a directory containing
>>>> 5671 zip files, with an average size of 15KB, I append all files
>>>> to an existing zip file. Using the newnfs mount, I found that this
>>>> operation generally takes ~30 seconds (wall time). Switching the
>>>> mount to oldnfs resulted in the same operation taking ~10 seconds.

Mixed file sizes exercise the buffer cache fragmentation.  I think that
if most are < 16K, buffers of size <= 16K are allocated for them (unless
nfs always allocates its r or w size).  The worst case is if you have
all buffers in use, with each having 16K of kva.  Then to get a 32K
buffer, the system has to free 2 contiguous 16K ones (with the first on
a 32K boundary).  It has to do extra vm searching and vm remapping
operations for this, compared with using 16K buffers throughout -- then
the system just frees the LRU buffer and uses it with its kva mapping
unchanged.  Usually the search succeeds and just takes more CPU, but
sometimes it fails and the system has to sleep waiting for kva.  The
sleep message for this stall is "nbufkv".
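
It is easy to check whether you are hitting this: while the zip run is
stalled, look at the wait channel of the process (and at the buffer
space sysctls), e.g.:

    # the MWCHAN column shows the sleep channel; "nbufkv" means a
    # buffer kva stall
    ps -axlww | grep zip
    # how full the buffer cache is
    sysctl vfs.bufspace vfs.maxbufspace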

In other tests, I see the buffer cache stalling for several seconds
(with high variance) when writing to slow media like dvds.  The buffer
cache certainly fills up in these cases, but it seems suboptimal to
stall for several seconds waiting for a single buffer (I think it means
that all buffers are in use for writing and the disk hardware doesn't
report any completions for several seconds).  Stalling for a much
shorter time more often would be little different for throughput to a
dvd, but better for nfs.
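
If I remember correctly, the knobs controlling this are the
runningspace sysctls, which limit how much write i/o may be in flight
before writers are put to sleep.  Something like the following is how
I look at them (the value in the last line is only an illustration):

    sysctl vfs.lorunningspace vfs.hirunningspace
    # e.g. clamp in-flight writes to 1MB to trade one long stall for
    # shorter, more frequent ones
    sysctl vfs.hirunningspace=1048576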

Summary: I think reducing the block size should fix your problem, but
larger block sizes shouldn't work so badly and shouldn't be the default
when they do.
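
E.g., with the new client that would be just an fstab entry like (tcp
and nfsv3 assumed; keep whatever other options you already use):

    server:/array/test  /test  nfs  rw,nfsv3,rsize=32768,wsize=32768  0  0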

Bruce
