Date:      Sat, 9 Mar 2013 11:27:32 -0500 (EST)
From:      Rick Macklem <rmacklem@uoguelph.ca>
To:        Garrett Wollman <wollman@freebsd.org>
Cc:        freebsd-fs@freebsd.org, freebsd-net@freebsd.org
Subject:   Re: NFS DRC size
Message-ID:  <1639798917.3728142.1362846452693.JavaMail.root@erie.cs.uoguelph.ca>
In-Reply-To: <20794.38381.221980.5038@hergotha.csail.mit.edu>

Garrett Wollman wrote:
> <<On Fri, 8 Mar 2013 19:47:13 -0500 (EST), Rick Macklem
> <rmacklem@uoguelph.ca> said:
> 
> > The cached replies are copies of the mbuf list done via m_copym().
> > As such, the clusters in these replies won't be free'd (ref cnt ->
> > 0)
> > until the cache is trimmed (nfsrv_trimcache() gets called after the
> > TCP layer has received an ACK for receipt of the reply from the
> > client).
> 
> I wonder if this bit is even working at all. In my experience, the
> size of the DRC quickly grows under load up to the maximum (or
> actually, slightly beyond), and never drops much below that level. On
> my production server right now, "nfsstat -se" reports:
> 
Well, once you add the patches and turn vfs.nfsd.tcphighwater up, it
will only trim the cache when that highwater mark is exceeded. When
it does the trim, the size does drop for the simple testing I do with
a single client. (I'll take another look at drc3.patch and see if I
can spot anywhere this might be broken, although my hunch is
that you have a lot of TCP connections and enough activity that it
rapidly grows back up to the limit.) The fact that it trims down to
around the highwater mark basically indicates this is working. If it wasn't
throwing away replies whose receipt has been ack'd at the TCP
level, the cache would grow very large, since they would only be
discarded after a loonnngg timeout (12 hours, unless you've changed
NFSRVCACHE_TCPTIMEOUT in sys/fs/nfs/nfs.h).
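
To make the trim rule concrete, here's a rough userland sketch of what the
decision amounts to (this isn't the actual kernel cache code; the structure
and names are made up for illustration):

#include <stdbool.h>
#include <stddef.h>

#define TCPTIMEOUT	(12 * 60 * 60)	/* NFSRVCACHE_TCPTIMEOUT is 12 hours */

struct drc_entry {
	struct drc_entry *next;
	bool	acked;		/* client's TCP layer has ack'd the reply */
	long	timestamp;	/* when the reply was cached (seconds) */
};

static int cachesize;		/* current number of cached TCP replies */
static int tcphighwater;	/* vfs.nfsd.tcphighwater */

/*
 * Trimming is a no-op until the highwater mark is exceeded.  Once it is,
 * entries whose replies have been ack'd at the TCP level (or that have
 * sat for the 12 hour timeout) are unlinked; the caller frees the
 * m_copym() copy of the reply, which finally releases the clusters.
 */
static void
drc_trim(struct drc_entry **head, long now)
{
	struct drc_entry **epp, *ep;

	if (cachesize <= tcphighwater)
		return;
	for (epp = head; (ep = *epp) != NULL; ) {
		if (ep->acked || now - ep->timestamp > TCPTIMEOUT) {
			*epp = ep->next;
			cachesize--;
		} else
			epp = &ep->next;
	}
}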

> Server Info:
>   Getattr   Setattr    Lookup  Readlink      Read     Write    Create    Remove
>  13036780    359901   1723623      3420  36397693  12385668    346590    109984
>    Rename      Link   Symlink     Mkdir     Rmdir   Readdir  RdirPlus    Access
>     45173        16    116791     14192      1176        24  12876747   3398533
>     Mknod    Fsstat    Fsinfo  PathConf    Commit   LookupP   SetClId SetClIdCf
>         0      2703     14992      7502   1329196         0         1         1
>      Open  OpenAttr OpenDwnGr  OpenCfrm DelePurge   DeleRet     GetFH      Lock
>    263034         0         0    263019         0         0    545104         0
>     LockT     LockU     Close    Verify   NVerify     PutFH  PutPubFH PutRootFH
>         0         0    263012         0         0  23753375         0         1
>     Renew RestoreFH    SaveFH   Secinfo RelLckOwn  V4Create
>         2    263006    263033         0         0         0
> Server:
> Retfailed    Faults   Clients
>         0         0         1
> OpenOwner     Opens LockOwner     Locks    Delegs
>        56        10         0         0         0
> Server Cache Stats:
>    Inprog      Idem  Non-idem    Misses CacheSize   TCPPeak
>         0         0         0  81714128     60997     61017
> 
> It's only been up for about the last 24 hours. Should I be setting
> the size limit to something truly outrageous, like 200,000? (I'd
> definitely need to deal with the mbuf cluster issue then!) The
> average request rate over this time is about 1000/s, but that includes
> several episodes of high-cpu spinning (which I resolved by increasing
> the DRC limit).
> 
It is the number of TCP connections from clients that determines how much
gets cached, not the request rate. For TCP, a scheme like LRU doesn't work,
because RPC retries (as opposed to TCP segment retransmits) only happen long
after the initial RPC request. (Usually after a TCP connection has broken and
the client has established a new connection, although some NFSv3 over TCP
clients will retry an RPC after a long timeout.) The cache needs to hold the
last N RPC replies for each TCP connection and discard them when further
traffic on the TCP connection indicates that the connection is still working.
(Some NFSv3 over TCP servers don't guarantee to generate a reply for an RPC
 when resource constrained, but the FreeBSD one always sends a reply, except
 for NFSv2, where it will close down the TCP connection when it has no choice.
 I doubt any client is doing NFSv2 over TCP, so I don't consider this relevant.)
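
For what it's worth, "further traffic indicates that the connection is still
working" boils down to watching the cumulative ACKs on the connection that
carried the reply. A sketch of that test (illustrative only; the real check
is done against the socket's TCP state in the kernel):

#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

struct cached_reply {
	uint32_t reply_end_seq;	/* TCP seq number just past this reply */
	bool	 can_discard;
};

/*
 * Called when the server sees the client's latest cumulative ACK on the
 * connection that carried the replies.  Once the ACK covers the end of a
 * reply, the client's TCP stack has the data, so the client will not retry
 * that RPC and the cached copy can be discarded.  The sequence comparison
 * is done modulo 2^32, the same way TCP does it.
 */
static void
mark_acked_replies(struct cached_reply *rp, size_t n, uint32_t acked_seq)
{
	for (size_t i = 0; i < n; i++)
		if ((int32_t)(acked_seq - rp[i].reply_end_seq) >= 0)
			rp[i].can_discard = true;
}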

If the CPU is spinning in nfsrc_trimcache() a lot, increasing vfs.nfsd.tcphighwater
should decrease that, but with an increase in mbuf cluster allocation.

If there is a lot of contention for mutexes, increasing the size of the hash
table might help. The drc3.patch bumped the hash table from 20->200,
but that would still be about 300 entries per hash list and one mutex for
those 300 entries, assuming the hash function is working well.
Increasing it only adds list head pointers and mutexes.
(It's NFSRVCACHE_HASHSIZE in sys/fs/nfs/nfsrvcache.h.)
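
To picture what grows when the table gets bigger: each bucket is just a list
head plus a mutex, so with NFSRVCACHE_HASHSIZE at 200 and tcphighwater at
61000, roughly 61000 / 200 ~= 305 entries share each bucket's lock. A
userland illustration of the shape (the kernel uses struct mtx and the
definitions in sys/fs/nfs/nfsrvcache.h, not pthreads):

#include <pthread.h>

#define NFSRVCACHE_HASHSIZE	200	/* 20 before drc3.patch */

struct drc_bucket {
	pthread_mutex_t	 lock;	/* one mutex per hash chain */
	struct drc_entry *head;	/* chain of cache entries for this bucket */
};

static struct drc_bucket drc_hash[NFSRVCACHE_HASHSIZE];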

Unfortunately, increasing it requires a kernel rebuild/reboot. Maybe the patch
for head should change the size of the hash table when vfs.nfsd.tcphighwater
is set much larger? (Not quite trivial and will probably result in a short stall of
the nfsd threads, since all the entries will need to be rehashed/moved to
new lists, but could be worth the effort.)
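
The move itself would look something like the sketch below, assuming all the
nfsd threads are held off while it runs. None of this is existing code, and
drc_hashval() is a made-up stand-in for whatever hash of the entry's key
would actually get used:

#include <stdint.h>

struct drc_entry {
	struct drc_entry *next;
	/* ... xid, client address, copy of the reply, etc. ... */
};

struct drc_bucket {
	struct drc_entry *head;	/* per-bucket mutex omitted for brevity */
};

/* Made-up stand-in for the real hash of an entry's key. */
static unsigned int
drc_hashval(const struct drc_entry *ep)
{
	return ((unsigned int)(uintptr_t)ep);
}

/*
 * Move every entry from the old table onto whatever chain it hashes to in
 * the new, larger table.  Everything must be quiesced while this runs,
 * which is where the short stall of the nfsd threads comes from.
 */
static void
drc_rehash(struct drc_bucket *oldtbl, int oldsize,
    struct drc_bucket *newtbl, int newsize)
{
	struct drc_entry *ep, *nep;
	unsigned int h;
	int i;

	for (i = 0; i < oldsize; i++) {
		for (ep = oldtbl[i].head; ep != NULL; ep = nep) {
			nep = ep->next;
			h = drc_hashval(ep) % (unsigned int)newsize;
			ep->next = newtbl[h].head;
			newtbl[h].head = ep;
		}
		oldtbl[i].head = NULL;
	}
}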

> Meanwhile, some relevant bits from sysctl:
> 
> vfs.nfsd.udphighwater: 500
> vfs.nfsd.tcphighwater: 61000
> vfs.nfsd.minthreads: 16
> vfs.nfsd.maxthreads: 64
> vfs.nfsd.threads: 64
> vfs.nfsd.request_space_used: 1416
> vfs.nfsd.request_space_used_highest: 4284672
> vfs.nfsd.request_space_high: 47185920
> vfs.nfsd.request_space_low: 31457280
> vfs.nfsd.request_space_throttled: 0
> vfs.nfsd.request_space_throttle_count: 0
> 
> (I'd actually like to put maxthreads back up at 256, which is where I
> had it during testing, but I need to test that the jumbo-frames issue
> is fixed first. I did pre-production testing on a non-jumbo network.)
> 
> -GAWollman
> 
Well, the DRC will try to cache replies until the client's TCP layer
acknowledges receipt of the reply. It is hard to say how many replies
that is for a given TCP connection, since it is a function of the level
of concurrency in the client (# of nfsiod threads for the FreeBSD client).
I'd guess it's somewhere between 1<->20?

Multiply that by the number of TCP connections from all clients and
you have about how big the server's DRC will be. (Some clients use
a single TCP connection for all mounts, whereas others use a separate
TCP connection for each mount point.)
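
As a back-of-the-envelope example (both numbers in the sketch are
assumptions, picked only to show how a CacheSize in the 60000 range, like
the one above, can come about):

#include <stdio.h>

int
main(void)
{
	/* Both figures are assumptions, not measurements. */
	int replies_per_conn = 10;	/* somewhere in the 1<->20 range */
	int tcp_connections = 6000;	/* hypothetical total across all clients */

	printf("expected steady-state DRC size ~ %d entries\n",
	    replies_per_conn * tcp_connections);
	return (0);
}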

When ivoras@ and I have a patch for head, it should probably allow
the DRC to be disabled for TCP mounts (by setting vfs.nfsd.tcphighwater == -1?).
I don't really like the idea, but I can see the argument that TCP
maintains a reliable enough RPC transport that the DRC isn't needed.
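
The knob itself would probably amount to no more than a bypass check in
front of the cache lookup, something like the sketch below (no such code
exists yet, and the final patch might pick a different sentinel than -1):

static int nfsrc_tcphighwater;	/* mirrors vfs.nfsd.tcphighwater */

/*
 * Return nonzero if the DRC should be skipped entirely for RPCs that
 * arrive over TCP, trusting TCP to keep the transport reliable enough
 * that duplicate non-idempotent requests can't happen.
 */
static int
nfsrc_skiptcpcache(void)
{
	return (nfsrc_tcphighwater == -1);
}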

rick

> _______________________________________________
> freebsd-net@freebsd.org mailing list
> http://lists.freebsd.org/mailman/listinfo/freebsd-net
> To unsubscribe, send any mail to "freebsd-net-unsubscribe@freebsd.org"


