From owner-freebsd-fs@FreeBSD.ORG Sat Mar 9 16:27:35 2013
Return-Path: 
Delivered-To: freebsd-fs@freebsd.org
Received: from mx1.freebsd.org (mx1.FreeBSD.org [8.8.178.115]) by hub.freebsd.org (Postfix) with ESMTP id 50F51FD2; Sat, 9 Mar 2013 16:27:35 +0000 (UTC) (envelope-from rmacklem@uoguelph.ca)
Received: from esa-annu.net.uoguelph.ca (esa-annu.mail.uoguelph.ca [131.104.91.36]) by mx1.freebsd.org (Postfix) with ESMTP id B9078E61; Sat, 9 Mar 2013 16:27:34 +0000 (UTC)
X-IronPort-Anti-Spam-Filtered: true
X-IronPort-Anti-Spam-Result: AqEEAE5iO1GDaFvO/2dsb2JhbABDiCi8OIF1dIItAQEBAwEBAQEgBCcgCwUWGAICDRkCKQEJJgYIBwQBHASHbAYMqT2SC4EjjCkKBX00B4ItgRMDiHGLJYI+gR6PVYMoT30IFx4
X-IronPort-AV: E=Sophos;i="4.84,814,1355115600"; d="scan'208";a="17907963"
Received: from erie.cs.uoguelph.ca (HELO zcs3.mail.uoguelph.ca) ([131.104.91.206]) by esa-annu.net.uoguelph.ca with ESMTP; 09 Mar 2013 11:27:32 -0500
Received: from zcs3.mail.uoguelph.ca (localhost.localdomain [127.0.0.1]) by zcs3.mail.uoguelph.ca (Postfix) with ESMTP id BBA7BB4036; Sat, 9 Mar 2013 11:27:32 -0500 (EST)
Date: Sat, 9 Mar 2013 11:27:32 -0500 (EST)
From: Rick Macklem
To: Garrett Wollman
Message-ID: <1639798917.3728142.1362846452693.JavaMail.root@erie.cs.uoguelph.ca>
In-Reply-To: <20794.38381.221980.5038@hergotha.csail.mit.edu>
Subject: Re: NFS DRC size
MIME-Version: 1.0
Content-Type: text/plain; charset=utf-8
Content-Transfer-Encoding: 7bit
X-Originating-IP: [172.17.91.202]
X-Mailer: Zimbra 6.0.10_GA_2692 (ZimbraWebClient - FF3.0 (Win)/6.0.10_GA_2692)
Cc: freebsd-fs@freebsd.org, freebsd-net@freebsd.org
X-BeenThere: freebsd-fs@freebsd.org
X-Mailman-Version: 2.1.14
Precedence: list
List-Id: Filesystems
List-Unsubscribe: ,
List-Archive: 
List-Post: 
List-Help: 
List-Subscribe: ,
X-List-Received-Date: Sat, 09 Mar 2013 16:27:35 -0000

Garrett Wollman wrote:
> < said:
> 
> > The cached replies are copies of the mbuf list done via m_copym().
> > As such, the clusters in these replies won't be free'd (ref cnt -> 0)
> > until the cache is trimmed (nfsrv_trimcache() gets called after the
> > TCP layer has received an ACK for receipt of the reply from the
> > client).
> 
> I wonder if this bit is even working at all. In my experience, the
> size of the DRC quickly grows under load up to the maximum (or
> actually, slightly beyond), and never drops much below that level. On
> my production server right now, "nfsstat -se" reports:
> 
Well, once you add the patches and turn vfs.nfsd.tcphighwater up, it
will only trim the cache when that highwater mark is exceeded. When it
does the trim, the size does drop for the simple testing I do with a
single client. (I'll take another look at drc3.patch and see if I can
spot anywhere this might be broken, although my hunch is that you have
a lot of TCP connections and enough activity that it rapidly grows
back up to the limit.)

The fact that it trims down to around the highwater mark basically
indicates this is working. If it wasn't throwing away replies where
the receipt has been ack'd at the TCP level, the cache would grow very
large, since they would only be discarded after a loonnngg timeout
(12 hours unless you've changed NFSRVCACHE_TCPTIMEOUT in
sys/fs/nfs/nfs.h).
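To make the trim policy concrete, here is a stripped-down sketch of the
idea. This is not the code from sys/fs/nfs/nfs_nfsdcache.c; the types
and helper names below are stand-ins made up for illustration (the real
trim code also handles UDP, the hash lists and the locking), but it
shows why nothing gets thrown away until the highwater mark is exceeded
and why an un-ack'd reply can sit there for the full timeout:

/*
 * Illustrative sketch only -- stand-in types and names, not the
 * kernel's.  A cached reply is a copy of the reply mbuf list
 * (m_copym() in the real code), so its clusters stay referenced
 * until the entry is freed here.
 */
#include <stdbool.h>
#include <stdlib.h>

struct drc_entry {
	struct drc_entry *next;
	bool	tcp_acked;	/* client's TCP layer has ack'd the reply */
	long	timestamp;	/* when the reply was cached */
	void	*reply_copy;	/* stand-in for the m_copym()'d mbuf list */
};

static int	drc_size;	/* current number of cached entries */
static int	tcphighwater;	/* think vfs.nfsd.tcphighwater */
static long	tcptimeout;	/* think NFSRVCACHE_TCPTIMEOUT (12 hours) */

/*
 * Do nothing until the cache is over the highwater mark, then throw
 * away replies whose receipt has been ack'd at the TCP level, plus
 * anything older than the (very long) timeout.
 */
static void
drc_trim(struct drc_entry **headp, long now)
{
	struct drc_entry *e;

	if (drc_size <= tcphighwater)
		return;
	while ((e = *headp) != NULL) {
		if (e->tcp_acked || now - e->timestamp > tcptimeout) {
			*headp = e->next;	/* unlink from the list */
			free(e->reply_copy);	/* m_freem() in the kernel */
			free(e);
			drc_size--;
		} else
			headp = &e->next;
	}
}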
> Server Info:
>   Getattr   Setattr    Lookup  Readlink      Read     Write    Create    Remove
>  13036780    359901   1723623      3420  36397693  12385668    346590    109984
>    Rename      Link   Symlink     Mkdir     Rmdir   Readdir  RdirPlus    Access
>     45173        16    116791     14192      1176        24  12876747   3398533
>     Mknod    Fsstat    Fsinfo  PathConf    Commit   LookupP   SetClId SetClIdCf
>         0      2703     14992      7502   1329196         0         1         1
>      Open  OpenAttr OpenDwnGr  OpenCfrm DelePurge   DeleRet     GetFH      Lock
>    263034         0         0    263019         0         0    545104         0
>     LockT     LockU     Close    Verify   NVerify     PutFH  PutPubFH PutRootFH
>         0         0    263012         0         0  23753375         0         1
>     Renew RestoreFH    SaveFH   Secinfo RelLckOwn  V4Create
>         2    263006    263033         0         0         0
> Server:
> Retfailed    Faults   Clients
>         0         0         1
> OpenOwner     Opens LockOwner     Locks    Delegs
>        56        10         0         0         0
> Server Cache Stats:
>    Inprog      Idem  Non-idem    Misses CacheSize   TCPPeak
>         0         0         0  81714128     60997     61017
> 
> It's only been up for about the last 24 hours. Should I be setting
> the size limit to something truly outrageous, like 200,000? (I'd
> definitely need to deal with the mbuf cluster issue then!) The
> average request rate over this time is about 1000/s, but that includes
> several episodes of high-cpu spinning (which I resolved by increasing
> the DRC limit).
> 
It is the number of TCP connections from clients that determines how
much gets cached, not the request rate. For TCP, a scheme like LRU
doesn't work, because RPC retries (as opposed to TCP segment
retransmits) only happen long after the initial RPC request. (Usually
after a TCP connection has broken and the client has established a new
connection, although some NFSv3 over TCP clients will retry an RPC
after a long timeout.) The cache needs to hold the last N RPC replies
for each TCP connection and discard them when further traffic on the
TCP connection indicates that the connection is still working. (Some
NFSv3 over TCP servers don't guarantee to generate a reply for an RPC
when resource constrained, but the FreeBSD one always sends a reply,
except for NFSv2, where it will close down the TCP connection when it
has no choice. I doubt any client is doing NFSv2 over TCP, so I don't
consider this relevant.)

If the CPU is spinning in nfsrc_trimcache() a lot, increasing
vfs.nfsd.tcphighwater should decrease that, but with an increase in
mbuf cluster allocation.

If there is a lot of contention for mutexes, increasing the size of
the hash table might help. The drc3.patch bumped the hash table from
20->200, but that would still be about 300 entries per hash list and
one mutex for those 300 entries, assuming the hash function is working
well. Increasing it only adds list head pointers and mutexes. (It's
NFSRVCACHE_HASHSIZE in sys/fs/nfs/nfsrvcache.h.) Unfortunately,
increasing it requires a kernel rebuild/reboot. Maybe the patch for
head should change the size of the hash table when
vfs.nfsd.tcphighwater is set much larger? (Not quite trivial and will
probably result in a short stall of the nfsd threads, since all the
entries will need to be rehashed/moved to new lists, but could be
worth the effort.)
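For what it's worth, here is roughly what I mean by "only adds list
head pointers and mutexes". The structure and names below are made up
for illustration, with userland pthread mutexes standing in for the
kernel's struct mtx (the real definitions live in
sys/fs/nfs/nfsrvcache.h and nfs_nfsdcache.c). The table is just an
array of list heads, each protected by its own mutex, so bumping the
hash size shortens the per-bucket lists and spreads the lock
contention without growing the cached entries themselves:

/* Illustrative sketch only -- stand-in names, not the kernel structures. */
#include <sys/queue.h>
#include <pthread.h>
#include <stdint.h>

#define	HASHSIZE	200	/* think NFSRVCACHE_HASHSIZE after drc3.patch */

struct drc_entry {
	LIST_ENTRY(drc_entry) link;
	uint32_t	xid;	/* RPC transaction id, used as the hash key */
	/* ... cached reply mbufs, timestamps, etc ... */
};

/* One list head and one mutex per bucket; that is all a bigger table costs. */
static LIST_HEAD(drc_bucket, drc_entry) hashtbl[HASHSIZE];
static pthread_mutex_t hashlock[HASHSIZE];

static void
drc_hashinit(void)
{
	for (int i = 0; i < HASHSIZE; i++) {
		LIST_INIT(&hashtbl[i]);
		pthread_mutex_init(&hashlock[i], NULL);
	}
}

/*
 * With ~61,000 cached entries and 200 buckets, even a good hash
 * leaves ~300 entries on each list, all serialized by one mutex,
 * which is where the contention comes from.
 */
static void
drc_insert(struct drc_entry *e)
{
	uint32_t i = e->xid % HASHSIZE;

	pthread_mutex_lock(&hashlock[i]);
	LIST_INSERT_HEAD(&hashtbl[i], e, link);
	pthread_mutex_unlock(&hashlock[i]);
}

Resizing that on the fly when vfs.nfsd.tcphighwater is bumped would
just mean allocating a bigger array of these heads/mutexes and moving
every entry to its new bucket, which is why the nfsd threads would see
a short stall while it happens.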
> Meanwhile, some relevant bits from sysctl:
> 
> vfs.nfsd.udphighwater: 500
> vfs.nfsd.tcphighwater: 61000
> vfs.nfsd.minthreads: 16
> vfs.nfsd.maxthreads: 64
> vfs.nfsd.threads: 64
> vfs.nfsd.request_space_used: 1416
> vfs.nfsd.request_space_used_highest: 4284672
> vfs.nfsd.request_space_high: 47185920
> vfs.nfsd.request_space_low: 31457280
> vfs.nfsd.request_space_throttled: 0
> vfs.nfsd.request_space_throttle_count: 0
> 
> (I'd actually like to put maxthreads back up at 256, which is where I
> had it during testing, but I need to test that the jumbo-frames issue
> is fixed first. I did pre-production testing on a non-jumbo network.)
> 
> -GAWollman
> 
Well, the DRC will try to cache replies until the client's TCP layer
acknowledges receipt of the reply. It is hard to say how many replies
that is for a given TCP connection, since it is a function of the
level of concurrency in the client (# of nfsiod threads in the FreeBSD
client). I'd guess it's somewhere between 1<->20? Multiply that by the
number of TCP connections from all clients and you have about how big
the server's DRC will be. (Some clients use a single TCP connection
for all mounts on the client whereas others use a separate TCP
connection for each mount point.)

When ivoras@ and I have a patch for head, it should probably allow the
DRC to be disabled for TCP mounts (by setting vfs.nfsd.tcphighwater
== -1?). I don't really like the idea, but I can see the argument that
TCP maintains a reliable enough RPC transport that the DRC isn't
needed.

rick

> _______________________________________________
> freebsd-net@freebsd.org mailing list
> http://lists.freebsd.org/mailman/listinfo/freebsd-net
> To unsubscribe, send any mail to "freebsd-net-unsubscribe@freebsd.org"
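ps: To put rough numbers on the sizing guess above, here's the
back-of-the-envelope arithmetic. It is illustrative only; the
connection count and the 15 replies/connection are made-up numbers,
and one cluster per cached reply is just a lower bound, since large
replies are longer mbuf chains:

/* Back-of-the-envelope DRC sizing; all the inputs here are made up. */
#include <stdio.h>

int
main(void)
{
	int conns = 4000;		/* hypothetical # of client TCP connections */
	int replies_per_conn = 15;	/* the 1<->20 guess, set by client concurrency */
	int clustersize = 2048;		/* MCLBYTES; at least one cluster per cached reply */

	int entries = conns * replies_per_conn;
	printf("estimated DRC entries: %d\n", entries);
	printf("mbuf clusters held (lower bound): %d (~%d Mbytes)\n",
	    entries, entries * clustersize / (1024 * 1024));
	return (0);
}

With numbers like that, it is easy to see how a tcphighwater of 61000
gets hit and stays hit on a busy server.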