From owner-freebsd-net@FreeBSD.ORG Tue Mar 12 04:30:02 2013
Date: Tue, 12 Mar 2013 00:29:59 -0400
From: Garrett Wollman
To: Rick Macklem
Cc: freebsd-net@freebsd.org, andre@freebsd.org
Subject: Re: Limits on jumbo mbuf cluster allocation
Message-ID: <20798.44871.601547.24628@hergotha.csail.mit.edu>
In-Reply-To: <22122027.3796089.1363051545440.JavaMail.root@erie.cs.uoguelph.ca>
References: <201303111605.r2BG5I6v073052@hergotha.csail.mit.edu>
 <22122027.3796089.1363051545440.JavaMail.root@erie.cs.uoguelph.ca>

Rick Macklem said:

> To be honest, I'd consider seeing a lot of non-empty receive queues
> for TCP connections to the NFS server to be an indication that it is
> near/at its load limit.  (Sure, if you do netstat a lot, you will
> occasionally see a non-empty queue here or there, but I would not
> expect to see a lot of them non-empty a lot of the time.)  If that is
> the case, then the question becomes "what is the bottleneck?".  Below
> I suggest getting rid of the DRC in case it is the bottleneck for
> your server.

The problem is not the DRC in "normal" operation, but the DRC when it
gets into the livelocked state.  I think we've talked about a number of
solutions to the livelock problem, but I haven't managed to implement
or test these ideas yet.  I have a duplicate server up now, so I hope
to do some testing this week.

In normal operation, the server is mostly idle, and the nfsd threads
that aren't themselves idle are sleeping deep in ZFS, waiting for
something to happen on disk.  When the arrival rate exceeds the rate at
which requests are cleared from the DRC, *all* of the nfsd threads end
up spinning: either waiting for the DRC mutex, or walking the DRC and
finding that nothing can be released yet.
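In pseudo-code, the pattern looks roughly like the following.  (This is
a simplified sketch with made-up names -- drc_entry, drc_lru, drc_mtx,
drc_highwater, and so on -- not the actual nfsrvcache code.)

#include <sys/param.h>
#include <sys/systm.h>
#include <sys/kernel.h>
#include <sys/lock.h>
#include <sys/malloc.h>
#include <sys/mutex.h>
#include <sys/queue.h>
#include <sys/time.h>

struct drc_entry {
	TAILQ_ENTRY(drc_entry)	de_lru;		/* LRU linkage, oldest first */
	time_t			de_timestamp;	/* when the reply was cached */
	/* ... xid, cached reply mbufs, etc. ... */
};

static TAILQ_HEAD(, drc_entry) drc_lru = TAILQ_HEAD_INITIALIZER(drc_lru);
static struct mtx drc_mtx;
MTX_SYSINIT(drc_mtx, &drc_mtx, "drc sketch", MTX_DEF);
MALLOC_DEFINE(M_DRCSKETCH, "drcsketch", "DRC sketch entries");

static int drc_count;		/* current number of cached replies */
static int drc_highwater;	/* the tcphighwater-style limit */

#define	DRC_TIMEOUT	120	/* seconds; illustrative only */

/*
 * Every nfsd thread that notices drc_count > drc_highwater ends up in
 * here, serialized on drc_mtx, walking the whole cache.
 */
static void
drc_trim_sketch(void)
{
	struct drc_entry *ent, *nent;

	mtx_lock(&drc_mtx);
	TAILQ_FOREACH_SAFE(ent, &drc_lru, de_lru, nent) {
		if (drc_count <= drc_highwater)
			break;
		/* Only entries that have timed out may be dropped. */
		if (ent->de_timestamp + DRC_TIMEOUT > time_uptime)
			continue;
		TAILQ_REMOVE(&drc_lru, ent, de_lru);
		drc_count--;
		free(ent, M_DRCSKETCH);
	}
	mtx_unlock(&drc_mtx);
}

When none of the entries has timed out yet, that walk frees nothing,
the cache stays over the limit, and the next RPC sends the thread (and
every other nfsd thread) straight back through the same loop.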
*That* is the livelock condition -- the spinning that takes over all
the nfsd threads is what causes the receive buffers to build up, and
the large queues then maintain the livelocked condition -- and that is
why it clears *immediately* when the DRC size is increased.  (It's
possible to reproduce this condition on a loaded server by simply
reducing the tcphighwater to less than the current size.)
Unfortunately, I'm at the NFSRC_FLOODSIZE limit right now (64k), so
there is no room for further increases until I recompile the kernel.
It's probably a bug that the sysctl definition in drc3.patch doesn't
check the new value against this limit.

Note that I'm currently running 64 nfsd threads on a 12-core
(24-thread) system.  In the livelocked condition, as you would expect,
the system goes to 100% CPU utilization and the load average peaks out
at 64, while goodput goes to nearly nil.

> For either A or B, I'd suggest that you disable the DRC for TCP
> connections (email if you need a patch for that), which will have a
> couple of effects:

I would like to see your patch, since it's more likely to be correct
than one I might dream up.

The alternative solution is twofold: first, nfsrv_trimcache() needs to
do something to ensure forward progress, even when that means dropping
something that hasn't timed out yet; and second, the server code needs
to ensure that nfsrv_trimcache() is only executing on one thread at a
time.  An easy way to do the first part would be to maintain an LRU
queue for TCP in addition to the UDP LRU, and just blow away the first
N (>NCPU) entries on the queue if, after checking all the TCP replies,
the DRC is still larger than the limit.  The second part is just an
atomic_cmpset_int().

-GAWollman
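P.S.  For concreteness, here is roughly the shape of what I have in
mind -- an untested sketch that reuses the made-up names from the
fragment above (and assumes cached TCP replies are linked, via their
de_lru field, onto a new drc_tcp_lru list instead of drc_lru), not a
patch against the real nfsrvcache code.

#include <sys/smp.h>		/* mp_ncpus */
#include <machine/atomic.h>	/* atomic_cmpset_int(), atomic_store_rel_int() */

/* LRU of cached TCP replies, oldest first (the UDP side already has one). */
static TAILQ_HEAD(, drc_entry) drc_tcp_lru = TAILQ_HEAD_INITIALIZER(drc_tcp_lru);

static volatile u_int drc_trim_busy;	/* 0 = idle, 1 = a thread is trimming */

static void
drc_trim_forced_sketch(void)
{
	struct drc_entry *ent;
	int n;

	/*
	 * Second part: let only one thread at a time do any trimming;
	 * everyone else returns immediately instead of piling up on
	 * the DRC mutex.
	 */
	if (atomic_cmpset_int(&drc_trim_busy, 0, 1) == 0)
		return;

	/* ... the usual timeout-based pass would run here first ... */

	/*
	 * First part: guarantee forward progress.  If the cache is
	 * still over the limit, drop the N oldest TCP entries even
	 * though they haven't timed out yet; N just needs to be
	 * comfortably larger than the number of CPUs.
	 */
	mtx_lock(&drc_mtx);
	for (n = 0; n < 2 * mp_ncpus && drc_count > drc_highwater; n++) {
		ent = TAILQ_FIRST(&drc_tcp_lru);
		if (ent == NULL)
			break;
		TAILQ_REMOVE(&drc_tcp_lru, ent, de_lru);
		/* (a real patch would also unhook it from its hash chain) */
		drc_count--;
		free(ent, M_DRCSKETCH);
	}
	mtx_unlock(&drc_mtx);

	atomic_store_rel_int(&drc_trim_busy, 0);
}

Dropping replies that haven't timed out obviously weakens the replay
protection for those particular requests, but that seems like a much
better failure mode than livelocking every nfsd thread on the machine.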