From owner-freebsd-net@FreeBSD.ORG Tue Mar 12 04:30:02 2013
Date: Tue, 12 Mar 2013 00:29:59 -0400
From: Garrett Wollman
To: Rick Macklem
Cc: freebsd-net@freebsd.org, andre@freebsd.org
Subject: Re: Limits on jumbo mbuf cluster allocation
Message-ID: <20798.44871.601547.24628@hergotha.csail.mit.edu>
In-Reply-To: <22122027.3796089.1363051545440.JavaMail.root@erie.cs.uoguelph.ca>
References: <201303111605.r2BG5I6v073052@hergotha.csail.mit.edu>
 <22122027.3796089.1363051545440.JavaMail.root@erie.cs.uoguelph.ca>

Rick Macklem said:

> To be honest, I'd consider seeing a lot of non-empty receive queues
> for TCP connections to the NFS server to be an indication that it is
> near/at its load limit.  (Sure, if you do netstat a lot, you will
> occasionally see a non-empty queue here or there, but I would not
> expect to see a lot of them non-empty a lot of the time.)  If that is
> the case, then the question becomes "what is the bottleneck?".  Below
> I suggest getting rid of the DRC in case it is the bottleneck for
> your server.

The problem is not the DRC in "normal" operation, but the DRC when it
gets into the livelocked state.  I think we've talked about a number of
solutions to the livelock problem, but I haven't managed to implement
or test these ideas yet.  I have a duplicate server up now, so I hope
to do some testing this week.

In normal operation, the server is mostly idle, and the nfsd threads
that aren't themselves idle are sleeping deep in ZFS, waiting for
something to happen on disk.  When the arrival rate exceeds the rate at
which requests are cleared from the DRC, *all* of the nfsd threads end
up spinning: either waiting for the DRC mutex, or walking the DRC and
finding that nothing can be released yet.
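In pseudo-code, the pattern looks roughly like the following.  (This is
a simplified sketch with made-up names -- drc_entry, drc_lru, drc_mtx,
drc_highwater, and so on -- not the actual nfsrvcache code.)

#include <sys/param.h>
#include <sys/systm.h>
#include <sys/kernel.h>
#include <sys/lock.h>
#include <sys/malloc.h>
#include <sys/mutex.h>
#include <sys/queue.h>
#include <sys/time.h>

struct drc_entry {
	TAILQ_ENTRY(drc_entry)	de_lru;		/* LRU linkage, oldest first */
	time_t			de_timestamp;	/* when the reply was cached */
	/* ... xid, cached reply mbufs, etc. ... */
};

static TAILQ_HEAD(, drc_entry) drc_lru = TAILQ_HEAD_INITIALIZER(drc_lru);
static struct mtx drc_mtx;
MTX_SYSINIT(drc_mtx, &drc_mtx, "drc sketch", MTX_DEF);
MALLOC_DEFINE(M_DRCSKETCH, "drcsketch", "DRC sketch entries");

static int drc_count;		/* current number of cached replies */
static int drc_highwater;	/* the tcphighwater-style limit */

#define	DRC_TIMEOUT	120	/* seconds; illustrative only */

/*
 * Every nfsd thread that notices drc_count > drc_highwater ends up in
 * here, serialized on drc_mtx, walking the whole cache.
 */
static void
drc_trim_sketch(void)
{
	struct drc_entry *ent, *nent;

	mtx_lock(&drc_mtx);
	TAILQ_FOREACH_SAFE(ent, &drc_lru, de_lru, nent) {
		if (drc_count <= drc_highwater)
			break;
		/* Only entries that have timed out may be dropped. */
		if (ent->de_timestamp + DRC_TIMEOUT > time_uptime)
			continue;
		TAILQ_REMOVE(&drc_lru, ent, de_lru);
		drc_count--;
		free(ent, M_DRCSKETCH);
	}
	mtx_unlock(&drc_mtx);
}

When none of the entries has timed out yet, that walk frees nothing,
the cache stays over the limit, and the next RPC sends the thread (and
every other nfsd thread) straight back through the same loop.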
*That* is the livelock condition -- the spinning that takes over all
the nfsd threads is what causes the receive buffers to build up, and
the large queues then maintain the livelocked condition -- and that is
why it clears *immediately* when the DRC size is increased.  (It's
possible to reproduce this condition on a loaded server by simply
reducing the tcphighwater to less than the current size.)
Unfortunately, I'm at the NFSRC_FLOODSIZE limit right now (64k), so
there is no room for further increases until I recompile the kernel.
It's probably a bug that the sysctl definition in drc3.patch doesn't
check the new value against this limit.

Note that I'm currently running 64 nfsd threads on a 12-core
(24-thread) system.  In the livelocked condition, as you would expect,
the system goes to 100% CPU utilization and the load average peaks out
at 64, while goodput goes to nearly nil.

> For either A or B, I'd suggest that you disable the DRC for TCP
> connections (email if you need a patch for that), which will have a
> couple of effects:

I would like to see your patch, since it's more likely to be correct
than one I might dream up.

The alternative solution is twofold: first, nfsrv_trimcache() needs to
do something to ensure forward progress, even when that means dropping
something that hasn't timed out yet; and second, the server code needs
to ensure that nfsrv_trimcache() is only executing on one thread at a
time.  An easy way to do the first part would be to maintain an LRU
queue for TCP in addition to the UDP LRU, and just blow away the first
N (>NCPU) entries on the queue if, after checking all the TCP replies,
the DRC is still larger than the limit.  The second part is just an
atomic_cmpset_int().

-GAWollman
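P.S.  For concreteness, here is roughly the shape of what I have in
mind -- an untested sketch that reuses the made-up names from the
fragment above (and assumes cached TCP replies are linked, via their
de_lru field, onto a new drc_tcp_lru list instead of drc_lru), not a
patch against the real nfsrvcache code.

#include <sys/smp.h>		/* mp_ncpus */
#include <machine/atomic.h>	/* atomic_cmpset_int(), atomic_store_rel_int() */

/* LRU of cached TCP replies, oldest first (the UDP side already has one). */
static TAILQ_HEAD(, drc_entry) drc_tcp_lru = TAILQ_HEAD_INITIALIZER(drc_tcp_lru);

static volatile u_int drc_trim_busy;	/* 0 = idle, 1 = a thread is trimming */

static void
drc_trim_forced_sketch(void)
{
	struct drc_entry *ent;
	int n;

	/*
	 * Second part: let only one thread at a time do any trimming;
	 * everyone else returns immediately instead of piling up on
	 * the DRC mutex.
	 */
	if (atomic_cmpset_int(&drc_trim_busy, 0, 1) == 0)
		return;

	/* ... the usual timeout-based pass would run here first ... */

	/*
	 * First part: guarantee forward progress.  If the cache is
	 * still over the limit, drop the N oldest TCP entries even
	 * though they haven't timed out yet; N just needs to be
	 * comfortably larger than the number of CPUs.
	 */
	mtx_lock(&drc_mtx);
	for (n = 0; n < 2 * mp_ncpus && drc_count > drc_highwater; n++) {
		ent = TAILQ_FIRST(&drc_tcp_lru);
		if (ent == NULL)
			break;
		TAILQ_REMOVE(&drc_tcp_lru, ent, de_lru);
		/* (a real patch would also unhook it from its hash chain) */
		drc_count--;
		free(ent, M_DRCSKETCH);
	}
	mtx_unlock(&drc_mtx);

	atomic_store_rel_int(&drc_trim_busy, 0);
}

Dropping replies that haven't timed out obviously weakens the replay
protection for those particular requests, but that seems like a much
better failure mode than livelocking every nfsd thread on the machine.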