Date: Sat, 9 Mar 2013 12:34:50 -0500
From: Garrett Wollman
To: Rick Macklem
Cc: freebsd-net@freebsd.org
Subject: Re: Limits on jumbo mbuf cluster allocation
Message-ID: <20795.29370.194678.963351@hergotha.csail.mit.edu>
In-Reply-To: <1700261042.3728432.1362847830447.JavaMail.root@erie.cs.uoguelph.ca>
References: <20794.37617.822910.93537@hergotha.csail.mit.edu>
	<1700261042.3728432.1362847830447.JavaMail.root@erie.cs.uoguelph.ca>

Rick Macklem said:

> I suspect this indicates that it isn't mutex contention, since the
> threads would block waiting for the mutex for that case, I think?

No, because our mutexes are adaptive, so each thread spins for a while
before blocking.  With the current implementation, all of them end up
doing this in pretty close to lock-step.

> (Bumping up NFSRVCACHE_HASHSIZE can't hurt if/when you get the chance.)

I already have it set to 129 (up from 20); I could see putting it up to,
say, 1023.  It would be nice to have a sysctl for the maximum chain
length, to see how bad it's getting (and whether the hash function is
actually effective).

> I've thought about this. My concern is that the separate thread might
> not keep up with the trimming demand. If that occurred, the cache would
> grow veryyy laarrggge, with effects like running out of mbuf clusters.

At a minimum, once one nfsd thread is committed to doing the cache trim,
a flag should be set to discourage other threads from trying to do it.
Having them all spinning their wheels punishes the clients much too
much.

> By having the nfsd threads do it, they slow down, which provides feedback
> to the clients (slower RPC replies->generate fewer request->less to cache).
> (I think you are probably familiar with the generic concept that a system
> needs feedback to remain stable.
> An M/M/1 queue with open arrivals and
> no feedback to slow the arrival rate explodes when the arrival rate
> approaches the service rate, etc and so on...)

Unfortunately, the feedback channel that I have is: one user starts 500
virtual machines accessing a filesystem on the server -> other users of
this server see their goodput go to zero -> everyone sends in angry
trouble tickets -> I increase the DRC size manually.  It would be nice
if, by the time I next want to take a vacation, I had this figured out.
I'm OK with throwing memory at the problem -- these servers have 96 GB
and can hold up to 144 GB -- so long as I can find a tuning that
provides stability and consistent, reasonable performance for the users.

> The nfs server does soreserve(so, sb_max_adj, sb_max_adj); I can't
> recall exactly why it is that way, except that it needs to be large
> enough to handle the largest RPC request a client might generate.
> I should take another look at this, in case sb_max_adj is now
> too large?

It probably shouldn't be larger than net.inet.tcp.{send,recv}buf_max,
and the read and write sizes that are negotiated should be chosen so
that a whole RPC can fit in that space.  If that's too hard for whatever
reason, nfsd should at least log a message saying "hey, your socket
buffer limits are too small, I'm going to ignore them".

I've appended a few rough sketches of the sort of thing I have in mind.

-GAWollman
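
P.S. For the trim, the guard I have in mind looks roughly like this
(an untested sketch only; nfsrc_trim_inprog and nfsrc_do_trim() are
made-up names, not the actual DRC code):

/*
 * Sketch: allow only one nfsd thread at a time to perform the DRC trim.
 * A thread that finds a trim already in progress just returns and goes
 * back to serving RPCs instead of contending for the cache mutex.
 */
#include <sys/types.h>
#include <machine/atomic.h>

static void nfsrc_do_trim(void);	/* made-up name for the real trim work */

static volatile u_int nfsrc_trim_inprog;	/* 0 = idle, 1 = trim running */

static void
nfsrc_maybe_trim(void)
{

	/* Try to become the trimming thread; bail out if another one is. */
	if (atomic_cmpset_acq_int(&nfsrc_trim_inprog, 0, 1) == 0)
		return;
	nfsrc_do_trim();
	atomic_store_rel_int(&nfsrc_trim_inprog, 0);
}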
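
P.P.S. The chain-length statistic could be exported with a read-only
sysctl along these lines (again just a sketch: drc_entry, drc_hashtbl,
and DRC_HASHSIZE are placeholders for the real nfsrvcache structures,
and the walk would of course have to be done under the cache mutex):

#include <sys/param.h>
#include <sys/kernel.h>
#include <sys/queue.h>
#include <sys/sysctl.h>

SYSCTL_DECL(_vfs_nfsd);

/* Placeholder hash table; the real DRC has its own structures. */
#define	DRC_HASHSIZE	129
struct drc_entry {
	LIST_ENTRY(drc_entry)	de_hash;
	/* ... */
};
static LIST_HEAD(, drc_entry) drc_hashtbl[DRC_HASHSIZE];

static int
sysctl_drc_maxchainlen(SYSCTL_HANDLER_ARGS)
{
	struct drc_entry *de;
	int i, len, maxlen;

	/* Walk every hash bucket and remember the longest chain seen. */
	maxlen = 0;
	for (i = 0; i < DRC_HASHSIZE; i++) {
		len = 0;
		LIST_FOREACH(de, &drc_hashtbl[i], de_hash)
			len++;
		if (len > maxlen)
			maxlen = len;
	}
	return (sysctl_handle_int(oidp, &maxlen, 0, req));
}
SYSCTL_PROC(_vfs_nfsd, OID_AUTO, drc_maxchainlen,
    CTLTYPE_INT | CTLFLAG_RD | CTLFLAG_MPSAFE, NULL, 0,
    sysctl_drc_maxchainlen, "I",
    "Longest hash chain in the NFS duplicate request cache");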
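
P.P.P.S. And for the socket-buffer reservation, what I'd like the server
to do is roughly the following (sketch only: nfsd_soreserve() is not a
real function, and the two nfsd_tcp_* externs just stand in for whatever
variables back net.inet.tcp.{send,recv}buf_max):

/*
 * Sketch: reserve socket buffer space for an NFS TCP connection, but
 * don't ask for more than the administrator's TCP buffer limits, and
 * complain if those limits can't hold a maximum-sized RPC.
 */
#include <sys/param.h>
#include <sys/systm.h>
#include <sys/socket.h>
#include <sys/socketvar.h>

extern u_long	sb_max_adj;		/* from uipc_sockbuf.c */
extern u_long	nfsd_tcp_sendbuf_max;	/* stand-in, see above */
extern u_long	nfsd_tcp_recvbuf_max;	/* stand-in, see above */

int
nfsd_soreserve(struct socket *so, u_long maxrpclen)
{
	u_long sndcc, rcvcc;

	/* Start from the smaller of sb_max_adj and the TCP buffer limits. */
	sndcc = ulmin(sb_max_adj, nfsd_tcp_sendbuf_max);
	rcvcc = ulmin(sb_max_adj, nfsd_tcp_recvbuf_max);
	if (sndcc < maxrpclen || rcvcc < maxrpclen) {
		printf("nfsd: socket buffer limits smaller than the "
		    "maximum RPC size; ignoring them\n");
		sndcc = ulmax(sndcc, maxrpclen);
		rcvcc = ulmax(rcvcc, maxrpclen);
	}
	return (soreserve(so, sndcc, rcvcc));
}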