From owner-freebsd-net@FreeBSD.ORG Sun Mar 10 02:22:57 2013
Date: Sat, 9 Mar 2013 21:22:49 -0500 (EST)
From: Rick Macklem <rmacklem@uoguelph.ca>
To: Garrett Wollman
Cc: freebsd-net@freebsd.org
Message-ID: <1841214504.3736248.1362882169721.JavaMail.root@erie.cs.uoguelph.ca>
In-Reply-To: <201303091846.r29Ik9jX062596@hergotha.csail.mit.edu>
Subject: Re: Limits on jumbo mbuf cluster allocation
List-Id: Networking and TCP/IP with FreeBSD

Garrett Wollman wrote:
> In article <20795.29370.194678.963351@hergotha.csail.mit.edu>, I
> wrote:
> ><rmacklem@uoguelph.ca> said:
> >> I've thought about this. My concern is that the separate thread might
> >> not keep up with the trimming demand. If that occurred, the cache would
> >> grow veryyy laarrggge, with effects like running out of mbuf clusters.
> >
> >At a minimum, once one nfsd thread is committed to doing the cache
> >trim, a flag should be set to discourage other threads from trying to
> >do it. Having them all spinning their wheels punishes the clients
> >much too much.
>
Yes, I think this is a good idea. The current code acquires the mutex
before updating the once/sec variable, so it would be easy to get
multiple threads in there concurrently.

This is easy to do: define a static variable in nfsrc_trimcache(),
initialized to 0. If it is non-zero, return; otherwise set it non-zero,
do the trimming and set it back to zero before returning. Since this is
just a heuristic to avoid multiple threads doing the trim concurrently,
I think it can be safely done outside of the mutex.

If you need help coding this, just email and I can come up with a quick
patch.
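
To give you an idea, here's an untested sketch of what I mean (the
argument list and the actual trimming code are elided, and "inprog" is
just a placeholder name for the flag):

void
nfsrc_trimcache(/* ... existing arguments ... */)
{
        static int inprog = 0;  /* placeholder name for the flag */

        /*
         * Heuristic only: if another nfsd thread is already in here
         * doing the trim, just return instead of piling up behind the
         * mutex.  The flag is checked/set outside the mutex, so an
         * occasional extra or skipped trim is harmless.
         */
        if (inprog != 0)
                return;
        inprog = 1;

        /* ... existing once/sec check and cache trimming, under the mutex ... */

        inprog = 0;
}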
> Also, it occurs to me that this strategy is subject to livelock. To
> put backpressure on the clients, it is far better to get them to stop
> sending (by advertising a small receive window) than to accept their
> traffic but queue it for a long time. By the time the NFS code gets
> an RPC, the system has already invested so much into it that it should
> be processed as quickly as possible, and this strategy essentially
> guarantees[1] that, once those 2 MB socket buffers start to fill up,
> they will stay filled, sending latency through the roof. If nfsd didn't
> override the usual socket-buffer sizing mechanisms, then sysadmins
> could limit the buffers to ensure a stable response time.
>
> The bandwidth-delay product in our network is somewhere between 12.5
> kB and 125 kB, depending on how the client is connected and what sort
> of latency they experience. The usual theory would suggest that
> socket buffers should be no more than twice that -- i.e., about 256 kB.
>
Well, the code that uses sb_max_adj wasn't written by me (I just cloned
it for the new server). In the author's defence, I believe SB_MAX was
256K when it was written. It was 256K in 2011. I think sb_max_adj was
used because soreserve() fails for a larger value and the code doesn't
check for such a failure. (Yeah, it should be fixed so that it checks
for a failure return from soreserve(). I did so for the client some
time ago. ;-)

Just grep for sb_max_adj. You'll see it sets a variable called "siz".
Make "siz" whatever you want (256K sounds like a good guess). Just make
sure it isn't > sb_max_adj.

The I/O sizes are limited to MAXBSIZE, which is currently 64KB, although
I'd like to increase that to 128KB someday soon. (As you note below, the
largest RPC is slightly bigger than that.)

Btw, net.inet.tcp.{send/recv}buf_max are both 2Mbytes, just like sb_max,
so those don't seem useful in this case?

I'm no TCP guy, so suggestions w.r.t. how big soreserve() should be set
are welcome.

> I'd actually like to see something like WFQ in the NFS server to allow
> me to limit the amount of damage one client or group of clients can
> do without unnecessarily limiting other clients.
>
Sorry, I'll admit I have no idea what WFQ is. (I'll look it up on some
web site someday soon, but obviously can't comment before then.)

Since it is possible to receive RPC requests for a given client from
multiple IP addresses, it is pretty hard for NFS to know which client a
request has come from.

rick

> -GAWollman
>
> [1] The largest RPC is a bit more than 64 KiB (negotiated), so if the
> server gets slow, the 2 MB receive queue will be refilled by the
> client before the server manages to perform the RPC and send a
> response.
> _______________________________________________
> freebsd-net@freebsd.org mailing list
> http://lists.freebsd.org/mailman/listinfo/freebsd-net
> To unsubscribe, send any mail to "freebsd-net-unsubscribe@freebsd.org"