From owner-freebsd-net@FreeBSD.ORG  Fri Mar  8 08:39:47 2013
From: YongHyeon PYUN <pyunyh@gmail.com>
Date: Fri, 8 Mar 2013 17:39:32 +0900
To: Jack Vogel <jfvogel@gmail.com>
Subject: Re: Limits on jumbo mbuf cluster allocation
Message-ID: <20130308083932.GB1442@michelle.cdnetworks.com>
References: <20793.36593.774795.720959@hergotha.csail.mit.edu>
 <20130308075458.GA1442@michelle.cdnetworks.com>
 <CAFOYbckHDeuwmcPZzhewqrAju3GZ8er6nnTVgkNeVhvH4k=ydQ@mail.gmail.com>
Mime-Version: 1.0
Content-Type: text/plain; charset=us-ascii
Content-Disposition: inline
In-Reply-To: <CAFOYbckHDeuwmcPZzhewqrAju3GZ8er6nnTVgkNeVhvH4k=ydQ@mail.gmail.com>
User-Agent: Mutt/1.4.2.3i
Cc: jfv@freebsd.org, freebsd-net@freebsd.org,
 Garrett Wollman <wollman@freebsd.org>

On Fri, Mar 08, 2013 at 12:27:37AM -0800, Jack Vogel wrote:
> On Thu, Mar 7, 2013 at 11:54 PM, YongHyeon PYUN <pyunyh@gmail.com> wrote:
> 
> > On Fri, Mar 08, 2013 at 02:10:41AM -0500, Garrett Wollman wrote:
> > > I have a machine (actually six of them) with an Intel dual-10G NIC on
> > > the motherboard.  Two of them (so far) are connected to a network
> > > using jumbo frames, with an MTU a little under 9k, so the ixgbe driver
> > > allocates 32,000 9k clusters for its receive rings.  I have noticed,
> > > on the machine that is an active NFS server, that it can get into a
> > > state where allocating more 9k clusters fails (as reflected in the
> > > mbuf failure counters) at a utilization far lower than the configured
> > > limits -- in fact, quite close to the number allocated by the driver
> > > for its rx ring.  Eventually, network traffic grinds completely to a
> > > halt, and if one of the interfaces is administratively downed, it
> > > cannot be brought back up again.  There's generally plenty of physical
> > > memory free (at least two or three GB).
> > >
> > > There are no console messages generated to indicate what is going on,
> > > and overall UMA usage doesn't look extreme.  I'm guessing that this is
> > > a result of kernel memory fragmentation, although I'm a little bit
> > > unclear as to how this actually comes about.  I am assuming that this
> > > hardware has only limited scatter-gather capability and can't receive
> > > a single packet into multiple buffers of a smaller size, which would
> > > reduce the requirement for two-and-a-quarter consecutive pages of KVA
> > > for each packet.  In actual usage, most of our clients aren't on a
> > > jumbo network, so most of the time, all the packets will fit into a
> > > normal 2k cluster, and we've never observed this issue when the
> > > *server* is on a non-jumbo network.
> > >
> >
> > AFAIK all Intel controllers generate jumbo frame by concatenating
> > multiple mbufs on RX side so there is no physically contiguous 9KB
> > allocation. I vaguely guess there could be mbuf leakage when jumbo
> > frame is enabled. I would check how driver handles mbuf shortage or
> > frame errors while mbuf concatenation for jumbo frame is in
> > progress.
> >
> 
> No, this is not true, if using a 9K jumbo it will actually use the larger
> mbuf pool, the code has been this way for a little while now.

Ah, thanks for correcting me. If the H/W is still able to support
old-style chaining like em(4), wouldn't it be better to use that
rather than allocating a 9KB buffer? Allocating a 9KB buffer to
handle a pure TCP ACK segment looks inefficient.
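
To put rough numbers on that, here is a minimal sketch in C. It is
not driver code. The two sizes are assumptions that mirror the usual
FreeBSD values of MCLBYTES (2048) and MJUM9BYTES (9216), and the
helper names are made up for illustration. It only compares how many
buffer bytes go unused per received frame under each scheme.

```c
#include <assert.h>

/* Assumed sizes, mirroring the usual MCLBYTES and MJUM9BYTES values. */
#define CLUSTER_2K 2048u
#define CLUSTER_9K 9216u

/* Number of 2K clusters a chained receive would consume for one frame. */
static unsigned chain_clusters(unsigned frame_len)
{
    assert(frame_len > 0 && frame_len <= CLUSTER_9K);
    /* Round up to a whole number of clusters. */
    return (frame_len + CLUSTER_2K - 1) / CLUSTER_2K;
}

/* Unused buffer bytes when the frame lands in a chain of 2K clusters. */
static unsigned chain_waste(unsigned frame_len)
{
    return chain_clusters(frame_len) * CLUSTER_2K - frame_len;
}

/* Unused buffer bytes when the frame lands in a single 9K cluster. */
static unsigned jumbo_waste(unsigned frame_len)
{
    assert(frame_len > 0 && frame_len <= CLUSTER_9K);
    return CLUSTER_9K - frame_len;
}
```

Assuming a 66-byte pure ACK (Ethernet + IPv4 + TCP with timestamps),
chain_waste(66) is 1982 bytes while jumbo_waste(66) is 9150 bytes.
For a 9000-byte frame the chain needs 5 clusters and leaves 1240
bytes unused, against 216 for the single 9K cluster, so the large
buffer only pays off when most frames really are jumbo-sized.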

> 
> Jack
> 
> 
> >
> > > Does anyone have suggestions for dealing with this issue?  Will
> > > increasing the amount of KVA (to, say, twice physical memory) help
> > > things?  It seems to me like a bug that these large packets don't have
> > > their own submap to ensure that allocation is always possible when
> > > sufficient physical pages are available.
> > >
> > > -GAWollman
> >