From: Garrett Wollman <wollman@hergotha.csail.mit.edu>
To: Rick Macklem
Cc: Mark Schouten, freebsd-net@FreeBSD.org
Subject: Re: Frequent hickups on the networking layer
Date: Wed, 29 Apr 2015 01:08:00 -0400
Message-ID: <21824.26416.855441.21454@hergotha.csail.mit.edu>
In-Reply-To: <137094161.27589033.1430255162390.JavaMail.root@uoguelph.ca>
References: <4281350517-9417@kerio.tuxis.nl> <137094161.27589033.1430255162390.JavaMail.root@uoguelph.ca>

Rick Macklem said:

> There have been email list threads discussing how allocating 9K jumbo
> mbufs will fragment the KVM (kernel virtual memory) used for mbuf
> cluster allocation and cause grief.

The problem is not KVA fragmentation -- the clusters come from a
separate map, which should prevent that -- it's that the clusters have
to be physically contiguous, and an active machine is going to have
trouble finding that much contiguous physical memory.  The fact that
9k is a goofy size (two pages plus a little bit) doesn't help matters.

The other side, as Neel and others have pointed out, is that it's
beneficial for the hardware to have a big chunk of physically
contiguous memory to dump packets into, especially with various kinds
of receive-side offloading.

I see two solutions to this, but I don't have the time or resources
(or, frankly, the need) to implement them (and both are probably
required, for different situations):

1) Reserve a big chunk of physical memory early on for big clusters.
How much is needed will depend on the application and the particular
network interface hardware, but think in terms of megabytes or (on a
big server) gigabytes -- big enough to be mapped as superpages on
hardware where that's beneficial.  If you have aggressive LRO, "big
clusters" might be 64k or larger.
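To make (1) concrete, here is a minimal sketch of what a boot-time
reservation might look like.  Every "bigclus" name below is invented
for illustration -- nothing in the tree does this -- and a real
implementation would carve the chunk into clusters and feed them to a
dedicated UMA zone:

    #include <sys/param.h>
    #include <sys/systm.h>
    #include <sys/kernel.h>
    #include <sys/malloc.h>

    static MALLOC_DEFINE(M_BIGCLUS, "bigclus", "reserved big-cluster pool");

    /* Pool size, settable from loader.conf; 64MB is an arbitrary default. */
    static unsigned long bigclus_pool_size = 64UL * 1024 * 1024;
    TUNABLE_ULONG("hw.bigclus.pool_size", &bigclus_pool_size);

    static void *bigclus_pool;

    static void
    bigclus_init(void *arg __unused)
    {
            /*
             * Grab one physically contiguous chunk while memory is
             * still unfragmented, 2MB-aligned so it can be mapped
             * with superpages on amd64.
             */
            bigclus_pool = contigmalloc(bigclus_pool_size, M_BIGCLUS,
                M_WAITOK, 0, ~(vm_paddr_t)0, 2 * 1024 * 1024, 0);
    }
    SYSINIT(bigclus_reserve, SI_SUB_KMEM, SI_ORDER_ANY, bigclus_init, NULL);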
2) Use the IOMMU -- if it's available, which it won't be when running
under a hypervisor that's already using it for passthrough -- to
obviate the need for physically contiguous pages; the problem then
reduces to KVA fragmentation, which is easier to avoid in the
allocator.

> As far as I know (just from email discussion, never used them myself),
> you can either stop using jumbo packets or switch to a different net
> interface that doesn't allocate 9K jumbo mbufs (doing the receives of
> jumbo packets into a list of smaller mbuf clusters).

Or just hack the driver not to use them.  For the Intel drivers this
is easy, and at least for the hardware I have there's no benefit to
using 9k clusters over 4k; for Chelsio it's quite a bit harder.
Sketches of both points follow below.

-GAWollman
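On point (2), the effect shows up at the busdma level.  A tag that
demands a single 9k segment, like the illustrative one below, can only
be satisfied without an IOMMU by a physically contiguous 9k cluster
(or a bounce copy); an IOMMU-backed busdma can instead remap scattered
pages so they appear contiguous in bus address space.  The parameters
here are illustrative, not taken from any particular driver:

    #include <sys/param.h>      /* MJUM9BYTES */
    #include <sys/bus.h>
    #include <machine/bus.h>

    static bus_dma_tag_t rx_tag;

    static int
    rx_tag_create(device_t dev)
    {
            return (bus_dma_tag_create(bus_get_dma_tag(dev),
                1, 0,                   /* alignment, boundary */
                BUS_SPACE_MAXADDR,      /* lowaddr */
                BUS_SPACE_MAXADDR,      /* highaddr */
                NULL, NULL,             /* filter, filterarg */
                MJUM9BYTES,             /* maxsize: one 9k cluster */
                1,                      /* nsegments: one contiguous segment */
                MJUM9BYTES,             /* maxsegsize */
                0,                      /* flags */
                NULL, NULL,             /* lockfunc, lockfuncarg */
                &rx_tag));
    }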
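And on the driver hack: in the Intel drivers it amounts to picking the
receive cluster size by frame size but stopping at a page instead of
ever choosing MJUM9BYTES.  The function below is invented to
illustrate the idea; the real code varies by driver and branch:

    #include <sys/param.h>      /* MCLBYTES, MJUMPAGESIZE, MJUM9BYTES */

    /*
     * Cap receive clusters at one page.  Jumbo frames then arrive in
     * chains of 4k clusters, none of which needs a multi-page
     * physically contiguous allocation.
     */
    static int
    rx_cluster_size(int max_frame_size)
    {
            if (max_frame_size <= MCLBYTES)
                    return (MCLBYTES);      /* ordinary 2k cluster */
            return (MJUMPAGESIZE);          /* one page; was MJUM9BYTES */
    }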