From: Jack Vogel
To: Andre Oppermann
Cc: jfv@freebsd.org, freebsd-net@freebsd.org, Garrett Wollman
Date: Fri, 8 Mar 2013 00:31:18 -0800
Subject: Re: Limits on jumbo mbuf cluster allocation

On Thu, Mar 7, 2013 at 11:54 PM, Andre Oppermann wrote:

> On 08.03.2013 08:10, Garrett Wollman wrote:
>
>> I have a machine (actually six of them) with an Intel dual-10G NIC on
>> the motherboard.  Two of them (so far) are connected to a network
>> using jumbo frames, with an MTU a little under 9k, so the ixgbe driver
>> allocates 32,000 9k clusters for its receive rings.  I have noticed,
>> on the machine that is an active NFS server, that it can get into a
>> state where allocating more 9k clusters fails (as reflected in the
>> mbuf failure counters) at a utilization far lower than the configured
>> limits -- in fact, quite close to the number allocated by the driver
>> for its rx ring.  Eventually, network traffic grinds completely to a
>> halt, and if one of the interfaces is administratively downed, it
>> cannot be brought back up again.  There's generally plenty of physical
>> memory free (at least two or three GB).
>
> You have an amd64 kernel running HEAD or 9.x?
>
>> There are no console messages generated to indicate what is going on,
>> and overall UMA usage doesn't look extreme.  I'm guessing that this is
>> a result of kernel memory fragmentation, although I'm a little bit
>> unclear as to how this actually comes about.  I am assuming that this
>> hardware has only limited scatter-gather capability and can't receive
>> a single packet into multiple buffers of a smaller size, which would
>> reduce the requirement for two-and-a-quarter consecutive pages of KVA
>> for each packet.  In actual usage, most of our clients aren't on a
>> jumbo network, so most of the time all the packets will fit into a
>> normal 2k cluster, and we've never observed this issue when the
>> *server* is on a non-jumbo network.
>>
>> Does anyone have suggestions for dealing with this issue?  Will
>> increasing the amount of KVA (to, say, twice physical memory) help
>> things?  It seems to me like a bug that these large packets don't have
>> their own submap to ensure that allocation is always possible when
>> sufficient physical pages are available.
>
> Jumbo pages come directly from the kernel_map, which on amd64 is 512GB,
> so KVA shouldn't be a problem.  Your problem indeed appears to come
> from physical memory fragmentation in pmap.  There is a buddy memory
> allocator at work, but I fear it runs into serious trouble when it has
> to allocate a large number of objects spanning more than 2 contiguous
> pages.  Also, since you're doing NFS serving, almost all memory will be
> in use for file caching.
>
> Running a NIC with jumbo frames enabled gives some interesting
> trade-offs.  Unfortunately most NICs can't have multiple DMA buffer
> sizes on the same receive queue and pick the best size for the incoming
> frame.  That means they need to use the largest jumbo mbuf for all
> receive traffic, even a tiny 40-byte ACK.  The send side is not
> constrained in such a way and tries to use PAGE_SIZE clusters for
> socket buffers whenever it can.
>
> Many, but not all, NICs are able to split a received jumbo frame into
> multiple smaller DMA segments forming an mbuf chain.  The ixgbe
> hardware is capable of doing this, and the driver supports it, but it
> doesn't actively make use of it.
>
> Another issue with many drivers is their inability to deal with mbuf
> allocation failure for their receive DMA ring.  They try to fill it up
> to the maximal ring size and balk on failure.  Rings have become very
> big and usually are a power of two.  The driver could function with a
> partially filled RX ring too, maybe with some performance impact when
> it gets really low.  On every rxeof it tries to refill the ring, so
> when resources become available again it would balance out.  NICs with
> multiple receive queues/rings make this problem even more acute.
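
For illustration, roughly the kind of tolerant refill loop being
described -- an untested sketch with placeholder names, not the actual
ixgbe code:

#include <sys/param.h>
#include <sys/systm.h>
#include <sys/mbuf.h>

/* Placeholder ring state, not the real driver's rx_ring. */
struct rx_ring_sketch {
    int nfree;          /* empty descriptor slots */
    int mbuf_sz;        /* cluster size used by this ring */
    int refill_failed;  /* statistics only */
};

/* Placeholder: load the DMA map and write the RX descriptor. */
void rxr_post_buffer(struct rx_ring_sketch *, struct mbuf *);

static void
rxr_refill(struct rx_ring_sketch *rxr)
{
    struct mbuf *m;

    while (rxr->nfree > 0) {
        m = m_getjcl(M_NOWAIT, MT_DATA, M_PKTHDR, rxr->mbuf_sz);
        if (m == NULL) {
            /* Run short for now; retry on the next rxeof. */
            rxr->refill_failed++;
            break;
        }
        m->m_len = m->m_pkthdr.len = rxr->mbuf_sz;
        rxr_post_buffer(rxr, m);
        rxr->nfree--;
    }
}

The ring would simply run a little short under memory pressure and
catch up on a later rxeof once clusters become available again,
instead of wedging the interface.
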
> A theoretical fix would be to dedicate an entire superpage of 1GB or
> so exclusively to the jumbo frame UMA zone as backing memory.  That
> memory is gone for all other uses though, even if not actually used.
> Allocating the superpage and determining its size would have to be
> done manually by setting loader variables.  I don't see a reasonable
> way to do this with autotuning because it requires advance knowledge
> of the usage patterns.
>
> IMHO the right fix is to strongly discourage use of jumbo clusters
> larger than PAGE_SIZE when the hardware is capable of splitting the
> frame into multiple clusters.  The allocation constraint then is only
> available memory and no longer contiguous pages.  Also the waste
> factor for small frames is much lower.  The performance impact is
> minimal to non-existent.  In addition, drivers shouldn't break down
> when the RX ring can't be filled to the max.
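
The receive side of that doesn't need much.  Roughly an untested
sketch like the one below, where a frame spanning several descriptors
is just chained together and only the first segment keeps the packet
header (descriptor parsing omitted, names are placeholders):

#include <sys/param.h>
#include <sys/mbuf.h>

/*
 * Append one freshly allocated RX segment to the frame being
 * assembled.  Freshly allocated mbufs carry no tags, so clearing
 * M_PKTHDR on the non-head segments is safe here.
 */
static struct mbuf *
rx_chain_append(struct mbuf *head, struct mbuf *m, int seglen)
{
    struct mbuf *tail;

    m->m_len = seglen;
    if (head == NULL) {
        /* First segment of the frame keeps the pkthdr. */
        m->m_pkthdr.len = seglen;
        return (m);
    }
    m->m_flags &= ~M_PKTHDR;
    for (tail = head; tail->m_next != NULL; tail = tail->m_next)
        ;
    tail->m_next = m;
    head->m_pkthdr.len += seglen;
    return (head);
}

On the descriptor marked end-of-packet the caller hands the head to
if_input() as usual, so even a 9k frame never needs anything larger
than a page-sized cluster.
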
> I recently got yelled at for suggesting to remove jumbo clusters
> larger than PAGE_SIZE.  However, your case proves that such jumbo
> frames are indeed their own can of worms and should really only and
> exclusively be used for NICs that have to do jumbo *and* are
> incapable of RX scatter DMA.

I am not strongly opposed to trying the 4k mbuf pool for all larger
sizes.  Garrett, maybe you could try that on your system and see if it
helps; I could envision making this a tunable at some point, perhaps.

Thanks for the input, Andre.

Jack
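
P.S.  For concreteness, an untested sketch of what such a tunable
might look like -- the name "hw.ix.use_4k_clusters" is only a
placeholder, not an existing knob, and the size selection is
simplified:

#include <sys/param.h>
#include <sys/kernel.h>
#include <sys/mbuf.h>

static int ix_use_4k_clusters = 1;
TUNABLE_INT("hw.ix.use_4k_clusters", &ix_use_4k_clusters);

/* Pick the RX cluster size for a given maximum frame size. */
static int
ix_rx_cluster_size(int max_frame)
{
    if (max_frame <= MCLBYTES)
        return (MCLBYTES);       /* the common 2k case */
    if (ix_use_4k_clusters || max_frame <= MJUMPAGESIZE)
        return (MJUMPAGESIZE);   /* rely on RX scatter above a page */
    return (MJUM9BYTES);
}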