From: Jack Vogel
To: Andre Oppermann
Cc: jfv@freebsd.org, freebsd-net@freebsd.org, Garrett Wollman
Date: Fri, 8 Mar 2013 00:31:18 -0800
Subject: Re: Limits on jumbo mbuf cluster allocation

On Thu, Mar 7, 2013 at 11:54 PM, Andre Oppermann wrote:

> On 08.03.2013 08:10, Garrett Wollman wrote:
>
>> I have a machine (actually six of them) with an Intel dual-10G NIC on
>> the motherboard.  Two of them (so far) are connected to a network
>> using jumbo frames, with an MTU a little under 9k, so the ixgbe driver
>> allocates 32,000 9k clusters for its receive rings.  I have noticed,
>> on the machine that is an active NFS server, that it can get into a
>> state where allocating more 9k clusters fails (as reflected in the
>> mbuf failure counters) at a utilization far lower than the configured
>> limits -- in fact, quite close to the number allocated by the driver
>> for its rx ring.  Eventually, network traffic grinds completely to a
>> halt, and if one of the interfaces is administratively downed, it
>> cannot be brought back up again.  There's generally plenty of physical
>> memory free (at least two or three GB).
>
> You have an amd64 kernel running HEAD or 9.x?
>
>> There are no console messages generated to indicate what is going on,
>> and overall UMA usage doesn't look extreme.  I'm guessing that this is
>> a result of kernel memory fragmentation, although I'm a little bit
>> unclear as to how this actually comes about.  I am assuming that this
>> hardware has only limited scatter-gather capability and can't receive
>> a single packet into multiple buffers of a smaller size, which would
>> reduce the requirement for two-and-a-quarter consecutive pages of KVA
>> for each packet.  In actual usage, most of our clients aren't on a
>> jumbo network, so most of the time all the packets will fit into a
>> normal 2k cluster, and we've never observed this issue when the
>> *server* is on a non-jumbo network.
>>
>> Does anyone have suggestions for dealing with this issue?  Will
>> increasing the amount of KVA (to, say, twice physical memory) help
>> things?  It seems to me like a bug that these large packets don't have
>> their own submap to ensure that allocation is always possible when
>> sufficient physical pages are available.
>
> Jumbo pages come directly from the kernel_map, which on amd64 is 512GB,
> so KVA shouldn't be a problem.  Your problem indeed appears to come
> from physical memory fragmentation in pmap.  There is a buddy memory
> allocator at work, but I fear it runs into serious trouble when it has
> to allocate a large number of objects spanning more than 2 contiguous
> pages.  Also, since you're doing NFS serving, almost all memory will be
> in use for file caching.
>
> Running a NIC with jumbo frames enabled gives some interesting
> trade-offs.  Unfortunately most NICs can't have multiple DMA buffer
> sizes on the same receive queue and pick the best size for the incoming
> frame.  That means they need to use the largest jumbo mbuf for all
> receive traffic, even a tiny 40-byte ACK.  The send side is not
> constrained in such a way and tries to use PAGE_SIZE clusters for
> socket buffers whenever it can.
>
> Many, but not all, NICs are able to split a received jumbo frame into
> multiple smaller DMA segments forming an mbuf chain.  The ixgbe
> hardware is capable of doing this, and the driver supports it, but it
> doesn't actively make use of it.
>
> Another issue with many drivers is their inability to deal with mbuf
> allocation failure for their receive DMA ring.  They try to fill it up
> to the maximal ring size and balk on failure.  Rings have become very
> big and usually are a power of two.  The driver could function with a
> partially filled RX ring too, maybe with some performance impact when
> it gets really low.  On every rxeof it tries to refill the ring, so
> when resources become available again it would balance out.  NICs with
> multiple receive queues/rings make this problem even more acute.
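
For illustration, roughly the kind of tolerant refill loop being
described -- an untested sketch with placeholder names, not the actual
ixgbe code:

#include <sys/param.h>
#include <sys/systm.h>
#include <sys/mbuf.h>

/* Placeholder ring state, not the real driver's rx_ring. */
struct rx_ring_sketch {
    int nfree;          /* empty descriptor slots */
    int mbuf_sz;        /* cluster size used by this ring */
    int refill_failed;  /* statistics only */
};

/* Placeholder: load the DMA map and write the RX descriptor. */
void rxr_post_buffer(struct rx_ring_sketch *, struct mbuf *);

static void
rxr_refill(struct rx_ring_sketch *rxr)
{
    struct mbuf *m;

    while (rxr->nfree > 0) {
        m = m_getjcl(M_NOWAIT, MT_DATA, M_PKTHDR, rxr->mbuf_sz);
        if (m == NULL) {
            /* Run short for now; retry on the next rxeof. */
            rxr->refill_failed++;
            break;
        }
        m->m_len = m->m_pkthdr.len = rxr->mbuf_sz;
        rxr_post_buffer(rxr, m);
        rxr->nfree--;
    }
}

The ring would simply run a little short under memory pressure and
catch up on a later rxeof once clusters become available again,
instead of wedging the interface.
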
> A theoretical fix would be to dedicate an entire superpage of 1GB or
> so exclusively to the jumbo frame UMA zone as backing memory.  That
> memory is gone for all other uses though, even if not actually used.
> Allocating the superpage and determining its size would have to be
> done manually by setting loader variables.  I don't see a reasonable
> way to do this with autotuning because it requires advance knowledge
> of the usage patterns.
>
> IMHO the right fix is to strongly discourage use of jumbo clusters
> larger than PAGE_SIZE when the hardware is capable of splitting the
> frame into multiple clusters.  The allocation constraint then is only
> available memory and no longer contiguous pages.  Also the waste
> factor for small frames is much lower.  The performance impact is
> minimal to non-existent.  In addition, drivers shouldn't break down
> when the RX ring can't be filled to the max.
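
The receive side of that doesn't need much.  Roughly an untested
sketch like the one below, where a frame spanning several descriptors
is just chained together and only the first segment keeps the packet
header (descriptor parsing omitted, names are placeholders):

#include <sys/param.h>
#include <sys/mbuf.h>

/*
 * Append one freshly allocated RX segment to the frame being
 * assembled.  Freshly allocated mbufs carry no tags, so clearing
 * M_PKTHDR on the non-head segments is safe here.
 */
static struct mbuf *
rx_chain_append(struct mbuf *head, struct mbuf *m, int seglen)
{
    struct mbuf *tail;

    m->m_len = seglen;
    if (head == NULL) {
        /* First segment of the frame keeps the pkthdr. */
        m->m_pkthdr.len = seglen;
        return (m);
    }
    m->m_flags &= ~M_PKTHDR;
    for (tail = head; tail->m_next != NULL; tail = tail->m_next)
        ;
    tail->m_next = m;
    head->m_pkthdr.len += seglen;
    return (head);
}

On the descriptor marked end-of-packet the caller hands the head to
if_input() as usual, so even a 9k frame never needs anything larger
than a page-sized cluster.
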
> I recently got yelled at for suggesting to remove jumbo clusters
> larger than PAGE_SIZE.  However, your case proves that such jumbo
> frames are indeed their own can of worms and should really only and
> exclusively be used for NICs that have to do jumbo *and* are
> incapable of RX scatter DMA.

I am not strongly opposed to trying the 4k mbuf pool for all larger
sizes.  Garrett, maybe you could try that on your system and see if it
helps; I could envision making this a tunable at some point, perhaps.

Thanks for the input, Andre.

Jack
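
P.S.  For concreteness, an untested sketch of what such a tunable
might look like -- the name "hw.ix.use_4k_clusters" is only a
placeholder, not an existing knob, and the size selection is
simplified:

#include <sys/param.h>
#include <sys/kernel.h>
#include <sys/mbuf.h>

static int ix_use_4k_clusters = 1;
TUNABLE_INT("hw.ix.use_4k_clusters", &ix_use_4k_clusters);

/* Pick the RX cluster size for a given maximum frame size. */
static int
ix_rx_cluster_size(int max_frame)
{
    if (max_frame <= MCLBYTES)
        return (MCLBYTES);       /* the common 2k case */
    if (ix_use_4k_clusters || max_frame <= MJUMPAGESIZE)
        return (MJUMPAGESIZE);   /* rely on RX scatter above a page */
    return (MJUM9BYTES);
}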