Date: Fri, 8 Mar 2013 00:31:18 -0800
From: Jack Vogel <jfvogel@gmail.com>
To: Andre Oppermann <andre@freebsd.org>
Cc: jfv@freebsd.org, freebsd-net@freebsd.org, Garrett Wollman <wollman@freebsd.org>
Subject: Re: Limits on jumbo mbuf cluster allocation
Message-ID: <CAFOYbc=x7U-s70KvcZJdrVP6v-On716qMi=HN1P2Kj+d_K972A@mail.gmail.com>
In-Reply-To: <51399926.6020201@freebsd.org>
References: <20793.36593.774795.720959@hergotha.csail.mit.edu> <51399926.6020201@freebsd.org>
On Thu, Mar 7, 2013 at 11:54 PM, Andre Oppermann <andre@freebsd.org> wrote:
> On 08.03.2013 08:10, Garrett Wollman wrote:
>
>> I have a machine (actually six of them) with an Intel dual-10G NIC on
>> the motherboard.  Two of them (so far) are connected to a network
>> using jumbo frames, with an MTU a little under 9k, so the ixgbe driver
>> allocates 32,000 9k clusters for its receive rings.  I have noticed,
>> on the machine that is an active NFS server, that it can get into a
>> state where allocating more 9k clusters fails (as reflected in the
>> mbuf failure counters) at a utilization far lower than the configured
>> limits -- in fact, quite close to the number allocated by the driver
>> for its rx ring.  Eventually, network traffic grinds completely to a
>> halt, and if one of the interfaces is administratively downed, it
>> cannot be brought back up again.  There's generally plenty of physical
>> memory free (at least two or three GB).
>
> You have an amd64 kernel running HEAD or 9.x?
>
>> There are no console messages generated to indicate what is going on,
>> and overall UMA usage doesn't look extreme.  I'm guessing that this is
>> a result of kernel memory fragmentation, although I'm a little bit
>> unclear as to how this actually comes about.  I am assuming that this
>> hardware has only limited scatter-gather capability and can't receive
>> a single packet into multiple buffers of a smaller size, which would
>> reduce the requirement for two-and-a-quarter consecutive pages of KVA
>> for each packet.  In actual usage, most of our clients aren't on a
>> jumbo network, so most of the time, all the packets will fit into a
>> normal 2k cluster, and we've never observed this issue when the
>> *server* is on a non-jumbo network.
>>
>> Does anyone have suggestions for dealing with this issue?  Will
>> increasing the amount of KVA (to, say, twice physical memory) help
>> things?  It seems to me like a bug that these large packets don't have
>> their own submap to ensure that allocation is always possible when
>> sufficient physical pages are available.
>
> Jumbo pages come directly from the kernel_map, which on amd64 is 512GB,
> so KVA shouldn't be a problem.  Your problem indeed appears to come from
> physical memory fragmentation in pmap.  There is a buddy memory
> allocator at work, but I fear it runs into serious trouble when it has
> to allocate a large number of objects spanning more than 2 contiguous
> pages.  Also, since you're doing NFS serving, almost all memory will be
> in use for file caching.
>
> Running a NIC with jumbo frames enabled gives some interesting
> trade-offs.  Unfortunately most NICs can't have multiple DMA buffer
> sizes on the same receive queue and pick the best size for the incoming
> frame.  That means they need to use the largest jumbo mbuf for all
> receive traffic, even a tiny 40-byte ACK.  The send side is not
> constrained in such a way and tries to use PAGE_SIZE clusters for
> socket buffers whenever it can.
>
> Many, but not all, NICs are able to split a received jumbo frame into
> multiple smaller DMA segments forming an mbuf chain.  The ixgbe
> hardware is capable of doing this, and the driver supports it, but
> doesn't actively make use of it.
>
> Another issue with many drivers is their inability to deal with mbuf
> allocation failure for their receive DMA ring.  They try to fill it up
> to the maximal ring size and balk on failure.  Rings have become very
> big and usually are a power of two.  The driver could function with a
> partially filled RX ring too, maybe with some performance impact when
> it gets really low.  On every rxeof it tries to refill the ring, so
> when resources become available again it'd balance out.  NICs with
> multiple receive queues/rings make this problem even more acute.
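
For illustration, here is a minimal sketch of the failure-tolerant
refill loop described above; the ring structure, field names, and
overall shape are hypothetical and simplified, not the actual ixgbe
driver code.

    /*
     * Sketch only: a refill loop that tolerates mbuf allocation failure.
     * The rx_ring structure and field names are hypothetical.
     */
    #include <sys/param.h>
    #include <sys/systm.h>
    #include <sys/mbuf.h>

    struct rx_ring {
            int             next_to_refill; /* next empty descriptor slot */
            int             nfree;          /* empty slots left to fill */
            int             nslots;         /* total ring size */
            struct mbuf     **slots;        /* per-slot mbuf pointers */
            uint64_t        refill_failures;
    };

    static void
    rx_ring_refill(struct rx_ring *rxr)
    {
            struct mbuf *m;

            while (rxr->nfree > 0) {
                    /* A 9k cluster needs >2 contiguous pages; PAGE_SIZE would not. */
                    m = m_getjcl(M_NOWAIT, MT_DATA, M_PKTHDR, MJUM9BYTES);
                    if (m == NULL) {
                            /*
                             * Leave the ring partially filled instead of
                             * failing hard; the next rxeof pass will try
                             * again once memory becomes available.
                             */
                            rxr->refill_failures++;
                            break;
                    }
                    m->m_len = m->m_pkthdr.len = MJUM9BYTES;
                    /* A real driver would also program the DMA descriptor here. */
                    rxr->slots[rxr->next_to_refill] = m;
                    rxr->next_to_refill = (rxr->next_to_refill + 1) % rxr->nslots;
                    rxr->nfree--;
            }
    }
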
> A theoretical fix would be to dedicate an entire superpage of 1GB or
> so exclusively to the jumbo frame UMA zone as backing memory.  That
> memory is gone for all other uses though, even if not actually used.
> Allocating the superpage and determining its size would have to be
> done manually by setting loader variables.  I don't see a reasonable
> way to do this with autotuning because it requires advance knowledge
> of the usage patterns.
>
> IMHO the right fix is to strongly discourage use of jumbo clusters
> larger than PAGE_SIZE when the hardware is capable of splitting the
> frame into multiple clusters.  The allocation constraint then is only
> available memory and no longer contiguous pages.  Also, the waste
> factor for small frames is much lower.  The performance impact is
> minimal to non-existent.  In addition, drivers shouldn't break down
> when the RX ring can't be filled to the max.
>
> I recently got yelled at for suggesting to remove jumbo clusters
> > PAGE_SIZE.  However, your case proves that such jumbo frames are
> indeed their own can of worms and should really only and exclusively
> be used for NICs that have to do jumbo *and* are incapable of RX
> scatter DMA.

I am not strongly opposed to trying the 4k mbuf pool for all larger
sizes.  Garrett, maybe if you would try that on your system and see if
that helps you; I could envision making this a tunable at some point,
perhaps?

Thanks for the input, Andre.

Jack
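
For illustration of the PAGE_SIZE-cluster approach discussed above,
here is a minimal sketch of how a receive path could assemble a jumbo
frame from page-sized clusters instead of a single physically
contiguous 9k cluster; the function is hypothetical and simplified, not
the ixgbe implementation.

    /*
     * Sketch only: build a jumbo frame as a chain of PAGE_SIZE clusters
     * (RX scatter), so no allocation ever needs contiguous physical pages.
     */
    #include <sys/param.h>
    #include <sys/systm.h>
    #include <sys/mbuf.h>

    static struct mbuf *
    alloc_jumbo_chain(int frame_len)
    {
            struct mbuf *head, *m, *tail;
            int left, seg;

            /* The first segment carries the packet header. */
            head = tail = m_getjcl(M_NOWAIT, MT_DATA, M_PKTHDR, MJUMPAGESIZE);
            if (head == NULL)
                    return (NULL);

            left = frame_len;
            for (;;) {
                    seg = (left > MJUMPAGESIZE) ? MJUMPAGESIZE : left;
                    tail->m_len = seg;
                    left -= seg;
                    if (left == 0)
                            break;
                    /* Each further segment is only one page-sized cluster. */
                    m = m_getjcl(M_NOWAIT, MT_DATA, 0, MJUMPAGESIZE);
                    if (m == NULL) {
                            m_freem(head);
                            return (NULL);
                    }
                    tail->m_next = m;
                    tail = m;
            }
            head->m_pkthdr.len = frame_len;
            return (head);
    }
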