From owner-freebsd-net@freebsd.org Sun Jul 29 01:11:56 2018
Date: Sat, 28 Jul 2018 18:11:53 -0700
From: John-Mark Gurney
To: Adrian Chadd
Cc: ryan@ixsystems.com, FreeBSD Net
Subject: Re: 9k jumbo clusters
Message-ID: <20180729011153.GD2884@funkthat.com>
References: <20180727221843.GZ2884@funkthat.com>

Adrian Chadd wrote this message on Sat, Jul 28, 2018 at 13:33 -0700:
> On Fri, 27 Jul 2018 at 15:19, John-Mark Gurney wrote:
>
> > Ryan Moeller wrote this message on Fri, Jul 27, 2018 at 12:45 -0700:
> > > There is a long-standing issue with 9k mbuf jumbo clusters in FreeBSD.
> > > For example:
> > > https://bugs.freebsd.org/bugzilla/show_bug.cgi?id=183381
> > > https://lists.freebsd.org/pipermail/freebsd-net/2013-March/034890.html
> > >
> > > This comment suggests the 16k pool does not have the fragmentation
> > > problem:
> > > https://reviews.freebsd.org/D11560#239462
> > > I'm curious whether that has been confirmed.
> > >
> > > Is anyone working on the pathological case with 9k jumbo clusters in
> > > the physical memory allocator? There was an interesting discussion
> > > started a few years ago, but I'm not sure what ever came of it:
> > > http://docs.freebsd.org/cgi/mid.cgi?21225.20047.947384.390241
> > >
> > > I have seen some work in the direction of avoiding larger-than-page-size
> > > jumbo clusters in 12-CURRENT. Many existing drivers avoid the 9k
> > > cluster size already.
> > > The code for larger cluster sizes in iflib is #ifdef'd out, so it
> > > maxes out at page-size jumbo clusters until "CONTIGMALLOC_WORKS"
> > > (apparently it doesn't).
> > >
> > > With all the changes due to iflib, is there any chance some of this
> > > will get MFC'd to address the serious problem that remains in
> > > 11-STABLE?
> > >
> > > Otherwise, would it be feasible to disable the use of the 9k cluster
> > > pool in at least some of the popular NIC drivers as a solution for
> > > the stable branches?
> > >
> > > Finally, I have studied some of the driver code in 11-STABLE and
> > > posted the gist of my notes in relation to this problem. If anyone
> > > spots a mistake or has something else to contribute, comments on the
> > > gist would be greatly appreciated!
> > > https://gist.github.com/freqlabs/eba9b755f17a223260246becfbb150a1
> >
> > Drivers need to be fixed to use 4k pages instead of clusters. I really
> > hope no one is using a card that can't do 4k pages, or if they are,
> > then they should get a real card that can do scatter/gather on 4k
> > pages for jumbo frames..
>
> Yeah, but it's 2018 and your server has, like, a minimum of a dozen
> million 4k pages.
>
> So if you're doing stuff like lots of network packet kerchunking, why
> not have specialised allocator paths that can do things like "hey,
> always give me 64k physically contiguous pages for storage/mbufs,
> because you know what? they're going to be allocated/freed together
> always."
>
> There was always a race between bus bandwidth, memory bandwidth and
> bus/memory latencies. I'm not currently on the disk/packet-pushing side
> of things, but the last couple of times I was, it was at different
> points in that 4d space, and almost every single time there was a
> benefit from having a couple of specialised allocators so you didn't
> have to try to manage a few dozen million 4k pages based on your
> changing workload.
>
> I enjoy the 4k page size management stuff for my 128MB routers. Your
> 128G server has a lot of 4k pages. It's a bit silly.

We do:

$ vmstat -z
ITEM                  SIZE   LIMIT     USED     FREE        REQ  FAIL  SLEEP
[...]
8192:                 8192,      0,      67,     109, 398203019,    0,     0
16384:               16384,      0,      65,      41,  74103020,    0,     0
32768:               32768,      0,      61,      28,  41981659,    0,     0
65536:               65536,      0,      17,      23,  26127059,    0,     0
[...]
mbuf_jumbo_page:      4096, 509295,       0,      64, 183536214,    0,     0
mbuf_jumbo_9k:        9216, 150902,       0,       0,         0,    0,     0
mbuf_jumbo_16k:      16384,  84882,       0,       0,         0,    0,     0
[...]

And I know you know the problem is that over time memory becomes
fragmented, so if you suddenly need more jumbo frames than you already
have, you're SOL... page-size allocations will always be available...

Fixing drivers to fall back to 4k allocations (or always use 4k
allocations) is a lot simpler than doing magic work to free pages, etc.

-- 
John-Mark Gurney				Voice: +1 415 225 5579
     "All that I will do, has been done, All that I have, has not."
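
For what it's worth, here is a minimal sketch of what that driver-side
fallback could look like. This is not code from the thread or from any
existing driver; example_rxbuf_alloc() is a made-up name, and it assumes
the NIC and its DMA engine can scatter/gather a received frame across
multiple page-size segments:

#include <sys/param.h>
#include <sys/mbuf.h>
#include <net/ethernet.h>

/*
 * Hypothetical RX-buffer allocation helper: pick the smallest standard
 * cluster that fits a full frame, but never request the multi-page
 * 9k/16k zones, so a jumbo frame is received as a chain of page-size
 * clusters instead of one physically contiguous 9k buffer.
 */
static struct mbuf *
example_rxbuf_alloc(int mtu)
{
	struct mbuf *m;
	int size;

	if (mtu + ETHER_HDR_LEN + ETHER_CRC_LEN <= MCLBYTES)
		size = MCLBYTES;	/* a standard 2k cluster is enough */
	else
		size = MJUMPAGESIZE;	/* cap at PAGE_SIZE; never MJUM9BYTES/MJUM16BYTES */

	m = m_getjcl(M_NOWAIT, MT_DATA, M_PKTHDR, size);
	if (m == NULL)
		return (NULL);
	m->m_len = m->m_pkthdr.len = size;
	return (m);
}

With per-descriptor buffers capped at PAGE_SIZE, a 9000-byte frame simply
arrives as a chain of three page-size clusters, which the stack already
handles; nothing ever has to find multiple physically contiguous pages at
allocation time.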