Date: Sun, 29 Jul 2018 21:38:20 +0000
From: Rick Macklem <rmacklem@uoguelph.ca>
To: Adrian Chadd <adrian.chadd@gmail.com>, "ryan@ixsystems.com" <ryan@ixsystems.com>, FreeBSD Net <freebsd-net@freebsd.org>
Subject: Re: 9k jumbo clusters
Message-ID: <YTOPR0101MB0953AE665C73D96D0B1E6BBADD280@YTOPR0101MB0953.CANPRD01.PROD.OUTLOOK.COM>
In-Reply-To: <CAJ-VmomHQ%2BzcJ%2BHXAjMg9aS1RPZsdHy0tYjdKzjpwrUY%2B05NiQ@mail.gmail.com>
References: <EBDE6EDD-D875-43D8-8D65-1F1344A6B817@ixsystems.com> <20180727221843.GZ2884@funkthat.com> <CAJ-VmomHQ%2BzcJ%2BHXAjMg9aS1RPZsdHy0tYjdKzjpwrUY%2B05NiQ@mail.gmail.com>
Adrian Chadd wrote:
>John-Mark Gurney wrote:
[stuff snipped]
>>
>> Drivers need to be fixed to use 4k pages instead of cluster. I really hope
>> no one is using a card that can't do 4k pages, or if they are, then they
>> should get a real card that can do scatter/gather on 4k pages for jumbo
>> frames..
>
>Yeah but it's 2018 and your server has like minimum a dozen million 4k
>pages.
>
>So if you're doing stuff like lots of network packet kerchunking why not
>have specialised allocator paths that can do things like "hey, always give
>me 64k physical contig pages for storage/mbufs because you know what?
>they're going to be allocated/freed together always."
>
>There was always a race between bus bandwidth, memory bandwidth and
>bus/memory latencies. I'm not currently on the disk/packet pushing side of
>things, but the last couple times I was it was at different points in that
>4d space and almost every single time there was a benefit from having a
>couple of specialised allocators so you didn't have to try and manage a few
>dozen million 4k pages based on your changing workload.
>
>I enjoy the 4k page size management stuff for my 128MB routers. Your 128G
>server has a lot of 4k pages. It's a bit silly.

Here's my NFS guy perspective.

I do think 9K mbuf clusters should go away. I'll note that I once coded NFS so
it would use 4K mbuf clusters for the big RPCs (write requests and read replies),
and I actually could get the mbuf cluster pool fragmented to the point it stopped
working on a small machine, so it is possible (although not likely) to fragment
even a 2K/4K mix.

For me, send and receive are two very different cases:
- For sending a large NFS RPC (let's say a reply to a 64K read), the NFS code
  will generate a list of 33 2K mbuf clusters. If the net interface doesn't do TSO,
  this is probably fine, since tcp_output() will end up busting it up into a bunch
  of TCP segments using the list of mbuf clusters, with TCP/IP headers added for
  each segment, etc...
- If the net interface does TSO, this long list goes down to the net driver and
  uses 34-35 ring entries to send it (the driver typically adds at least one segment
  for the MAC header). If the driver isn't buggy and the net chip supports lots of
  transmit ring entries, this works ok, but...
- If there was a 64K supercluster, the NFS code could easily use that for the
  64K of data, and the TSO-enabled net interface would use 2 transmit ring entries
  (one for the MAC/TCP/NFS header and one for the 64K of data). If the net
  interface can't handle a TSO segment over 65535 bytes, it will end up getting
  2 TSO segments from tcp_output(), but that is still a lot less than 35.

I don't know enough about net hardware to know when/if this will help perf., but
it seems that it might, at least for some chipsets?

For receive, it seems that a 64K mbuf cluster is overkill for jumbo packets, but as
others have noted, they won't be allocated for long unless packets arrive out of
order, at least for NFS. (Other apps might not read the socket for a while, so the
data might sit in the socket receive queue for a while.)

I chose 64K since that is what most net interfaces can handle for TSO these days.
(If that limit will soon be larger, I think these clusters should be even larger too,
but all of them the same size to avoid fragmentation.) For the send case for NFS,
the pool wouldn't even need to be very large, since the clusters get freed as soon
as the net interface transmits the TSO segment.
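To put rough numbers on the ring-entry counts above, here's a trivial
back-of-the-envelope sketch. The one-ring-entry-per-mbuf assumption is mine
and is the worst case; drivers may coalesce adjacent segments:

/* Ring-entry counts for a 64K NFS read reply, assuming one transmit
 * ring entry (DMA segment) per mbuf (worst case). */
#include <stdio.h>

int
main(void)
{
	unsigned int data = 64 * 1024;		/* 64K of read data */
	unsigned int mcl = 2 * 1024;		/* ordinary 2K mbuf cluster */
	unsigned int nclusters = data / mcl;	/* 32 data clusters */

	/* 32 data clusters + 1 RPC/TCP/IP header mbuf + 1-2 for the MAC header. */
	printf("2K clusters:      %u-%u ring entries\n", nclusters + 2, nclusters + 3);
	/* One 64K supercluster + 1 header mbuf. */
	printf("64K supercluster: 2 ring entries\n");
	return (0);
}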
For NFS, it could easily call mget_supercl() and then fall back on the current
code using 2K mbuf clusters if mget_supercl() failed, so a small pool would be
fine for the NFS send side (see the rough sketch in the ps below).
I'd like to see a pool of 64K or larger mbuf clusters for the send side.

For the receive side, I'll let others figure out the best solution (4K or larger
jumbo clusters). I do think anything larger than 4K needs a separate allocation
pool to avoid fragmentation.

(I don't know, but I'd guess iSCSI could use them as well?)

rick
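ps: To make the send-side fallback concrete, here's a rough sketch of what I
mean. mget_supercl() is the hypothetical 64K supercluster allocator being
proposed (it doesn't exist in the tree), and nfsm_getdata() is just an
illustrative name; setting m_len and copying the read data in is left out.

#include <sys/param.h>
#include <sys/systm.h>
#include <sys/mbuf.h>

/*
 * Allocate mbuf(s) to hold "len" bytes of NFS read/write data.
 * Try the (hypothetical) 64K supercluster first; since the pool may be
 * small, use M_NOWAIT and fall back on the current chain of 2K clusters.
 */
static struct mbuf *
nfsm_getdata(int len)
{
	struct mbuf *m, *m2, *top;
	int left;

	if (len > MJUMPAGESIZE && (m = mget_supercl(M_NOWAIT)) != NULL)
		return (m);	/* all the data in one TSO-friendly segment */

	/* Current code path: a chain of ordinary 2K clusters. */
	top = m = m_getcl(M_WAITOK, MT_DATA, 0);
	for (left = len - MCLBYTES; left > 0; left -= MCLBYTES) {
		m2 = m_getcl(M_WAITOK, MT_DATA, 0);
		m->m_next = m2;
		m = m2;
	}
	return (top);
}

With M_NOWAIT the fallback happens automatically whenever the supercluster
pool is empty, which is why I don't think that pool needs to be very large.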