Date: Thu, 19 Jul 2007 10:35:50 -0700
From: "David Christensen" <davidch@broadcom.com>
To: "Scott Long" <scottl@samsco.org>
Cc: pyunyh@gmail.com, current@freebsd.org
Subject: RE: Getting/Forcing Greater than 4KB Buffer Allocations
Message-ID: <09BFF2FA5EAB4A45B6655E151BBDD9030483F728@NT-IRVA-0750.brcm.ad.broadcom.com>
In-Reply-To: <469EEF02.7000804@samsco.org>
References: <09BFF2FA5EAB4A45B6655E151BBDD9030483F161@NT-IRVA-0750.brcm.ad.broadcom.com>
 <20070718021839.GA37935@cdnetworks.co.kr>
 <09BFF2FA5EAB4A45B6655E151BBDD9030483F437@NT-IRVA-0750.brcm.ad.broadcom.com>
 <20070719002218.GA42405@cdnetworks.co.kr>
 <09BFF2FA5EAB4A45B6655E151BBDD9030483F5D2@NT-IRVA-0750.brcm.ad.broadcom.com>
 <469EEF02.7000804@samsco.org>
> I'm trying to catch up on this thread, but I'm utterly confused as to
> what you're looking for.  Let's try talking through a few scenarios
> here:

My goal is simple.  I've modified my driver to support up to 8 segments
in an mbuf and I want to verify that it works correctly.  It's simple to
test when every mbuf has the same number of segments, but I want to make
sure my code is robust enough to handle cases where one mbuf is made of
3 segments while the next is made of 5 segments.  Ideally I would see a
distribution of segment counts from the min to the max (i.e. 1 to 8).
I'm not trying to test for performance, only for proper operation under
a worst case load.

> 1. Your hardware has slots for 3 SG elements, and all three MUST be
> filled without exception.  Therefore, you want segments that are 4k,
> 4k, and 1k (or some slight variation of that if the buffer is
> misaligned).  To do this, set the maxsegs to 3 and the maxsegsize to
> 4k.  This will ensure that busdma does no coalescing (more on this
> topic later) and will always give you 3 segments for 9k of contiguous
> buffers.  If the actual buffer winds up being <= 8k, busdma won't
> guarantee that you'll get 3 segments, and you'll have to fake
> something up in your driver.  If the buffer winds up being a
> fragmented mbuf chain, it also won't guarantee that you'll get 3
> segments either, but that's already handled now via m_defrag().

My hardware supports multiples of 255 buffer descriptors (255, 510, 765,
etc.).  If every mbuf has 1 segment (common for 1500 MTU) then I can
handle multiples of 255 mbufs.  If every mbuf has 3 segments (common for
9000 MTU) then I can handle multiples of 85 mbufs.  If the mbufs have a
varying number of segments (anywhere from 1 to 8) then a varying number
of mbufs can be buffered.  This last case is the most complicated to
handle and I want to make sure my code is robust enough to handle it.
I've found that reducing the system memory from 8GB to 2GB lets me see
both 2 segment and 3 segment mbufs (the former I assume occurs because
of coalescing), but I haven't been able to load the system in a way that
causes any other number of segments to occur.

> 2. Your hardware can only handle 4k segments, but is less restrictive
> on the min/max number of segments.  The solution is the same as above.

No practical limit on the segment size.  Anything between 1 byte and 9KB
is fine.

> 3. Your hardware has slots for 8 SG elements, and all 8 MUST be filled
> without exception.  There's no easy solution for this, as it's a
> fairly bizarre situation.  I'll only discuss it further if you confirm
> that it's actually the case here.

The number of SG elements per mbuf can vary anywhere from 1 to 8.  If
the first mbuf uses 2 slots there's no problem with the second mbuf
using 8 slots and the third using 4 slots.  The only difficulty comes in
keeping the ring full, since the number of slots used won't always match
the number of slots available.  I think I can handle this correctly, but
it's difficult to test since right now every mbuf maps to the same
number of slots (which also happens to divide evenly into the total
number of slots available in the ring).
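To make the mixed-segment case concrete, what I'm testing looks roughly
like the following (the softc fields and the error handling are
simplified for the example and don't match the driver verbatim):

	/*
	 * Attach path: allow up to 8 segments per TX mbuf, with no
	 * practical per-segment size limit beyond the 9KB frame itself.
	 */
	error = bus_dma_tag_create(
	    NULL,			/* parent */
	    1, 0,			/* alignment, boundary */
	    BUS_SPACE_MAXADDR,		/* lowaddr */
	    BUS_SPACE_MAXADDR,		/* highaddr */
	    NULL, NULL,			/* filter, filterarg */
	    MJUM9BYTES,			/* maxsize */
	    8,				/* nsegments */
	    MJUM9BYTES,			/* maxsegsize */
	    0,				/* flags */
	    NULL, NULL,			/* lockfunc, lockfuncarg */
	    &sc->tx_mbuf_tag);

	/*
	 * Transmit path: the segment count varies per mbuf, so it has
	 * to be checked against the free BDs in the ring before the
	 * frame is committed.  "map" and "m" come from the usual
	 * per-packet bookkeeping.
	 */
	bus_dma_segment_t segs[8];
	int nsegs;

	error = bus_dmamap_load_mbuf_sg(sc->tx_mbuf_tag, map, m,
	    segs, &nsegs, BUS_DMA_NOWAIT);
	if (error == EFBIG) {
		/* More than 8 segments: compact the chain and retry. */
		m = m_defrag(m, M_DONTWAIT);
		if (m == NULL)
			return (ENOBUFS);
		error = bus_dmamap_load_mbuf_sg(sc->tx_mbuf_tag, map, m,
		    segs, &nsegs, BUS_DMA_NOWAIT);
	}
	if (error != 0)
		return (error);
	if (nsegs > sc->free_tx_bd) {
		/* Not enough slots left; unload and try again later. */
		bus_dmamap_unload(sc->tx_mbuf_tag, map);
		return (ENOBUFS);
	}

This is the path I'd like to see exercised with nsegs bouncing around
between 1 and 8 rather than sitting at a constant value.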
> As for coalescing segments, I'm considering a new busdma back-end that
> greatly streamlines loads by eliminating cycle-consuming tasks like
> segment coalescing.  The original justification for coalescing was
> that DMA engines operated faster with fewer segments.  That might
> still be true, but the extra host CPU cycles and cache-line misses
> probably result in a net loss.  I'm also going to axe bounce-buffer
> support since it bloats the I-cache.  The target for this new back-end
> is drivers for hardware that doesn't need these services and that is
> also sensitive to the amount of host CPU cycles being consumed, i.e.
> modern 1Gb and 10Gb adapters.  The question I have is whether this new
> back-end should be accessible directly through yet another
> bus_dmamap_load_foo variant that the drivers need to know specifically
> about, or indirectly and automatically via the existing
> bus_dmamap_load_foo variants.  The tradeoff is further API pollution
> vs. the opportunity for even more efficiency through no indirect
> function calls and no cache misses from accessing the busdma tag.  I
> don't like API pollution since it makes it harder to maintain code,
> but the opportunity for the best performance possible is also
> appealing.

Others have reported that single, larger segments provide better
performance than multiple, smaller segments.  (Kip Macy recently
forwarded me a patch to test which shows a performance improvement on
the cxgb adapter when this is used.)  I haven't done enough performance
testing on bce to know whether this helps overall, hurts, or makes no
difference.

One thing I am interested in is finding a way to allocate receive mbufs
such that I can split the header into a single buffer and then place the
data into one or more page-aligned buffers, similar to what a transmit
mbuf looks like.  Is there any way to support that in the current
architecture?

Dave
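P.S.  To be a little more concrete about the receive-side split, the
kind of chain I'd like to end up with looks something like this (the
standard 2KB cluster is just a stand-in for a page-aligned payload
buffer, and the names are made up for the example):

	struct mbuf *m_hdr, *m_pay;

	/* Small mbuf that receives only the frame headers. */
	m_hdr = m_gethdr(M_DONTWAIT, MT_DATA);

	/* Separate cluster that receives the payload. */
	m_pay = m_getcl(M_DONTWAIT, MT_DATA, 0);

	if (m_hdr != NULL && m_pay != NULL) {
		m_hdr->m_len = 0;	/* lengths filled in at RX time */
		m_pay->m_len = 0;
		m_hdr->m_next = m_pay;	/* header buffer, then payload */
	}

Each buffer would then get its own receive BD so the chip could DMA the
headers and the data separately.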