Date: Thu, 19 Jul 2007 10:35:50 -0700
From: "David Christensen" <davidch@broadcom.com>
To: "Scott Long" <scottl@samsco.org>
Cc: pyunyh@gmail.com, current@freebsd.org
Subject: RE: Getting/Forcing Greater than 4KB Buffer Allocations
Message-ID: <09BFF2FA5EAB4A45B6655E151BBDD9030483F728@NT-IRVA-0750.brcm.ad.broadcom.com>
In-Reply-To: <469EEF02.7000804@samsco.org>
References: <09BFF2FA5EAB4A45B6655E151BBDD9030483F161@NT-IRVA-0750.brcm.ad.broadcom.com>
 <20070718021839.GA37935@cdnetworks.co.kr>
 <09BFF2FA5EAB4A45B6655E151BBDD9030483F437@NT-IRVA-0750.brcm.ad.broadcom.com>
 <20070719002218.GA42405@cdnetworks.co.kr>
 <09BFF2FA5EAB4A45B6655E151BBDD9030483F5D2@NT-IRVA-0750.brcm.ad.broadcom.com>
 <469EEF02.7000804@samsco.org>
> I'm trying to catch up on this thread, but I'm utterly confused as to
> what you're looking for.  Let's try talking through a few scenarios
> here:

My goal is simple.  I've modified my driver to support up to 8 segments
in an mbuf and I want to verify that it works correctly.  It's simple to
test when every mbuf has the same number of segments, but I want to make
sure my code is robust enough to handle cases where one mbuf is made of
3 segments while the next is made of 5 segments.  Ideally I would see a
distribution of segment counts from the min to the max (i.e. 1 to 8).
I'm not trying to test for performance, only for proper operation under
a worst case load.

> 1. Your hardware has slots for 3 SG elements, and all three MUST be
> filled without exception.  Therefore, you want segments that are 4k,
> 4k, and 1k (or some slight variation of that if the buffer is
> misaligned).  To do this, set the maxsegs to 3 and the maxsegsize to
> 4k.  This will ensure that busdma does no coalescing (more on this
> topic later) and will always give you 3 segments for 9k of contiguous
> buffers.  If the actual buffer winds up being <= 8k, busdma won't
> guarantee that you'll get 3 segments, and you'll have to fake
> something up in your driver.  If the buffer winds up being a
> fragmented mbuf chain, it also won't guarantee that you'll get 3
> segments either, but that's already handled now via m_defrag().

My hardware supports multiples of 255 buffer descriptors (255, 510, 765,
etc.).  If every mbuf has 1 segment (common for 1500 MTU) then I can
handle multiples of 255 mbufs.  If every mbuf has 3 segments (common for
9000 MTU) then I can handle multiples of 85 mbufs.  If the mbufs have a
varying number of segments (anywhere from 1 to 8) then a varying number
of mbufs can be buffered.  This last case is the most complicated to
handle and I want to make sure my code is robust enough to handle it.
I've found that reducing the system memory from 8GB to 2GB lets me see
both 2 segment and 3 segment mbufs (the former I assume occurs because
of coalescing), but I haven't been able to load the system in a way that
causes any other number of segments to occur.

> 2. Your hardware can only handle 4k segments, but is less restrictive
> on the min/max number of segments.  The solution is the same as above.

No practical limit on the segment size.  Anything between 1 byte and 9KB
is fine.

> 3. Your hardware has slots for 8 SG elements, and all 8 MUST be filled
> without exception.  There's no easy solution for this, as it's a
> fairly bizarre situation.  I'll only discuss it further if you confirm
> that it's actually the case here.

The number of SG elements per mbuf can vary anywhere from 1 to 8.  If
the first mbuf uses 2 slots there's no problem with the second mbuf
using 8 slots and the third using 4 slots.  The only difficulty comes in
keeping the ring full, since the number of slots used won't always match
the number of slots available.  I think I can handle this correctly, but
it's difficult to test since right now every mbuf maps to the same
number of slots (which also happens to divide evenly into the total
number of slots available in the ring).
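To make the mixed-segment case concrete, what I'm testing looks roughly
like the following (the softc fields and the error handling are
simplified for the example and don't match the driver verbatim):

	/*
	 * Attach path: allow up to 8 segments per TX mbuf, with no
	 * practical per-segment size limit beyond the 9KB frame itself.
	 */
	error = bus_dma_tag_create(
	    NULL,			/* parent */
	    1, 0,			/* alignment, boundary */
	    BUS_SPACE_MAXADDR,		/* lowaddr */
	    BUS_SPACE_MAXADDR,		/* highaddr */
	    NULL, NULL,			/* filter, filterarg */
	    MJUM9BYTES,			/* maxsize */
	    8,				/* nsegments */
	    MJUM9BYTES,			/* maxsegsize */
	    0,				/* flags */
	    NULL, NULL,			/* lockfunc, lockfuncarg */
	    &sc->tx_mbuf_tag);

	/*
	 * Transmit path: the segment count varies per mbuf, so it has
	 * to be checked against the free BDs in the ring before the
	 * frame is committed.  "map" and "m" come from the usual
	 * per-packet bookkeeping.
	 */
	bus_dma_segment_t segs[8];
	int nsegs;

	error = bus_dmamap_load_mbuf_sg(sc->tx_mbuf_tag, map, m,
	    segs, &nsegs, BUS_DMA_NOWAIT);
	if (error == EFBIG) {
		/* More than 8 segments: compact the chain and retry. */
		m = m_defrag(m, M_DONTWAIT);
		if (m == NULL)
			return (ENOBUFS);
		error = bus_dmamap_load_mbuf_sg(sc->tx_mbuf_tag, map, m,
		    segs, &nsegs, BUS_DMA_NOWAIT);
	}
	if (error != 0)
		return (error);
	if (nsegs > sc->free_tx_bd) {
		/* Not enough slots left; unload and try again later. */
		bus_dmamap_unload(sc->tx_mbuf_tag, map);
		return (ENOBUFS);
	}

This is the path I'd like to see exercised with nsegs bouncing around
between 1 and 8 rather than sitting at a constant value.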
> As for coalescing segments, I'm considering a new busdma back-end that
> greatly streamlines loads by eliminating cycle-consuming tasks like
> segment coalescing.  The original justification for coalescing was
> that DMA engines operated faster with fewer segments.  That might
> still be true, but the extra host CPU cycles and cache-line misses
> probably result in a net loss.  I'm also going to axe bounce-buffer
> support since it bloats the I-cache.  The target for this new back-end
> is drivers for hardware that doesn't need these services and that is
> also sensitive to the amount of host CPU cycles being consumed, i.e.
> modern 1Gb and 10Gb adapters.  The question I have is whether this new
> back-end should be accessible directly through yet another
> bus_dmamap_load_foo variant that the drivers need to know specifically
> about, or indirectly and automatically via the existing
> bus_dmamap_load_foo variants.  The tradeoff is further API pollution
> vs. the opportunity for even more efficiency through no indirect
> function calls and no cache misses from accessing the busdma tag.  I
> don't like API pollution since it makes it harder to maintain code,
> but the opportunity for the best performance possible is also
> appealing.

Others have reported that single, larger segments provide better
performance than multiple, smaller segments.  (Kip Macy recently
forwarded me a patch to test which shows a performance improvement on
the cxgb adapter when this is used.)  I haven't done enough performance
testing on bce to know whether this helps overall, hurts, or makes no
difference.

One thing I am interested in is finding a way to allocate receive mbufs
such that I can split the header into a single buffer and then place the
data into one or more page-aligned buffers, similar to what a transmit
mbuf looks like.  Is there any way to support that in the current
architecture?

Dave
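P.S.  To be a little more concrete about the receive-side split, the
kind of chain I'd like to end up with looks something like this (the
standard 2KB cluster is just a stand-in for a page-aligned payload
buffer, and the names are made up for the example):

	struct mbuf *m_hdr, *m_pay;

	/* Small mbuf that receives only the frame headers. */
	m_hdr = m_gethdr(M_DONTWAIT, MT_DATA);

	/* Separate cluster that receives the payload. */
	m_pay = m_getcl(M_DONTWAIT, MT_DATA, 0);

	if (m_hdr != NULL && m_pay != NULL) {
		m_hdr->m_len = 0;	/* lengths filled in at RX time */
		m_pay->m_len = 0;
		m_hdr->m_next = m_pay;	/* header buffer, then payload */
	}

Each buffer would then get its own receive BD so the chip could DMA the
headers and the data separately.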