Date: Thu, 19 Jul 2007 10:35:50 -0700
From: "David Christensen" <davidch@broadcom.com>
To: "Scott Long"
Cc: pyunyh@gmail.com, current@freebsd.org
Subject: RE: Getting/Forcing Greater than 4KB Buffer Allocations

> I'm trying to catch up on this thread, but I'm utterly confused as to
> what you're looking for.  Let's try talking through a few scenarios
> here:

My goal is simple.  I've modified my driver to support up to 8 segments
in an mbuf and I want to verify that it works correctly.  It's simple to
test when every mbuf has the same number of segments, but I want to make
sure my code is robust enough to handle cases where one mbuf is made of
3 segments while the next is made of 5 segments.  The best case would be
to get a distribution of segment counts from the minimum to the maximum
(i.e. 1 to 8).  I'm not trying to test for performance, only for proper
operation under a worst-case load.
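For reference, the instrumentation I'm using to check this looks roughly
like the sketch below.  It's only an outline: sc, map, and m0 are the
usual driver locals, and the tag and histogram names are placeholders
rather than the real bce structures.

/*
 * Sketch: count how many DMA segments each transmit mbuf chain maps
 * to, so I can confirm that everything from 1 to 8 segments actually
 * shows up under load.
 */
bus_dma_segment_t segs[8];	/* up to 8 segments per chain */
int error, nsegs;

error = bus_dmamap_load_mbuf_sg(sc->tx_mbuf_tag, map, m0,
    segs, &nsegs, BUS_DMA_NOWAIT);
if (error == EFBIG) {
	/* Chain mapped to more than 8 segments; defrag and retry once. */
	struct mbuf *m = m_defrag(m0, M_DONTWAIT);
	if (m == NULL)
		return (ENOBUFS);
	m0 = m;
	error = bus_dmamap_load_mbuf_sg(sc->tx_mbuf_tag, map, m0,
	    segs, &nsegs, BUS_DMA_NOWAIT);
}
if (error != 0)
	return (error);

/* Histogram of observed segment counts (1..8), dumped out later. */
sc->tx_seg_histogram[nsegs - 1]++;

So far the histogram only ever shows 2 and 3 segment chains, which is
what I'm trying to change.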
>
> 1. Your hardware has slots for 3 SG elements, and all three MUST be
> filled without exception.  Therefore, you want segments that are 4k,
> 4k, and 1k (or some slight variation of that if the buffer is
> misaligned).  To do this, set the maxsegs to 3 and the maxsegsize to
> 4k.  This will ensure that busdma does no coalescing (more on this
> topic later) and will always give you 3 segments for 9k of contiguous
> buffers.  If the actual buffer winds up being <= 8k, busdma won't
> guarantee that you'll get 3 segments, and you'll have to fake something
> up in your driver.  If the buffer winds up being a fragmented mbuf
> chain, it also won't guarantee that you'll get 3 segments either, but
> that's already handled now via m_defrag().

My hardware supports multiples of 255 buffer descriptors (255, 510, 765,
etc.).  If all mbufs have 1 segment (common for a 1500 MTU) then I can
handle multiples of 255 mbufs.  If all mbufs have 3 segments (common for
a 9000 MTU) then I can handle multiples of 85 mbufs.  If the mbufs have
a varying number of segments (anywhere from 1 to 8) then a varying
number of mbufs can be buffered.  This last case is the most complicated
to handle and I want to make sure my code is robust enough to handle it.
I've found that reducing the system memory from 8GB to 2GB has allowed
me to see both 2 segment and 3 segment mbufs (the former I assume occurs
because of coalescing) but I haven't been able to load the system in a
way that causes any other number of segments to occur.

>
> 2. Your hardware can only handle 4k segments, but is less restrictive
> on the min/max number of segments.  The solution is the same as above.

No practical limit on the segment size.  Anything between 1 byte and
9KB is fine.

>
> 3. Your hardware has slots for 8 SG elements, and all 8 MUST be filled
> without exception.  There's no easy solution for this, as it's a fairly
> bizarre situation.  I'll only discuss it further if you confirm that
> it's actually the case here.

The number of SG elements per mbuf can vary anywhere from 1 to 8.  If
the first mbuf uses 2 slots then there's no problem with the second mbuf
using 8 slots and the third using 4 slots.  The only difficulty comes in
keeping the ring full, since the number of slots used won't always match
the number of slots available.  I think I can handle this correctly, but
it's difficult to test since right now all of the mbufs map to the same
number of slots (which also happens to divide evenly into the total
number of slots available in the ring).
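The bookkeeping I have in mind for the varying case is roughly the
following.  Again, just a sketch; the softc field names are made up and
not the actual bce layout:

#define TX_MAX_SEGS	8	/* worst-case slots per mbuf chain */

/* Admit a new chain only if a worst-case chain still fits. */
static int
tx_ring_has_room(const struct bce_softc *sc)
{

	return ((sc->max_tx_bd - sc->used_tx_bd) >= TX_MAX_SEGS);
}

/* After a successful DMA load, charge only the slots actually used. */
static void
tx_ring_charge(struct bce_softc *sc, int nsegs)
{

	sc->used_tx_bd += nsegs;	/* 1 <= nsegs <= TX_MAX_SEGS */
}

That way the ring can't overflow even though the slot count per mbuf
varies, at the cost of stalling the queue slightly early when fewer than
8 slots remain.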
>
> As for coalescing segments, I'm considering a new busdma back-end that
> greatly streamlines loads by eliminating cycle-consuming tasks like
> segment coalescing.  The original justification for coalescing was that
> DMA engines operated faster with fewer segments.  That might still be
> true, but the extra host CPU cycles and cache-line misses probably
> result in a net loss.  I'm also going to axe bounce-buffer support
> since it bloats the I-cache.  The target for this new back-end is
> drivers for hardware that doesn't need these services and that is also
> sensitive to the amount of host CPU cycles being consumed, i.e. modern
> 1Gb and 10Gb adapters.  The question I have is whether this new
> back-end should be accessible directly through yet another
> bus_dmamap_load_foo variant that the drivers need to know specifically
> about, or indirectly and automatically via the existing
> bus_dmamap_load_foo variants.  The tradeoff is further API pollution
> vs. the opportunity for even more efficiency through no indirect
> function calls and no cache misses from accessing the busdma tag.  I
> don't like API pollution since it makes it harder to maintain code, but
> the opportunity for the best performance possible is also appealing.

Others have reported that single, larger segments provide better
performance than multiple, smaller segments.  (Kip Macy recently
forwarded me a patch to test which shows a performance improvement on
the cxgb adapter when this is used.)  I haven't done enough performance
testing on bce to know whether this helps overall, hurts, or makes no
overall difference.

One thing I am interested in is finding a way to allocate receive mbufs
such that I can split the headers into a single buffer and then place
the data into one or more page-aligned buffers, similar to what a
transmit mbuf looks like.  Is there any way to support that in the
current architecture?
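What I'm picturing is something along these lines.  Purely a sketch: I'm
assuming the page-sized jumbo cluster allocator (m_getjcl() with
MJUMPAGESIZE) can be used for the payload buffer, the names are made up,
and the length/pkthdr setup is trimmed for brevity:

/*
 * Allocate a small mbuf for the protocol headers and a separate
 * page-sized cluster for the payload, then chain them so the payload
 * buffer stays page aligned.
 */
static struct mbuf *
rx_alloc_split(void)
{
	struct mbuf *m_head, *m_pay;

	/* Small buffer that the header portion of the frame lands in. */
	m_head = m_gethdr(M_DONTWAIT, MT_DATA);
	if (m_head == NULL)
		return (NULL);

	/* Page-sized cluster for the payload. */
	m_pay = m_getjcl(M_DONTWAIT, MT_DATA, 0, MJUMPAGESIZE);
	if (m_pay == NULL) {
		m_freem(m_head);
		return (NULL);
	}

	m_head->m_next = m_pay;
	return (m_head);
}

The hardware would still have to know to break the incoming frame at the
header boundary, of course; the sketch only covers the host side of the
allocation.

Dave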