Date: Thu, 19 Jul 2007 10:35:50 -0700
From: "David Christensen" <davidch@broadcom.com>
To: "Scott Long"
Cc: pyunyh@gmail.com, current@freebsd.org
Subject: RE: Getting/Forcing Greater than 4KB Buffer Allocations

> I'm trying to catch up on this thread, but I'm utterly confused as to
> what you're looking for.  Let's try talking through a few scenarios
> here:

My goal is simple.  I've modified my driver to support up to 8 segments
in an mbuf and I want to verify that it works correctly.  It's simple to
test when every mbuf has the same number of segments, but I want to make
sure my code is robust enough to handle cases where one mbuf is made of
3 segments while the next is made of 5 segments.  The best case would be
to get a distribution of segment counts from the minimum to the maximum
(i.e. 1 to 8).  I'm not trying to test for performance, only for proper
operation under a worst-case load.
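For reference, the instrumentation I'm using to check this looks roughly
like the sketch below.  It's only an outline: sc, map, and m0 are the
usual driver locals, and the tag and histogram names are placeholders
rather than the real bce structures.

/*
 * Sketch: count how many DMA segments each transmit mbuf chain maps
 * to, so I can confirm that everything from 1 to 8 segments actually
 * shows up under load.
 */
bus_dma_segment_t segs[8];	/* up to 8 segments per chain */
int error, nsegs;

error = bus_dmamap_load_mbuf_sg(sc->tx_mbuf_tag, map, m0,
    segs, &nsegs, BUS_DMA_NOWAIT);
if (error == EFBIG) {
	/* Chain mapped to more than 8 segments; defrag and retry once. */
	struct mbuf *m = m_defrag(m0, M_DONTWAIT);
	if (m == NULL)
		return (ENOBUFS);
	m0 = m;
	error = bus_dmamap_load_mbuf_sg(sc->tx_mbuf_tag, map, m0,
	    segs, &nsegs, BUS_DMA_NOWAIT);
}
if (error != 0)
	return (error);

/* Histogram of observed segment counts (1..8), dumped out later. */
sc->tx_seg_histogram[nsegs - 1]++;

So far the histogram only ever shows 2 and 3 segment chains, which is
what I'm trying to change.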
>
> 1. Your hardware has slots for 3 SG elements, and all three MUST be
> filled without exception.  Therefore, you want segments that are 4k,
> 4k, and 1k (or some slight variation of that if the buffer is
> misaligned).  To do this, set the maxsegs to 3 and the maxsegsize to
> 4k.  This will ensure that busdma does no coalescing (more on this
> topic later) and will always give you 3 segments for 9k of contiguous
> buffers.  If the actual buffer winds up being <= 8k, busdma won't
> guarantee that you'll get 3 segments, and you'll have to fake something
> up in your driver.  If the buffer winds up being a fragmented mbuf
> chain, it also won't guarantee that you'll get 3 segments either, but
> that's already handled now via m_defrag().

My hardware supports multiples of 255 buffer descriptors (255, 510, 765,
etc.).  If all mbufs have 1 segment (common for a 1500 MTU) then I can
handle multiples of 255 mbufs.  If all mbufs have 3 segments (common for
a 9000 MTU) then I can handle multiples of 85 mbufs.  If the mbufs have
a varying number of segments (anywhere from 1 to 8) then a varying
number of mbufs can be buffered.  This last case is the most complicated
to handle and I want to make sure my code is robust enough to handle it.
I've found that reducing the system memory from 8GB to 2GB has allowed
me to see both 2 segment and 3 segment mbufs (the former I assume occurs
because of coalescing) but I haven't been able to load the system in a
way that causes any other number of segments to occur.

>
> 2. Your hardware can only handle 4k segments, but is less restrictive
> on the min/max number of segments.  The solution is the same as above.

No practical limit on the segment size.  Anything between 1 byte and
9KB is fine.

>
> 3. Your hardware has slots for 8 SG elements, and all 8 MUST be filled
> without exception.  There's no easy solution for this, as it's a fairly
> bizarre situation.  I'll only discuss it further if you confirm that
> it's actually the case here.

The number of SG elements per mbuf can vary anywhere from 1 to 8.  If
the first mbuf uses 2 slots then there's no problem with the second mbuf
using 8 slots and the third using 4 slots.  The only difficulty comes in
keeping the ring full, since the number of slots used won't always match
the number of slots available.  I think I can handle this correctly, but
it's difficult to test since right now all of the mbufs map to the same
number of slots (which also happens to divide evenly into the total
number of slots available in the ring).
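The bookkeeping I have in mind for the varying case is roughly the
following.  Again, just a sketch; the softc field names are made up and
not the actual bce layout:

#define TX_MAX_SEGS	8	/* worst-case slots per mbuf chain */

/* Admit a new chain only if a worst-case chain still fits. */
static int
tx_ring_has_room(const struct bce_softc *sc)
{

	return ((sc->max_tx_bd - sc->used_tx_bd) >= TX_MAX_SEGS);
}

/* After a successful DMA load, charge only the slots actually used. */
static void
tx_ring_charge(struct bce_softc *sc, int nsegs)
{

	sc->used_tx_bd += nsegs;	/* 1 <= nsegs <= TX_MAX_SEGS */
}

That way the ring can't overflow even though the slot count per mbuf
varies, at the cost of stalling the queue slightly early when fewer than
8 slots remain.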
>
> As for coalescing segments, I'm considering a new busdma back-end that
> greatly streamlines loads by eliminating cycle-consuming tasks like
> segment coalescing.  The original justification for coalescing was that
> DMA engines operated faster with fewer segments.  That might still be
> true, but the extra host CPU cycles and cache-line misses probably
> result in a net loss.  I'm also going to axe bounce-buffer support
> since it bloats the I-cache.  The target for this new back-end is
> drivers for hardware that doesn't need these services and that is also
> sensitive to the amount of host CPU cycles being consumed, i.e. modern
> 1Gb and 10Gb adapters.  The question I have is whether this new
> back-end should be accessible directly through yet another
> bus_dmamap_load_foo variant that the drivers need to know specifically
> about, or indirectly and automatically via the existing
> bus_dmamap_load_foo variants.  The tradeoff is further API pollution
> vs. the opportunity for even more efficiency through no indirect
> function calls and no cache misses from accessing the busdma tag.  I
> don't like API pollution since it makes it harder to maintain code, but
> the opportunity for the best performance possible is also appealing.

Others have reported that single, larger segments provide better
performance than multiple, smaller segments.  (Kip Macy recently
forwarded me a patch to test which shows a performance improvement on
the cxgb adapter when this is used.)  I haven't done enough performance
testing on bce to know whether this helps overall, hurts, or makes no
overall difference.

One thing I am interested in is finding a way to allocate receive mbufs
such that I can split the headers into a single buffer and then place
the data into one or more page-aligned buffers, similar to what a
transmit mbuf looks like.  Is there any way to support that in the
current architecture?
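What I'm picturing is something along these lines.  Purely a sketch: I'm
assuming the page-sized jumbo cluster allocator (m_getjcl() with
MJUMPAGESIZE) can be used for the payload buffer, the names are made up,
and the length/pkthdr setup is trimmed for brevity:

/*
 * Allocate a small mbuf for the protocol headers and a separate
 * page-sized cluster for the payload, then chain them so the payload
 * buffer stays page aligned.
 */
static struct mbuf *
rx_alloc_split(void)
{
	struct mbuf *m_head, *m_pay;

	/* Small buffer that the header portion of the frame lands in. */
	m_head = m_gethdr(M_DONTWAIT, MT_DATA);
	if (m_head == NULL)
		return (NULL);

	/* Page-sized cluster for the payload. */
	m_pay = m_getjcl(M_DONTWAIT, MT_DATA, 0, MJUMPAGESIZE);
	if (m_pay == NULL) {
		m_freem(m_head);
		return (NULL);
	}

	m_head->m_next = m_pay;
	return (m_head);
}

The hardware would still have to know to break the incoming frame at the
header boundary, of course; the sketch only covers the host side of the
allocation.

Dave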