Date: Thu, 19 Jul 2007 00:56:34 -0400
From: Scott Long <scottl@samsco.org>
To: David Christensen <davidch@broadcom.com>
Cc: pyunyh@gmail.com, current@freebsd.org
Subject: Re: Getting/Forcing Greater than 4KB Buffer Allocations
Message-ID: <469EEF02.7000804@samsco.org>
In-Reply-To: <09BFF2FA5EAB4A45B6655E151BBDD9030483F5D2@NT-IRVA-0750.brcm.ad.broadcom.com>
References: <09BFF2FA5EAB4A45B6655E151BBDD9030483F161@NT-IRVA-0750.brcm.ad.broadcom.com>
 <20070718021839.GA37935@cdnetworks.co.kr>
 <09BFF2FA5EAB4A45B6655E151BBDD9030483F437@NT-IRVA-0750.brcm.ad.broadcom.com>
 <20070719002218.GA42405@cdnetworks.co.kr>
 <09BFF2FA5EAB4A45B6655E151BBDD9030483F5D2@NT-IRVA-0750.brcm.ad.broadcom.com>

David Christensen wrote:
>> > Thanks Pyun but I'm really just looking for a way to test that I can
>> > handle the number of segments I've advertised that I can support. I
>> > believe my code is correct but when all I see are allocations of 3
>> > segments I just can't prove it. I was hoping that running a utility
>> > such as "stress" would help fragment memory and force more variable
>> > responses but that hasn't happened yet.
>> >
>>
>> It seems you've used the following code to create jumbo dma tag.
>>     /*
>>      * Create a DMA tag for RX mbufs.
>>      */
>>     if (bus_dma_tag_create(sc->parent_tag,
>>         1,
>>         BCE_DMA_BOUNDARY,
>>         sc->max_bus_addr,
>>         BUS_SPACE_MAXADDR,
>>         NULL, NULL,
>>         MJUM9BYTES,
>>         BCE_MAX_SEGMENTS,
>>         MJUM9BYTES,
>>         ^^^^^^^^^^
>>         0,
>>         NULL, NULL,
>>         &sc->rx_mbuf_tag)) {
>>             BCE_PRINTF("%s(%d): Could not allocate RX mbuf DMA tag!\n",
>>                 __FILE__, __LINE__);
>>             rc = ENOMEM;
>>             goto bce_dma_alloc_exit;
>>     }
>> If you want to have > 9 dma segments change maxsegsz(MJUM9BYTES) to
>> 1024. bus_dma honors maxsegsz argument so you wouldn't get a dma
>> segments larger than maxsegsz. With MJUM9BYTES maxsegsz you would get
>> up to 4 dma segments on systems with 4K PAGE_SIZE.(You would have
>> got up to 3 dma segments if you used PAGE_SIZE alignment argument.)
>
> I don't want more segments, I just want to get a distribution of
> segments up to the max size I specified. For example, since my
> BCE_MAX_SEGMENTS size is 8, I want to make sure I get mbufs that are
> spread over 1, 2, 3, 4, 5, 6, 7, and 8 segments.
>
> It turns out if I reduce the amount of memory in the system (from 8GB to
> 2GB) I will get more mbufs coalesced into 2 segments, rather than the
> more typical 3 segments, but that's good enough for my testing now.
>

Dave,

I'm trying to catch up on this thread, but I'm utterly confused as to
what you're looking for. Let's try talking through a few scenarios
here:

1. Your hardware has slots for 3 SG elements, and all three MUST be
   filled without exception. Therefore, you want segments that are 4k,
   4k, and 1k (or some slight variation of that if the buffer is
   misaligned). To do this, set the maxsegs to 3 and the maxsegsize to
   4k. This will ensure that busdma does no coalescing (more on this
   topic later) and will always give you 3 segments for 9k of
   contiguous buffers. If the actual buffer winds up being <= 8k,
   busdma won't guarantee that you'll get 3 segments, and you'll have
   to fake something up in your driver. If the buffer winds up being a
   fragmented mbuf chain, busdma won't guarantee that you'll get 3
   segments either, but that's already handled now via m_defrag().

2. Your hardware can only handle 4k segments, but is less restrictive
   on the min/max number of segments. The solution is the same as
   above.

3. Your hardware has slots for 8 SG elements, and all 8 MUST be filled
   without exception. There's no easy solution for this, as it's a
   fairly bizarre situation. I'll only discuss it further if you
   confirm that it's actually the case here.

As for coalescing segments, I'm considering a new busdma back-end that
greatly streamlines loads by eliminating cycle-consuming tasks like
segment coalescing. The original justification for coalescing was that
DMA engines operated faster with fewer segments. That might still be
true, but the extra host CPU cycles and cache-line misses probably
result in a net loss. I'm also going to axe bounce-buffer support since
it bloats the I-cache. The target for this new back-end is drivers for
hardware that doesn't need these services and that are also sensitive
to the amount of host CPU cycles being consumed, i.e. modern 1Gb and
10Gb adapters.

The question I have is whether this new back-end should be accessible
directly through yet another bus_dmamap_load_foo variant that the
drivers need to know specifically about, or indirectly and
automatically via the existing bus_dmamap_load_foo variants. The
tradeoff is further API pollution vs the opportunity for even more
efficiency through no indirect function calls and no cache misses from
accessing the busdma tag. I don't like API pollution since it makes it
harder to maintain code, but the opportunity for the best performance
possible is also appealing.

Scott
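
The segment arithmetic discussed in scenarios 1 and 2 can be checked with a small userland model. This is only a sketch, not busdma code: `model_segment_count` is a hypothetical helper, and it assumes the worst case the thread describes, namely that the buffer is split at every page boundary with no coalescing and that no segment exceeds the maximum segment size. The 4096-byte page and 9216-byte (9k) buffer sizes follow the figures quoted in the thread.

```c
#include <assert.h>
#include <stddef.h>

/*
 * Hypothetical userland model (not a busdma API): count the segments
 * produced when a virtually contiguous buffer is split at every page
 * boundary, with no coalescing, and each segment is capped at maxsegsz.
 *
 * page_offset: offset of the buffer start within its first page
 * len:         buffer length in bytes
 * page_size:   assumed page size (4096 in the thread's examples)
 * maxsegsz:    maximum segment size, as given to the DMA tag
 */
static size_t
model_segment_count(size_t page_offset, size_t len, size_t page_size,
    size_t maxsegsz)
{
    size_t nsegs = 0;

    assert(page_size > 0 && maxsegsz > 0);
    page_offset %= page_size;

    while (len > 0) {
        /* Bytes remaining before the next page boundary. */
        size_t seg = page_size - page_offset;

        /* Honor the maximum segment size. */
        if (seg > maxsegsz)
            seg = maxsegsz;
        /* The last segment may be short. */
        if (seg > len)
            seg = len;

        len -= seg;
        page_offset = (page_offset + seg) % page_size;
        nsegs++;
    }
    return (nsegs);
}
```

Under these assumptions, a page-aligned 9k buffer with a 4k maximum segment size yields 4k + 4k + 1k, so `model_segment_count(0, 9216, 4096, 4096)` returns 3. A buffer that starts late in its first page spills into a fourth segment, an 8k buffer yields only 2, and a 1024-byte maximum segment size yields 9.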