From: Terry Lambert
Date: Mon, 08 Apr 2002 21:49:57 -0700
To: Andrew Gallatin
Cc: freebsd-hackers@freebsd.org
Subject: Re: performance of mbufs vs contig buffers?

Andrew Gallatin wrote:
> > My other guess would be that the clusters you are dealing
> > with are non-contiguous.  This has both scatter/gather
> > implications, and cache-line implications when using them.
>
> Please elaborate...  What sort of scatter/gather implications?
> Microbenchmarks don't show much of a difference DMA'ing to
> non-contiguous vs. contiguous pages (over 400 MB/sec in all cases).
> Also, we get close to link speed DMA'ing to user space, and with
> page coloring, that virtually guarantees that the pages are not
> physically contiguous.

L2 cache busting would be an immediate result of scatter/gather DMA.
And once you exceed the pool size, you would lose considerable speed
to wait states.  In general, cache lines are much larger than mbuf
cluster sizes.
> Based on the UDP behaviour, I think that it's cache implications.  The
> bottleneck seems to be when copyout() reads the recently DMA'ed data.
> The driver reads the first few dozen bytes (so as to touch up the csum
> by subtracting off the extra bits the DMA engines added in).  We do
> hardware csum offloading, so the entire packet is not read until
> copyout() is called.

I don't understand the copyout requirement here...

> I seem to remember you talking about seeing a 10% speedup from using
> 4MB pages for cluster mbufs.  How did you do that?  I'd like to see
> what effect it has with this workload.

I allocated them at system startup time, in machdep.c, out of
contiguous physical memory, and then established 4M mappings for the
data.  Then I linked all the mbufs onto the mbuf free list, so that
allocations would use my mbufs.

The benefit was the reduction in the amount of TLB thrashing that
otherwise occurred.  The overall speedup was closer to 16%.

-- Terry