Date: Thu, 20 Jun 2002 12:25:58 -0400 (EDT)
From: Andrew Gallatin <gallatin@cs.duke.edu>
To: Bosko Milekic <bmilekic@unixdaemons.com>
Cc: "Kenneth D. Merry" <ken@kdm.org>, current@FreeBSD.ORG, net@FreeBSD.ORG
Subject: Re: new zero copy sockets snapshot
Message-ID: <15634.534.696063.241224@grasshopper.cs.duke.edu>
In-Reply-To: <20020620114511.A22413@unixdaemons.com>
References: <20020618223635.A98350@panzer.kdm.org> <xzpelf3ida1.fsf@flood.ping.uio.no> <20020619090046.A2063@panzer.kdm.org> <20020619120641.A18434@unixdaemons.com> <15633.17238.109126.952673@grasshopper.cs.duke.edu> <20020619233721.A30669@unixdaemons.com> <15633.62357.79381.405511@grasshopper.cs.duke.edu> <20020620114511.A22413@unixdaemons.com>

Bosko Milekic writes:
 > > Years ago, I used Wollman's MCLBYTES > PAGE_SIZE support (introduced
 > > in rev 1.20 of uipc_mbuf.c) and it seemed to work OK then.  But having
 > > 16K clusters is a huge waste of space. ;)
 >
 >   Since then, the mbuf allocator in -CURRENT has totally changed.  It is
 > still possible to provide allocations of > PAGE_SIZE buffers, however
 > they will likely not map physically contiguous memory.  If you happen to
 > have a device that doesn't support scatter/gather for DMA, then these
 > buffers will be broken for it (I know that if_ti is not a problem).

Actually, it will be a problem for if_ti.  The original Tigon 1s didn't
support s/g DMA.  I think we should just not support jumbo frames on
Tigon 1s.

 >   The other issue is that both the old mbuf allocator and the new one
 > use the kmem_malloc() interface that malloc() also uses to allocate
 > wired-down pages.  I am not sure if you'll be able to play those tricks
 > where you unmap and remap the page that is allocated for you once it
 > comes out of the mbuf allocator.  Do you think it would work?

I don't think so, but I haven't read the code carefully and I don't
know for certain.  However, my intent was to use a jumbo mbuf type for
copyin and to clean up the existing infrastructure for drivers with
brain-dead firmware, not to use a new 10K cluster as a framework for
zero-copy.

 > > Do you think it would be feasible to glue in a new jumbo (10K?)
 > > allocator on top of the existing mbuf and mcl allocators using the
 > > existing mechanisms and the existing MCLBYTES > PAGE_SIZE support
 > > (but broken out into separate functions and macros)?
 >
 >   Assuming that you can still play those VM tricks with the pages spit
 > out by mb_alloc (kern/subr_mbuf.c in -CURRENT), then this wouldn't be a
 > problem at all.  It's easy to add a new fixed-size type allocation to
 > mb_alloc.  In fact, it would be beneficial.  mb_alloc uses per-CPU
 > caches and also makes mbuf and cluster allocations share the same
 > per-CPU lock.  What could be done is that the jumbo buffer allocations
 > could share the same lock as well (since they will likely usually be
 > allocated right after an mbuf is).  This would give us jumbo-cluster
 > support, but it would only be useful for devices clued enough to break
 > up the cluster into PAGE_SIZE chunks and do scatter/gather.  For most
 > worthy gigE devices, I don't think this should be a problem.

I'm a bit worried about other devices.  Traditionally, mbufs have
never crossed page boundaries, so most drivers never bother to check
whether a transmit mbuf crosses a page boundary.  Using physically
discontiguous mbufs could lead to a lot of subtle data corruption.
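
To make that concrete, here is a rough, untested sketch of what a
"clued" driver would have to do: walk the mbuf data page by page and
start a new DMA segment whenever the next page is not physically
adjacent.  The dma_seg structure and mbuf_to_segs() are made-up names
for illustration, not code from if_ti or any real driver; only mtod(),
vtophys(), PAGE_SIZE, and PAGE_MASK are assumed to exist, and the
headers may not be exactly right.

/*
 * Hypothetical sketch: build a DMA segment list for an mbuf whose data
 * may span page boundaries, starting a new segment whenever the next
 * page is not physically adjacent to the previous one.
 */
#include <sys/param.h>
#include <sys/mbuf.h>
#include <vm/vm.h>
#include <vm/pmap.h>		/* vtophys(), possibly <machine/pmap.h> */

struct dma_seg {		/* made-up segment descriptor */
	vm_offset_t	ds_paddr;
	int		ds_len;
};

static int
mbuf_to_segs(struct mbuf *m, struct dma_seg *segs, int maxsegs)
{
	vm_offset_t va = mtod(m, vm_offset_t);
	int resid = m->m_len;
	int nsegs = 0;

	while (resid > 0) {
		/* Bytes left before the end of the current page. */
		int chunk = PAGE_SIZE - (va & PAGE_MASK);
		vm_offset_t pa = vtophys(va);

		if (chunk > resid)
			chunk = resid;
		if (nsegs > 0 &&
		    segs[nsegs - 1].ds_paddr + segs[nsegs - 1].ds_len == pa) {
			/* Next page is physically adjacent; extend. */
			segs[nsegs - 1].ds_len += chunk;
		} else {
			if (nsegs == maxsegs)
				return (-1);	/* buffer too scattered */
			segs[nsegs].ds_paddr = pa;
			segs[nsegs].ds_len = chunk;
			nsegs++;
		}
		va += chunk;
		resid -= chunk;
	}
	return (nsegs);
}

A driver that assumes an mbuf never crosses a page boundary effectively
does vtophys(mtod(m, ...)) once and uses m_len for the whole transfer;
with physically discontiguous jumbo mbufs, that is exactly where the
silent corruption would come from.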
One question: I've observed some really anomalous behaviour under
-stable with my Myricom GM driver (2Gb/s + 2Gb/s link speed, dual
1GHz PIII).  When I use 4K mbufs for receives, the best speed I see
is about 1300Mb/sec.  However, if I use private 9K physically
contiguous buffers, I see 1850Mb/sec (iperf TCP).

The obvious conclusion is that there's a lot of overhead in setting up
the DMA engines, but that's not the case; we have a fairly quick chain
DMA engine.  I've provided a "control" by breaking my contiguous
buffers down into 4K chunks so that I do the same number of DMAs in
both cases, and I still see ~1850Mb/sec for the 9K buffers.

A coworker suggested that the problem is that when doing copyouts to
userspace, the PIII does speculative reads and loads the cache with
the next page.  With discontiguous buffers, however, we then start
copying from a totally different address, so we effectively take 2x
the number of cache misses we need to.  Does that sound reasonable to
you?  I need to try malloc'ing virtually contiguous but physically
discontiguous buffers and see if I get the same (good) performance...
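
For reference, something like the following is what I have in mind for
that test.  It's a rough, untested sketch written from memory; the
contigmalloc() argument order, the headers, and the made-up names
is_phys_contig() and jumbo_alloc_test() may well be off.

/*
 * Allocate a 9K buffer two ways: malloc(9) gives something virtually
 * contiguous whose pages are probably physically scattered, while
 * contigmalloc(9) serves as the physically contiguous control.  The
 * vtophys() walk just reports which flavor we actually got.
 */
#include <sys/param.h>
#include <sys/systm.h>
#include <sys/kernel.h>
#include <sys/malloc.h>
#include <vm/vm.h>
#include <vm/pmap.h>

#define JUMBO_LEN	(9 * 1024)

static int
is_phys_contig(caddr_t buf, int len)
{
	vm_offset_t va = (vm_offset_t)buf;
	vm_offset_t expect = vtophys(va);

	while (len > 0) {
		int chunk = PAGE_SIZE - (va & PAGE_MASK);

		if (chunk > len)
			chunk = len;
		if (vtophys(va) != expect)
			return (0);
		va += chunk;
		expect += chunk;
		len -= chunk;
	}
	return (1);
}

static void
jumbo_alloc_test(void)
{
	caddr_t scattered, contig;

	/* Virtually contiguous; physical pages land wherever. */
	scattered = malloc(JUMBO_LEN, M_DEVBUF, M_NOWAIT);

	/* Physically contiguous, page-aligned, anywhere below 4GB. */
	contig = contigmalloc(JUMBO_LEN, M_DEVBUF, M_NOWAIT,
	    0, 0xffffffff, PAGE_SIZE, 0);

	if (scattered != NULL)
		printf("malloc buf physically contiguous: %d\n",
		    is_phys_contig(scattered, JUMBO_LEN));
	if (contig != NULL)
		printf("contigmalloc buf physically contiguous: %d\n",
		    is_phys_contig(contig, JUMBO_LEN));

	/* Next step: hang each flavor off the receive ring and rerun iperf. */
}

If the malloc'ed (virtually contiguous, physically scattered) buffers
still get ~1850Mb/sec, that would point at the copyout access pattern
rather than DMA setup or physical contiguity.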

Cheers,

Drew