From: Hans Petter Selasky <hselasky@c2i.net>
To: freebsd-arch@freebsd.org, John-Mark Gurney
Date: Wed, 26 Sep 2007 18:44:55 +0200
In-Reply-To: <20070926045401.GB47467@funkthat.com>
Message-Id: <200709261844.56182.hselasky@c2i.net>
Subject: Re: Request for feedback on common data backstore in the kernel

Hi John-Mark,

See my comments below.

On Wednesday 26 September 2007, John-Mark Gurney wrote:
> Hans Petter Selasky wrote this message on Wed, Sep 26, 2007 at 01:31 +0200:
> > Please keep me CC'ed, since I'm not on all these lists.
> >
> > In the kernel we currently have two different data backstores:
> >
> > struct mbuf
> >
> > and
> >
> > struct buf
> >
> > These two backstores serve two different device types. "mbufs" are for
> > network devices and "buf" is for disk devices.
>
> I don't see how this relates to the rest of your email, but even though
> they are used similarly, their normal size is quite different... mbufs
> normally contain 64-256 byte packets, w/ large file transfers attaching
> a 2k cluster (which comes from a different pool than the core mbuf) to
> the mbuf... buf is usually something like 16k-64k...
>
> > Problem:
> >
> > The current backstores are loaded into DMA by using the BUS-DMA
> > framework. This appears not to be too fast according to Kip Macy. See:
> >
> > http://perforce.freebsd.org/chv.cgi?CH=126455
>
> This only works on x86/amd64 because of the direct mapped memory that
> they support.. This would completely break arches like sparc64 that
> require an iommu to translate the addresses... and also doesn't address
> keeping the buffers in sync on arches like arm... sparc64 may have many
> gigs of memory, but only a 2GB window for mapping main memory...
>
> It sounds like the x86/amd64 bus_dma implementation needs to be improved
> to run more quickly... As w/ all things, you can hardcode stuff, but then
> you lose portability...

Correct.

> > Some ideas I have:
> >
> > When a buffer is out of range for a hardware device and a data-copy
> > is needed, I want to simply copy that data in smaller parts to/from a
> > pre-allocated bounce buffer. I want to avoid allocating this buffer
> > when "bus_dmamap_load()" is called.
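To make that idea a bit more concrete, here is a rough sketch. The names are
made up and this is not how the USB stack does it today; the point is only
that the bounce page gets allocated once, at attach time, and the transfer is
then copied through it in chunks:

    /* allocated once at attach time, e.g. via bus_dmamem_alloc() */
    struct xfer_bounce {
        void        *buf;      /* kernel virtual address */
        bus_addr_t   busaddr;  /* device visible address of "buf" */
        uint32_t     size;     /* typically PAGE_SIZE */
    };

    /* copy "len" bytes from "src" through the bounce page, chunk by chunk */
    static void
    xfer_bounce_out(struct xfer_bounce *bb, const uint8_t *src, uint32_t len)
    {
        uint32_t part;

        while (len > 0) {
            part = (len > bb->size) ? bb->size : len;
            bcopy(src, bb->buf, part);
            /* hand "part" bytes at bb->busaddr to the hardware here */
            src += part;
            len -= part;
        }
    }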
> > For pre-allocated USB DMA memory I currently have:
> >
> > struct usbd_page
> >
> > struct usbd_page {
> >   void *buffer;         // virtual address
> >   bus_size_t physaddr;  // as seen by one of my devices
> >   bus_dma_tag_t tag;
> >   bus_dmamap_t map;
> >   uint32_t length;
> > };
> >
> > Mostly only "length == PAGE_SIZE" is allowed. When USB allocates DMA
> > memory it always allocates the same size, and that is PAGE_SIZE bytes.
>
> I could see attaching preallocated memory to a tag, and having maps
> that attempt to use this memory, but that's something else...
>
> > If two different PCI controllers want to communicate directly passing DMA
> > buffers, technically one would need to translate the physical address for
> > device 1 to the physical address as seen by device 2. If this translation
> > table is sorted, the search will be rather quick. Another approach is to
> > limit the number of translations:
> >
> > #define N_MAX_PCI_TRANSLATE 4
> >
> > struct usbd_page {
> >   void *buffer;         // virtual address
> >   bus_size_t physaddr[N_MAX_PCI_TRANSLATE];
> >   bus_dma_tag_t tag;
> >   bus_dmamap_t map;
> >   uint32_t length;
> > };
> >
> > Then PCI device 1 on bus X can use physaddr[0] and PCI device 2 on bus Y
> > can use physaddr[1]. If the physaddr[] entry is equal to some magic then
> > the DMA buffer is not reachable and must be bounced.
> >
> > Then when two PCI devices talk together all they need to pass is a
> > structure like this:
> >
> > struct usbd_page_cache {
> >   struct usbd_page *page_start;
> >   uint32_t page_offset_buf;
> >   uint32_t page_offset_end;
> > };
> >
> > And the required DMA address is looked up in a few nanoseconds.
> >
> > Has someone been thinking about this topic before?
>
> There is no infrastructure to support passing DMA addresses between
> hardware devices, and it is completely unrelated to the issues raised
> above... This requires the ability to pass in a map to a tag and create a
> new map... It is possible, as on the sun4v where you have two iommu's..
> You'd have to program one iommu to point to the other one, to support
> that... But it is rare to see devices dma directly to each other... You
> usually end up dma'ing to main memory, and then having the other device
> dma it out of memory.. The only time you need to dma between devices is if
> one has local memory, and the other device is able to sanely populate
> it... This is very rare...

What I meant was that USB DMAs directly into main memory. But then another
PCI device, like an Ethernet device, might want to forward that data by
dma'ing it out of main memory. The DMA address as seen by the two different
PCI devices might not be the same.

I admit that I'm not an expert on how DMA is done on the Sparc, but could
you explain a little bit more how an mbuf is loaded into DMA for a network
card on the Sparc?

What I'm looking for is a function that transforms a virtual memory address
and a bus-DMA tag into a physical address without blocking. If a mapping is
not possible, I want an error to be returned so that I can bounce the data
using a pre-allocated buffer, and not a buffer allocated by bus_dma on the
fly.
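Roughly, the interface I have in mind would look like this. To be clear,
such a function does not exist in bus_dma today; the name and the error
handling below are only meant to illustrate what I am asking for:

    /*
     * Hypothetical: translate "len" bytes at kernel virtual address "va"
     * into an address visible to the device behind "tag", without sleeping
     * and without allocating anything. Returns 0 on success, or an errno
     * when the memory is not reachable by the device.
     */
    int bus_dmamap_load_direct(bus_dma_tag_t tag, void *va,
        bus_size_t len, bus_addr_t *busaddr);

    /* usage, as I imagine it: */
    if (bus_dmamap_load_direct(tag, data, len, &busaddr)) {
        /* not reachable - bounce through the pre-allocated buffer
         * (as in the earlier sketch) instead of allocating on the fly */
        bcopy(data, bb->buf, len);
        busaddr = bb->busaddr;
    }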
> Also, the PCI bus list can get quite long.. With PCIe, each device is
> now its own PCI bus, so you're starting to see PCI bus counts in the
> 10's and 20's, if not higher.. having an array of all of those, and
> calculating them and filling them out sounds like a huge expense...
>
> I'm a bit puzzled as to what you wanted to solve, as the problem you
> stated doesn't relate to the solutions you were thinking about... Maybe
> I'm missing something? Can you give me an example of where cxgb is
> writing to the memory on another pci bus, and not main memory?

This is not the case with USB. The only example I know of is some TV-cards
that write directly into the frame buffer of the video-card.

--HPS