From: John Baldwin
To: Jeff Roberson
Cc: arch@freebsd.org
Subject: Re: sglist(9)
Date: Wed, 20 May 2009 15:22:58 -0400
Message-Id: <200905201522.58501.jhb@freebsd.org>

On Wednesday 20 May 2009 2:49:30 pm Jeff Roberson wrote:
> On Tue, 19 May 2009, John Baldwin wrote:
>
> > So one of the things I worked on while hacking away at unmapped disk
> > I/O requests was a little API to manage scatter/gather lists of
> > physical addresses.  The basic premise is that a sglist describes a
> > logical object that is backed by one or more physical address ranges.
> > To minimize locking, the sglist objects themselves are immutable once
> > they are shared.  The unmapped disk I/O project is still very much a
> > WIP (and I'm not even working on any of the really hard bits myself).
> > However, I actually found this object to be useful for something else
> > I have been working on: the mmap() extensions for the Nvidia amd64
> > driver.  For the Nvidia patches I have created a new type of VM object
> > that is very similar to OBJT_DEVICE objects except that it uses a
> > sglist to determine the physical pages backing the object instead of
> > calling the d_mmap() method for each page.  Anyway, adding this little
> > API is just the first in a series of patches needed for the Nvidia
> > driver work.  I plan to MFC them to 7.x relatively soon in the hopes
> > that we can soon have a supported Nvidia driver on amd64 on 7.x.
> >
> > The current patches for all the Nvidia stuff are at
> > http://www.FreeBSD.org/~jhb/pat/
> >
> > This particular patch to just add the sglist(9) API is at
> > http://www.FreeBSD.org/~jhb/patches/sglist.patch and is slightly more
> > polished in that it includes a manpage. :)
>
> I have a couple of minor comments:
>
> 1) SGLIST_APPEND() contains a return() within a macro.  Shouldn't this
> be an inline that returns an error code that is always checked?  These
> kinds of macros get people into trouble.  It also could be written in
> such a way that you don't have to handle nseg == 0 at each callsite and
> then it's big enough that it probably shouldn't be a macro or an inline.

Mostly I was trying to avoid having to duplicate a lot of code.  The
reason I didn't handle nseg == 0 directly is a possibly dubious attempt
to optimize the _sglist_append() inline so that it doesn't have to do the
extra branch inside the main loop for virtual address regions that span
multiple pages.

> 2) I worry that if all users do sglist_count() followed by a dynamic
> allocation and then an _append() they will be very expensive.
> pmap_kextract() is much more expensive than it may first seem to be.  Do
> you have a user of count already?

The only one that does now is sglist_build(), and nothing currently uses
that.  VOP_GETPAGES/VOP_PUTPAGES would not need to do this since they
could simply append the physical addresses extracted directly from
vm_page_t's, for example.  I'm not sure this pattern will be used very
much now.  Originally I thought I would be changing all storage drivers
to do all DMA operations using sglists, and this sort of thing would have
been used for non-bio requests like firmware commands; however, as
expounded on below, it actually appears better to keep treating bio's
separately from non-bio requests for bus_dma so that the non-bio requests
can continue to use bus_dmamap_load_buffer() as they do now.
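Roughly, the count-then-append pattern looks like the sketch below (a
simplified illustration rather than the exact code in the patch; the
helper name is made up, and the allocation/free routines are assumed to
behave as the manpage describes).  The expensive part is that both
sglist_count() and sglist_append() walk the buffer a page at a time, so
pmap_kextract() ends up being called twice per page:

	#include <sys/param.h>
	#include <sys/malloc.h>
	#include <sys/sglist.h>

	static struct sglist *
	example_build_sglist(void *buf, size_t len)
	{
		struct sglist *sg;
		int nsegs;

		/* Pass 1: pmap_kextract() each page just to size the list. */
		nsegs = sglist_count(buf, len);
		sg = sglist_alloc(nsegs, M_WAITOK);

		/* Pass 2: pmap_kextract() again to fill in the segments. */
		if (sglist_append(sg, buf, len) != 0) {
			sglist_free(sg);
			return (NULL);
		}
		return (sg);
	}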
> 3) Rather than having sg_segs be an actual pointer, did you consider
> making it an unsized array?  This removes the overhead of one pointer
> from the structure while enforcing that it's always contiguously
> allocated.

It's actually a feature to be able to have the header in separate storage
from the segs array.  I use this in the jhb_bio branch in the bus_dma
implementations, where a pre-allocated segs array is stored in the bus
dma tag and the header is allocated on the stack.

> 4) SGLIST_INIT might be better off as an inline, and may not even belong
> in the header file.

That may be true.  I currently only use it in the jhb_bio branch for the
bus_dma implementations.
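Roughly, that split looks like the sketch below (a simplified
illustration, not the actual structures or macro from the patch; aside
from sg_segs, the field, type, and helper names here are made up for the
example):

	#include <sys/param.h>

	/* A segment is just a contiguous physical address range. */
	struct sglist_seg {
		vm_paddr_t	ss_paddr;	/* start of the range */
		size_t		ss_len;		/* length of the range */
	};

	/* The header points at a segs array that can live elsewhere. */
	struct sglist {
		struct sglist_seg *sg_segs;	/* backing segment array */
		u_short		sg_nseg;	/* segments in use */
		u_short		sg_maxseg;	/* capacity of sg_segs */
	};

	#define	EXAMPLE_MAXSEGS	32

	/* The segs array is pre-allocated once, e.g. hung off a dma tag. */
	struct example_tag {
		struct sglist_seg ex_segs[EXAMPLE_MAXSEGS];
	};

	static void
	example_start_request(struct example_tag *tag)
	{
		struct sglist sg;

		/*
		 * Build a throw-away header on the stack for this request;
		 * this is the job SGLIST_INIT does in the jhb_bio branch.
		 */
		sg.sg_segs = tag->ex_segs;
		sg.sg_nseg = 0;
		sg.sg_maxseg = EXAMPLE_MAXSEGS;

		/* ... append the request's pages and program the hardware ... */
	}

If sg_segs were an unsized trailing array instead, the header and the
segment storage would always have to be allocated together, and this
per-request stack header wouldn't work.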
> In general I think this is a good idea.  It'd be nice to work on
> replacing the buf layer's implementation with something like this that
> could be used directly by drivers.  Have you considered a busdma
> operation to load from a sglist?

So in regards to the bus_dma stuff, I did work on this a while ago in my
jhb_bio branch.  I do have a bus_dmamap_load_sglist() and I had planned
on using that in storage drivers directly.  However, I ended up circling
back to preferring a bus_dmamap_load_bio() and adding a new 'bio_start'
field to 'struct bio' that is an offset into an attached sglist.  This
lets me carve up I/O requests in geom_dev to satisfy a disk device's
maximum request size while still sharing the same read-only sglist across
the various BIO's (by simply adjusting bio_length and bio_start to
describe a subrange of the sglist), as opposed to allocating a separate
sglist for each I/O request's range (using something like
sglist_slice()).  bus_dmamap_load_bio() then uses the subrange of the
sglist internally, or falls back to using the KVA pointer if the sglist
isn't present.

However, I'm not really trying to get the bio stuff into the tree; this
is mostly for the Nvidia case, and for that use case the driver simply
creates single-entry lists using sglist_append_phys().  An example of
doing something like this is in my sample patdev test module, where it
creates a VM object that maps the local APIC uncacheable like so:

	/* Create a scatter/gather list that maps the local APIC. */
	sc->sg = sglist_alloc(1, M_WAITOK);
	sglist_append_phys(sc->sg, lapic_paddr, LAPIC_LEN);

	/* Create a VM object that is backed by the scatter/gather list. */
	sc->sgobj = vm_pager_allocate(OBJT_SG, sc->sg, LAPIC_LEN,
	    VM_PROT_READ, 0);
	VM_OBJECT_LOCK(sc->sgobj);
	vm_object_set_cache_mode(sc->sgobj, VM_CACHE_UNCACHEABLE);
	VM_OBJECT_UNLOCK(sc->sgobj);

The same approach can be used to map PCI BARs, etc. into userland as
well.

--
John Baldwin