From: Scott Long <scottl@samsco.org>
In-Reply-To: <3bbf2fe10911291429k54b4b7cfw9e40aefeca597307@mail.gmail.com>
Date: Sun, 29 Nov 2009 16:27:49 -0700
Message-Id: <66707B0F-D0AB-49DB-802F-13146F488E1A@samsco.org>
References: <200905191458.50764.jhb@freebsd.org> <200905201522.58501.jhb@freebsd.org> <3bbf2fe10911291429k54b4b7cfw9e40aefeca597307@mail.gmail.com>
To: John Baldwin
Cc: Attilio Rao, arch@freebsd.org
Subject: Re: sglist(9)

John,

Sorry for the late reply on this.  Attilio approached me recently about
moving busdma and storage to sglists; up until then I had largely
ignored this conversation because I thought that it was only about the
nvidia driver.

> On Wednesday 20 May 2009 2:49:30 pm Jeff Roberson wrote:
>> On Tue, 19 May 2009, John Baldwin wrote:
>>
>> 2) I worry that if all users do sglist_count() followed by a dynamic
>> allocation and then an _append() they will be very expensive.
>> pmap_kextract() is much more expensive than it may first seem to be.
>> Do you have a user of count already?
>
> The only one that does now is sglist_build() and nothing currently
> uses that.

Kinda silly to have it then.  I also don't see the point of it; if the
point of the sglist object is to avoid VA mappings, then why start with
a VA mapping?  But that aside, Jeff is correct, sglist_build() is
horribly inefficient.

> VOP_GET/PUTPAGES would not need to do this since they could simply
> append the physical addresses extracted directly from vm_page_t's for
> example.  I'm not sure this will be used very much now as originally I
> thought I would be changing all storage drivers to do all DMA
> operations using sglists and this sort of thing would have been used
> for non-bio requests like firmware commands; however, as expounded on
> below, it actually appears better to still treat bio's separate from
> non-bio requests for bus_dma so that the non-bio requests can continue
> to use bus_dmamap_load_buffer() as they do now.

I completely disagree with your approach to busdma, but I'll get into
that later.  What I really don't understand is, why have yet another
page description format?  Whether the I/O is disk buffers being pushed
by a pager to a disk controller, or texture buffers being pushed to
video hardware, they already have vm objects associated with them, no?
Why translate to an intermediate format?  I understand that you're
creating another vm object format to deal directly with the needs of
nvidia, but that's really a one-off case right now.  What about when we
want the pagedaemon to push unmapped i/o?  Will it have to spend cycles
translating its objects to sglists?  This is really a larger question
that I'm not prepared to answer, but would like to discuss.
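Just to make that cost concrete, the translation step I'm questioning
would look roughly like the loop below: one sglist_append_phys() call
per resident page, on top of whatever work the producer already did to
collect the pages.  This is only a sketch; the helper name is made up,
but sglist_append_phys() and VM_PAGE_TO_PHYS() are the interfaces
involved.

#include <sys/param.h>
#include <sys/sglist.h>
#include <vm/vm.h>
#include <vm/vm_page.h>

/*
 * Copy the physical addresses of an array of vm pages into an sglist.
 * Purely illustrative: this is the per-page translation cost, not code
 * from John's patch.
 */
static int
pages_to_sglist(vm_page_t *m, int npages, struct sglist *sg)
{
	int error, i;

	for (i = 0; i < npages; i++) {
		/* One append (and possible coalesce check) per page. */
		error = sglist_append_phys(sg, VM_PAGE_TO_PHYS(m[i]),
		    PAGE_SIZE);
		if (error != 0)
			return (error);
	}
	return (0);
}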
>> 3) Rather than having sg_segs be an actual pointer, did you consider
>> making it an unsized array?  This removes the overhead of one pointer
>> from the structure while enforcing that it's always contiguously
>> allocated.
>
> It's actually a feature to be able to have the header in separate
> storage from the segs array.  I use this in the jhb_bio branch in the
> bus_dma implementations where a pre-allocated segs array is stored in
> the bus dma tag and the header is allocated on the stack.

I'd like to take this one step further.  Instead of sglists being
exactly sized, I'd like to see them be much more like mbufs, with a
header and static storage, maybe somewhere between 128b and 1k in total
size.  Then they can be allocated and managed in pools, and chained
together to make for easy appending, splitting, and growing.  Offset
pointers can be stored in the header instead of externally.  Also,
there are a lot of failure points in the API regarding the sglist
object being too small.  Those need to be fixed.

>> 4) SGLIST_INIT might be better off as an inline, and may not even
>> belong in the header file.
>
> That may be true.  I currently only use it in the jhb_bio branch for
> the bus_dma implementations.
>
>> In general I think this is a good idea.  It'd be nice to work on
>> replacing the buf layer's implementation with something like this
>> that could be used directly by drivers.  Have you considered a busdma
>> operation to load from a sglist?
>
> So in regards to the bus_dma stuff, I did work on this a while ago in
> my jhb_bio branch.  I do have a bus_dmamap_load_sglist() and I had
> planned on using that in storage drivers directly.  However, I ended
> up circling back to preferring a bus_dmamap_load_bio() and adding a
> new 'bio_start' field to 'struct bio' that is an offset into an
> attached sglist.

I strongly disagree with forcing busdma to have intimate knowledge of
bio's.  All of the needed information can be stored in sglist headers.

> This let me carve up I/O requests in geom_dev to satisfy a disk
> device's max request size while still sharing the same read-only
> sglist across the various BIO's (by simply adjusting bio_length and
> bio_start to be a subrange of the sglist) as opposed to doing memory
> allocations to allocate specific ranges of an sglist (using something
> like sglist_slice()) for each I/O request.

I think this is fundamentally wrong.  You're proposing exchanging a
cheap operation of splitting VA's for an expensive operation of
allocating, splitting, copying, and refcounting sglists.  Splitting is
an excessively common operation, and your proposal will impact
performance as storage becomes exponentially faster.  We need to stop
thinking about maxio as a roadbump at the bottom of the storage stack,
and instead think of it as a fundamental attribute that is honored at
the top when a BIO is created.  Instead of loading up an sglist with
all of the pages (and don't forget coalesced pages that might need to
be broken up), maybe multiple bio's are created that honor maxio from
the start, or a single bio with a chained sglist, with each chain link
honoring maxio, allowing for easy splitting.
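To show the shape I have in mind, and purely as a sketch (none of these
names exist anywhere, and the element count is hand-waving), a
fixed-size, chainable sglist link might look like this:

#include <sys/param.h>
#include <vm/vm.h>

#define	SGL_LINK_NSEGS	16	/* tuned so one link is a few hundred bytes */

struct sgl_seg {
	vm_paddr_t	ss_paddr;	/* physical address of the segment */
	size_t		ss_len;		/* length of the segment in bytes */
};

struct sgl_link {
	struct sgl_link	*sl_next;	/* next link in the chain, or NULL */
	u_int		 sl_nsegs;	/* valid entries in sl_segs[] */
	size_t		 sl_offset;	/* byte offset into the first segment */
	size_t		 sl_length;	/* total bytes described by this link */
	struct sgl_seg	 sl_segs[SGL_LINK_NSEGS]; /* inline segment storage */
};

With something like that, honoring maxio becomes a matter of pointing a
consumer at a chain link, or adjusting sl_offset and sl_length, rather
than allocating, copying, and refcounting a fresh list, and fixed-size
links can be pooled the same way mbufs are.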
> I then have bus_dmamap_load_bio() use the subrange of the sglist
> internally or fall back to using the KVA pointer if the sglist isn't
> present.

I completely disagree.  Drivers already deal with the details of bio's,
and should continue to do so.  If a driver gets a bio that has a valid
bio_data pointer, it should call bus_dmamap_load().  If it gets one
with a valid sglist, it should call bus_dmamap_load_sglist() (roughly
sketched in the P.S. below).  Your proposal means that every storage
driver in the system will have to change to use bus_dmamap_load_bio().
It's not a big change, but it's disruptive both in the tree and out.
Your proposal also implies that CAM will have to start carrying BIO's
in CCBs and passing them to their SIMs.  I absolutely disagree with
this.  If we keep unneeded complications out of busdma, we avoid a lot
of churn.  We also leave the busdma interface available for other forms
of I/O without requiring more specific API additions to accommodate
them.  What about unmapped network i/o coming from something like
sendfile?

> However, I'm not really trying to get the bio stuff into the tree,
> this is mostly for the Nvidia case and for that use case the driver is
> simply creating simple single-entry lists and using
> sglist_append_phys().

Designing the whole API around a single driver that we can't even get
the source to makes it hard to evaluate the API.

Attilio and I have spoken about this in private and will begin work on
a prototype.  Here is the outline of what we're going to do:

1. Change struct sglist as follows:
   a. Uniform size.
   b. Determine an optimal number of elements to include in the size
      (waving my hands here, more research is needed).
   c. Chain, offset, and length pointers, very similar to how mbufs
      already work.
2. Expand the sglist API so that I/O producers can allocate slabs of
   sglists and slice them up into pools that they can manage and own.
3. Add an sglist field to struct bio, and add appropriate flags to
   identify VA vs sglist operation.
4. Extend the CAM_DATA_PHYS attributes in CAM to handle sglists.
5. Add bus_dmamap_load_sglist().  This will be able to walk chains and
   combine, split, and reassign segments as needed.
6. Modify a select number of drivers to use it.
7. Add a flag to disk->d_flags to signal whether a driver can handle
   sglists.  Have geom_dev look at this flag and generate a
   kmem_alloc_nofault+pmap_kenter sequence for drivers that can't
   support it.

In the end, no drivers will need to change, but the ones that do change
will obviously benefit.  We're going to prototype this with an i/o
source that starts unmapped (via the Xen blkback driver).  The downside
is that most GEOM transforms that need to touch the data won't work,
but that's something that can be addressed once the prototype is done
and evaluated.

Scott
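P.S.  To make the driver-side dispatch above concrete, here is a rough
sketch.  bus_dmamap_load_sglist(), the bio_sglist field, and everything
named foo_* are proposed or invented purely for illustration; only the
bus_dmamap_load() path is what drivers already do today.

#include <sys/param.h>
#include <sys/bio.h>
#include <machine/bus.h>
#include <sys/sglist.h>

struct foo_softc {
	bus_dma_tag_t	foo_dmat;
	bus_dmamap_t	foo_map;
};

static void
foo_dma_cb(void *arg, bus_dma_segment_t *segs, int nseg, int error)
{
	/* Program the controller's s/g entries from segs[] here. */
}

static int
foo_start(struct foo_softc *sc, struct bio *bp)
{
	/*
	 * bp->bio_sglist and bus_dmamap_load_sglist() are the proposed
	 * additions; the bus_dmamap_load() path below is unchanged from
	 * what storage drivers do now.
	 */
	if (bp->bio_sglist != NULL)
		return (bus_dmamap_load_sglist(sc->foo_dmat, sc->foo_map,
		    bp->bio_sglist, foo_dma_cb, bp, BUS_DMA_NOWAIT));
	return (bus_dmamap_load(sc->foo_dmat, sc->foo_map, bp->bio_data,
	    bp->bio_length, foo_dma_cb, bp, BUS_DMA_NOWAIT));
}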