From: Scott Long <scottl@samsco.org>
In-Reply-To: <3bbf2fe10911291429k54b4b7cfw9e40aefeca597307@mail.gmail.com>
Date: Sun, 29 Nov 2009 16:27:49 -0700
Message-Id: <66707B0F-D0AB-49DB-802F-13146F488E1A@samsco.org>
References: <200905191458.50764.jhb@freebsd.org> <200905201522.58501.jhb@freebsd.org> <3bbf2fe10911291429k54b4b7cfw9e40aefeca597307@mail.gmail.com>
To: John Baldwin
Cc: Attilio Rao, arch@freebsd.org
Subject: Re: sglist(9)

John,

Sorry for the late reply on this.  Attilio approached me recently about
moving busdma and storage to sglists; up until then I had largely
ignored this conversation because I thought that it was only about the
nvidia driver.

> On Wednesday 20 May 2009 2:49:30 pm Jeff Roberson wrote:
>> On Tue, 19 May 2009, John Baldwin wrote:
>>
>> 2) I worry that if all users do sglist_count() followed by a dynamic
>> allocation and then an _append() they will be very expensive.
>> pmap_kextract() is much more expensive than it may first seem to be.
>> Do you have a user of count already?
>
> The only one that does now is sglist_build() and nothing currently
> uses that.

Kinda silly to have it then.  I also don't see the point of it; if the
point of the sglist object is to avoid VA mappings, then why start with
a VA mapping?  But that aside, Jeff is correct, sglist_build() is
horribly inefficient.

> VOP_GET/PUTPAGES would not need to do this since they could simply
> append the physical addresses extracted directly from vm_page_t's for
> example.  I'm not sure this will be used very much now as originally I
> thought I would be changing all storage drivers to do all DMA
> operations using sglists and this sort of thing would have been used
> for non-bio requests like firmware commands; however, as expounded on
> below, it actually appears better to still treat bio's separate from
> non-bio requests for bus_dma so that the non-bio requests can continue
> to use bus_dmamap_load_buffer() as they do now.

I completely disagree with your approach to busdma, but I'll get into
that later.  What I really don't understand is, why have yet another
page description format?  Whether the I/O is disk buffers being pushed
by a pager to a disk controller, or texture buffers being pushed to
video hardware, they already have vm objects associated with them, no?
Why translate to an intermediate format?  I understand that you're
creating another vm object format to deal directly with the needs of
nvidia, but that's really a one-off case right now.  What about when we
want the pagedaemon to push unmapped i/o?  Will it have to spend cycles
translating its objects to sglists?  This is really a larger question
that I'm not prepared to answer, but would like to discuss.
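Just to make that cost concrete, the translation step I'm questioning
would look roughly like the loop below: one sglist_append_phys() call
per resident page, on top of whatever work the producer already did to
collect the pages.  This is only a sketch; the helper name is made up,
but sglist_append_phys() and VM_PAGE_TO_PHYS() are the interfaces
involved.

#include <sys/param.h>
#include <sys/sglist.h>
#include <vm/vm.h>
#include <vm/vm_page.h>

/*
 * Copy the physical addresses of an array of vm pages into an sglist.
 * Purely illustrative: this is the per-page translation cost, not code
 * from John's patch.
 */
static int
pages_to_sglist(vm_page_t *m, int npages, struct sglist *sg)
{
	int error, i;

	for (i = 0; i < npages; i++) {
		/* One append (and possible coalesce check) per page. */
		error = sglist_append_phys(sg, VM_PAGE_TO_PHYS(m[i]),
		    PAGE_SIZE);
		if (error != 0)
			return (error);
	}
	return (0);
}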
>> 3) Rather than having sg_segs be an actual pointer, did you consider
>> making it an unsized array?  This removes the overhead of one pointer
>> from the structure while enforcing that it's always contiguously
>> allocated.
>
> It's actually a feature to be able to have the header in separate
> storage from the segs array.  I use this in the jhb_bio branch in the
> bus_dma implementations where a pre-allocated segs array is stored in
> the bus dma tag and the header is allocated on the stack.

I'd like to take this one step further.  Instead of sglists being
exactly sized, I'd like to see them be much more like mbufs, with a
header and static storage, maybe somewhere between 128b and 1k in total
size.  Then they can be allocated and managed in pools, and chained
together to make for easy appending, splitting, and growing.  Offset
pointers can be stored in the header instead of externally.  Also,
there are a lot of failure points in the API regarding the sglist
object being too small.  Those need to be fixed.

>> 4) SGLIST_INIT might be better off as an inline, and may not even
>> belong in the header file.
>
> That may be true.  I currently only use it in the jhb_bio branch for
> the bus_dma implementations.
>
>> In general I think this is a good idea.  It'd be nice to work on
>> replacing the buf layer's implementation with something like this
>> that could be used directly by drivers.  Have you considered a busdma
>> operation to load from a sglist?
>
> So in regards to the bus_dma stuff, I did work on this a while ago in
> my jhb_bio branch.  I do have a bus_dmamap_load_sglist() and I had
> planned on using that in storage drivers directly.  However, I ended
> up circling back to preferring a bus_dmamap_load_bio() and adding a
> new 'bio_start' field to 'struct bio' that is an offset into an
> attached sglist.

I strongly disagree with forcing busdma to have intimate knowledge of
bio's.  All of the needed information can be stored in sglist headers.

> This let me carve up I/O requests in geom_dev to satisfy a disk
> device's max request size while still sharing the same read-only
> sglist across the various BIO's (by simply adjusting bio_length and
> bio_start to be a subrange of the sglist) as opposed to doing memory
> allocations to allocate specific ranges of an sglist (using something
> like sglist_slice()) for each I/O request.

I think this is fundamentally wrong.  You're proposing exchanging a
cheap operation of splitting VA's for an expensive operation of
allocating, splitting, copying, and refcounting sglists.  Splitting is
an excessively common operation, and your proposal will impact
performance as storage becomes exponentially faster.  We need to stop
thinking about maxio as a roadbump at the bottom of the storage stack,
and instead think of it as a fundamental attribute that is honored at
the top when a BIO is created.  Instead of loading up an sglist with
all of the pages (and don't forget coalesced pages that might need to
be broken up), maybe multiple bio's are created that honor maxio from
the start, or a single bio with a chained sglist, with each chain link
honoring maxio, allowing for easy splitting.
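To show the shape I have in mind, and purely as a sketch (none of these
names exist anywhere, and the element count is hand-waving), a
fixed-size, chainable sglist link might look like this:

#include <sys/param.h>
#include <vm/vm.h>

#define	SGL_LINK_NSEGS	16	/* tuned so one link is a few hundred bytes */

struct sgl_seg {
	vm_paddr_t	ss_paddr;	/* physical address of the segment */
	size_t		ss_len;		/* length of the segment in bytes */
};

struct sgl_link {
	struct sgl_link	*sl_next;	/* next link in the chain, or NULL */
	u_int		 sl_nsegs;	/* valid entries in sl_segs[] */
	size_t		 sl_offset;	/* byte offset into the first segment */
	size_t		 sl_length;	/* total bytes described by this link */
	struct sgl_seg	 sl_segs[SGL_LINK_NSEGS]; /* inline segment storage */
};

With something like that, honoring maxio becomes a matter of pointing a
consumer at a chain link, or adjusting sl_offset and sl_length, rather
than allocating, copying, and refcounting a fresh list, and fixed-size
links can be pooled the same way mbufs are.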
> I then have bus_dmamap_load_bio() use the subrange of the sglist
> internally or fall back to using the KVA pointer if the sglist isn't
> present.

I completely disagree.  Drivers already deal with the details of bio's,
and should continue to do so.  If a driver gets a bio that has a valid
bio_data pointer, it should call bus_dmamap_load().  If it gets one
with a valid sglist, it should call bus_dmamap_load_sglist() (roughly
sketched in the P.S. below).  Your proposal means that every storage
driver in the system will have to change to use bus_dmamap_load_bio().
It's not a big change, but it's disruptive both in the tree and out.
Your proposal also implies that CAM will have to start carrying BIO's
in CCBs and passing them to their SIMs.  I absolutely disagree with
this.  If we keep unneeded complications out of busdma, we avoid a lot
of churn.  We also leave the busdma interface available for other forms
of I/O without requiring more specific API additions to accommodate
them.  What about unmapped network i/o coming from something like
sendfile?

> However, I'm not really trying to get the bio stuff into the tree,
> this is mostly for the Nvidia case and for that use case the driver is
> simply creating simple single-entry lists and using
> sglist_append_phys().

Designing the whole API around a single driver that we can't even get
the source to makes it hard to evaluate the API.

Attilio and I have spoken about this in private and will begin work on
a prototype.  Here is the outline of what we're going to do:

1. Change struct sglist as follows:
   a. Uniform size.
   b. Determine an optimal number of elements to include in the size
      (waving my hands here, more research is needed).
   c. Chain, offset, and length pointers, very similar to how mbufs
      already work.
2. Expand the sglist API so that I/O producers can allocate slabs of
   sglists and slice them up into pools that they can manage and own.
3. Add an sglist field to struct bio, and add appropriate flags to
   identify VA vs sglist operation.
4. Extend the CAM_DATA_PHYS attributes in CAM to handle sglists.
5. Add bus_dmamap_load_sglist().  This will be able to walk chains and
   combine, split, and reassign segments as needed.
6. Modify a select number of drivers to use it.
7. Add a flag to disk->d_flags to signal whether a driver can handle
   sglists.  Have geom_dev look at this flag and generate a
   kmem_alloc_nofault+pmap_kenter sequence for drivers that can't
   support it.

In the end, no drivers will need to change, but the ones that do change
will obviously benefit.  We're going to prototype this with an i/o
source that starts unmapped (via the Xen blkback driver).  The downside
is that most GEOM transforms that need to touch the data won't work,
but that's something that can be addressed once the prototype is done
and evaluated.

Scott
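P.S.  To make the driver-side dispatch above concrete, here is a rough
sketch.  bus_dmamap_load_sglist(), the bio_sglist field, and everything
named foo_* are proposed or invented purely for illustration; only the
bus_dmamap_load() path is what drivers already do today.

#include <sys/param.h>
#include <sys/bio.h>
#include <machine/bus.h>
#include <sys/sglist.h>

struct foo_softc {
	bus_dma_tag_t	foo_dmat;
	bus_dmamap_t	foo_map;
};

static void
foo_dma_cb(void *arg, bus_dma_segment_t *segs, int nseg, int error)
{
	/* Program the controller's s/g entries from segs[] here. */
}

static int
foo_start(struct foo_softc *sc, struct bio *bp)
{
	/*
	 * bp->bio_sglist and bus_dmamap_load_sglist() are the proposed
	 * additions; the bus_dmamap_load() path below is unchanged from
	 * what storage drivers do now.
	 */
	if (bp->bio_sglist != NULL)
		return (bus_dmamap_load_sglist(sc->foo_dmat, sc->foo_map,
		    bp->bio_sglist, foo_dma_cb, bp, BUS_DMA_NOWAIT));
	return (bus_dmamap_load(sc->foo_dmat, sc->foo_map, bp->bio_data,
	    bp->bio_length, foo_dma_cb, bp, BUS_DMA_NOWAIT));
}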