Date: Mon, 13 Apr 2009 15:00:40 -0400
From: gnn@freebsd.org
To: Zachary Loafman <zachary.loafman@isilon.com>
Cc: freebsd-arch@freebsd.org
Subject: Re: splice() in FreeBSD
Message-ID: <7iskkcgyzr.wl%gnn@neville-neil.com>
In-Reply-To: <20090409171613.GC9442@isilon.com>
References: <20090409171613.GC9442@isilon.com>

At Thu, 9 Apr 2009 10:16:13 -0700,
Zachary Loafman wrote:
>
> Arch -
>
> Isilon has internally been using the FreeBSD sendfile() (with
> modifications) and our own recvfile() in order to accomplish zero-copy
> read/write for the userland portions of our stack (CIFS,
> NDMP). However, these interfaces are limited. In particular,
> sendfile/recvfile prevent any other thread from dealing with the same
> socket until the call is complete. That's somewhat silly - it would be
> nicer to split the read-from-file/write-to-file portion from the
> read-from-socket/write-to-socket portion. That also eases some of the
> decisions that only the layer above can really make - for example, in
> the sendfile() case, you don't really know if it's appropriate to send
> a partial read or whether the caller really needs all the data.
>
> What we'd like is something like splice(). The Linux splice interface
> is documented here: http://linux.die.net/man/2/splice and the
> internals are discussed here: http://kerneltrap.org/node/6505 . We
> don't need the sillier portions of it - Isilon could care less about
> vmsplice()/tee(). We need the ability to shuffle data from one source
> to one sink, and then to turn around later and use that sink as a
> source. At first, I found the splice() interface a bit of an
> abomination, but a pipe is a somewhat natural place to act as a data
> staging area. If we just implemented splice alone, this wouldn't
> require any real VM hackery - you can imagine just shuffling mbufs
> through the pipe to accomplish a limited form of this (or, say, a unix
> domain socket).
>
> As part of this, and in order to get something upstreamable, it seems
> like we would need a few things:
>
> *) Agreement on syscall APIs - My initial proposal is to adopt splice
> verbatim. Initially the interface may not be truly zero-copy for many
> cases, but it's a start. It also increases portability for any Linux
> apps that are trying to make use of it.
>
> *) Unification of uio and mbufs somehow? Isilon currently has private
> patches that add *_MBUF variants for I/O VOPs (e.g. we have a
> VOP_READ_MBUF in addition to the standard VOP_READ). Isilon is in a
> somewhat unique place here - I'm not sure a general file system can
> handle this as easily. At the top-half, our system in many ways acts a
> lot like a router, so we can handle things like VOP_READ_MBUF by
> taking file data off our back-end (which comes in as mbufs off IB),
> header splitting, then just slinging the mbufs out the
> front-end. However, I think our *_MBUF VOP variants are actually a
> little gross. I would rather figure out a way to unify the uio and
> mbuf APIs - they're both scatter/gather lists in their own special
> way, then call into a single VOP.
>
> Isilon can get a limited, non-upstreamable thing working fairly
> quickly - we can use a unix domain socket as the intermediate buffer
> and use our existing *_MBUF VOPs. But it would be nice if we had some
> consensus going forward, then we can internally march towards
> something we can upstream.
>

I like the idea, though I don't know if I like the name "splice"
because to me it's a bit confusing, but we're probably stuck with the
name since it's already in use.

If/when you have patches send them along next.

Best,
George
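
For readers who have not used the Linux interface discussed in the quoted
message, here is a minimal sketch of the "pipe as a data staging area"
pattern: data is spliced from a source descriptor into a pipe, and the pipe
is then used as the source for a second splice() into the sink. It assumes
Linux with glibc and uses only the documented splice(2) call. The helper
names splice_copy() and splice_demo() are hypothetical and appear nowhere
in the thread, and nothing here reflects an actual or proposed FreeBSD
implementation.

```c
#define _GNU_SOURCE          /* exposes splice() in <fcntl.h> on Linux/glibc */
#include <assert.h>
#include <errno.h>
#include <fcntl.h>
#include <stdio.h>
#include <string.h>
#include <unistd.h>

/*
 * Hypothetical helper: move up to len bytes from in_fd to out_fd, staging
 * the data in a pipe. Returns the number of bytes moved, or -1 on error.
 */
ssize_t splice_copy(int in_fd, int out_fd, size_t len)
{
    int p[2];
    ssize_t total = 0;

    if (pipe(p) < 0)
        return -1;

    while ((size_t)total < len) {
        /* Stage 1: source -> pipe (the pipe is the sink). */
        ssize_t n = splice(in_fd, NULL, p[1], NULL,
                           len - (size_t)total, SPLICE_F_MOVE);
        if (n < 0) {
            if (errno == EINTR)
                continue;
            total = -1;
            break;
        }
        if (n == 0)              /* end of input */
            break;

        /* Stage 2: pipe -> sink (the pipe is now the source). */
        ssize_t left = n;
        while (left > 0) {
            ssize_t m = splice(p[0], NULL, out_fd, NULL,
                               (size_t)left, SPLICE_F_MOVE);
            if (m < 0 && errno == EINTR)
                continue;
            if (m <= 0) {
                total = -1;
                goto done;
            }
            left -= m;
        }
        total += n;
    }
done:
    close(p[0]);
    close(p[1]);
    return total;
}

/* Hypothetical demo: copy a short message between two temporary files. */
int splice_demo(void)
{
    const char msg[] = "staged through a pipe";
    const size_t len = sizeof msg - 1;
    char buf[sizeof msg] = {0};
    FILE *src = tmpfile();
    FILE *dst = tmpfile();
    int rc = -1;

    assert(src != NULL && dst != NULL);
    if (fwrite(msg, 1, len, src) == len && fflush(src) == 0 &&
        lseek(fileno(src), 0, SEEK_SET) == 0 &&
        splice_copy(fileno(src), fileno(dst), len) == (ssize_t)len &&
        pread(fileno(dst), buf, len, 0) == (ssize_t)len &&
        memcmp(buf, msg, len) == 0)
        rc = 0;

    fclose(src);
    fclose(dst);
    return rc;
}
```

Because the two stages are separate calls, a caller could hold the data in
the pipe and decide later whether to forward it, which is the decoupling
the quoted message asks for. Whether any copy is actually avoided depends
on the kernel and the descriptor types involved.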