From: John Baldwin <jhb@freebsd.org>
To: arch@freebsd.org
Subject: Refactoring asynchronous I/O
Date: Tue, 26 Jan 2016 17:39:03 -0800
Message-ID: <2793494.0Z1kBV82mT@ralph.baldwin.cx>
You may have noticed some small cleanups and fixes going into the AIO code recently. I have been working on a minor overhaul of the AIO code, and those recent changes were aimed at reducing the diff down to the truly meaty changes so they are easier to review. I think things are now far enough along to start on the meaty bits.

The current AIO code is a bit creaky and not very extensible. It forces all requests down via the existing fo_read/fo_write file ops, so all requests are inherently synchronous even if the underlying file descriptor could support async operation. This also makes cancellation more fragile, as you can't cancel a job that is stuck sleeping in one of the AIO daemons.

The original motivation for my changes is to support efficient zero-copy receive for TOE using Chelsio T4/T5 adapters. However, read() is ill suited to this type of workflow. Various efforts in the past have tried either page flipping (the old ZERO_COPY_SOCKETS code, which required custom ti(4) firmware), which only works if you can get things lined up "just right" (e.g. page-aligned and page-sized buffers, custom firmware on your NIC, etc.), or introducing new APIs that replace read/write (such as IO-Lite). The primary issue with read(), of course, is that the NIC DMAs data to one place, and only later does userland come along and tell the kernel where it wants the data. The issue with introducing a new API like IO-Lite is convincing software to use it.

However, aio_read() is an existing API that can be used to queue user buffers in advance. In particular, you can use two buffers to ping-pong, similar to the BPF zero-copy code, where you queue two buffers at the start and requeue each completed buffer after consuming its data.
In theory the Chelsio driver can "peek" into the request queue for a socket and schedule the pending requests for zero-copy receive. However, doing that requires a way for the driver to "claim" certain requests and to support cancelling them, completing them, etc. To facilitate this use case I decided to rework the AIO code to use a model closer to the I/O Request Packets (IRPs) that Windows drivers use. In particular, when a Windows driver decides to queue a request so that it can be completed later, it has to install a cancel routine that is responsible for cancelling the queued request.

To this end, I have reworked the AIO code as follows:

1) File descriptor types are now the "drivers" for AIO requests rather than the AIO code itself. When an AIO request for an fd is queued (via aio_read/aio_write, etc.), a new file op (fo_aio_queue()) is called to queue or handle the request. This method is responsible for completing the request or queueing it to be completed later. Currently a default implementation of this method is provided which queues the job to the existing AIO daemon pool for fo_read/fo_write, but file types can override it with more specific behavior if desired.

2) The AIO code is now a library of calls for manipulating AIO requests. Functions are provided to manage cancel routines, to mark AIO requests as cancelled or completed, and to schedule handler functions to run in an AIO daemon context.

3) Operations that choose to queue an AIO request while waiting for a suitable resource to service it (CPU time, data to arrive on a socket, etc.) are required to install a cancel routine to handle cancellation of the request due to aio_cancel() or the exit or exec of the owning process. This allows the type-specific queueing logic to be self-contained, without the AIO code having to know about all the possible queue states of an AIO request.

In my current implementation I use the "default" fo_aio_queue method for most file types.
However, sockets now use a private pool of AIO kprocs, and those kprocs service sockets rather than individual jobs. This means that when a socket becomes ready for either read or write, a task for that socket buffer is queued to the socket AIO daemon pool. That task completes as many requests as possible for that socket buffer (ensuring that there are no concurrent AIO operations on a given socket). It is also able to use MSG_NOWAIT to avoid blocking, even for blocking sockets.

One thing I have not yet done is move the physio fast path out of vfs_aio.c and into the devfs-specific fileops, but that can easily be done with a custom fo_aio_queue op for the devfs file ops.

I believe this approach also permits other file types to provide more suitable AIO handling where appropriate. For the Chelsio use case I have added a protocol hook to allow a given protocol to claim AIO requests instead of letting them be handled by the generic socket AIO fileop. This ends up being a very small change, and the Chelsio-specific logic can live in the TOM module, using the AIO library calls to service the AIO requests.

My current WIP (not including the Chelsio bits; they need to be forward-ported from an earlier prototype) is available on the 'aio_rework' branch of freebsd in my github space:

https://github.com/freebsd/freebsd/compare/master...bsdjhb:aio_rework

Note that binding the AIO support to a new fileop does mean that the AIO code now becomes mandatory (rather than optional). We could perhaps keep the system calls optional if people really need that, but the guts of the code will now always need to be in the kernel.

I'd like to hear what people think of this design. It needs some additional cleanup before it is a commit candidate (and I'll see if I can't split it up some more if we go this route).

-- 
John Baldwin