Date: Wed, 27 Jul 2016 12:11:06 -0700
From: John Baldwin <jhb@freebsd.org>
To: src-committers@freebsd.org
Cc: svn-src-all@freebsd.org, svn-src-head@freebsd.org
Subject: Re: svn commit: r303405 - in head/sys/dev/cxgbe: . tom
Message-ID: <3422795.rot3cCl2OH@ralph.baldwin.cx>
In-Reply-To: <201607271829.u6RITZlx041710@repo.freebsd.org>
References: <201607271829.u6RITZlx041710@repo.freebsd.org>
On Wednesday, July 27, 2016 06:29:35 PM John Baldwin wrote:
> Author: jhb
> Date: Wed Jul 27 18:29:35 2016
> New Revision: 303405
> URL: https://svnweb.freebsd.org/changeset/base/303405
>
> Log:
>   Add support for zero-copy aio_write() on TOE sockets.
>
>   AIO write requests for a TOE socket on a Chelsio T4+ adapter can now
>   DMA directly from the user-supplied buffer.  This is implemented by
>   wiring the pages backing the user-supplied buffer and queueing special
>   mbufs backed by raw VM pages to the socket buffer.  The TOE code
>   recognizes these special mbufs and builds a sglist from the VM page
>   array associated with the mbuf when queueing a work request to the TOE.
>
>   Because these mbufs do not have an associated virtual address, m_data
>   is not valid.  Thus, the AIO handler does not invoke sosend() directly
>   for these mbufs but instead inlines portions of sosend_generic() and
>   tcp_usr_send().
>
>   An aiotx_buffer structure is used to describe the user buffer (e.g.
>   it holds the array of VM pages and a reference to the AIO job).  The
>   special mbufs reference this structure via m_ext.  Note that a single
>   job might be split across multiple mbufs (e.g. if it is larger than
>   the socket buffer size).  The 'ext_arg2' member of each mbuf gives an
>   offset relative to the backing aiotx_buffer.  The AIO job associated
>   with an aiotx_buffer structure is completed when the last reference to
>   the structure is released.
>
>   Zero-copy aio_write()'s for connections associated with a given
>   adapter can be enabled/disabled at runtime via the
>   'dev.t[45]nex.N.toe.tx_zcopy' sysctl.

In theory, if our stack were able to safely cope with unmapped buffers, this
could be used for aio_write() on sockets in general rather than only with
TOE engines (in particular, it is probably safe on adapters with checksum
offload enabled, assuming you aren't using some sort of packet filter that
wants to inspect packet payloads rather than just headers).

Compared to the changes for zero-copy receive with TOE, these are simpler
and also less dramatic in terms of performance.

For benchmarking I used netperf's TCP_STREAM (write(2) and aio_write(2))
and TCP_SENDFILE (sendfile(2)) tests, comparing CPU usage and throughput.

Without TOE, write(2) uses 138-158% of a single CPU, sendfile(2) uses
117-145%, and aio_write(2) uses 139-202% to send a single 40G stream.  At
netperf's default write size (32k), aio_write() and write() have the
largest divergence, but for the other write sizes I tested (256k, 512k, 1m)
they were within a few percentage points of each other.

Enabling TOE reduced CPU usage for all three (zero-copy aio_write() not
enabled in this case): write(2) 81-87%, sendfile(2) 58-69%, aio_write(2)
83-142%.  Again, aio_write() was comparable to write(2) for larger write
sizes but diverged for 32k writes.  One other thing to note, however, is
that the sendfile(2) test did not achieve the full 40G with 32k and 256k
write sizes.  The larger test sizes that did achieve 40G with sendfile(2)
used 66-69% CPU.

Zero-copy aio_write(2) with TOE used 23-54% of a single core.  As with
sendfile(2), the 32k write size did not achieve 40G; the remaining test
sizes that did used 23-24%.

There are still some wrinkles I need to iron out with the zero-copy case.
In certain edge cases I can force it to transmit at a dismally low 1G or so
instead of 40G, but only if socket buffer autosizing is enabled.
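(As an aside, to make the special-mbuf scheme in the log above concrete:
an aiotx_buffer and its m_ext free callback could look roughly like the
sketch below.  Only the names aiotx_buffer, ext_arg2, and the job
reference come from the log; every other name here is illustrative
shorthand, not necessarily the committed code.)

    /*
     * Rough sketch of the scheme described in the commit log.  Names
     * beyond aiotx_buffer/ext_arg2/the job pointer are assumptions.
     */
    #include <sys/param.h>
    #include <sys/aio.h>        /* kernel-side struct kaiocb, aio_complete() */
    #include <sys/malloc.h>
    #include <sys/mbuf.h>
    #include <machine/atomic.h>
    #include <vm/vm.h>
    #include <vm/vm_page.h>

    struct aiotx_buffer {
            vm_page_t       *pages;  /* user pages wired via, e.g.,
                                        vm_fault_quick_hold_pages() */
            int             npages;
            struct kaiocb   *job;    /* AIO job completed on last release */
            volatile u_int  refs;    /* one reference per queued mbuf */
    };

    /*
     * m_ext free callback (11.x-era three-argument form).  arg1 points
     * at the shared aiotx_buffer; arg2 carries this mbuf's byte offset
     * into it, per the log.  The job completes when the last mbuf
     * referencing the buffer is freed.
     */
    static void
    aiotx_free(struct mbuf *m, void *arg1, void *arg2)
    {
            struct aiotx_buffer *ab = arg1;

            if (atomic_fetchadd_int(&ab->refs, -1) == 1) {
                    vm_page_unhold_pages(ab->pages, ab->npages);
                    aio_complete(ab->job, ab->job->uaiocb.aio_nbytes, 0);
                    free(ab, M_TEMP);
            }
    }

On the transmit side, the driver can then build the work request's DMA
scatter/gather list straight from the wired page array (e.g. with
sglist_append_vmpages(9)), which is why no KVA mapping of the user buffer
is ever needed.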
The reason smaller write sizes don't get to line rate with
sendfile/aio_write (I think) is that non-zero-copy writes can accumulate
data from multiple calls into the socket buffer, allowing the socket
buffer autosizing code to notice and kick in sooner/faster.  One could
ameliorate this by queueing more writes (see the sketch in the P.S.
below), though queueing larger writes has the same effect and probably
less overhead.

-- 
John Baldwin
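P.S.  To make the "queue more writes" point concrete, a userland pump that
keeps several large aio_write(2)s outstanding on a socket (with the
adapter's tx_zcopy sysctl enabled, e.g. 'sysctl dev.t5nex.0.toe.tx_zcopy=1')
might look like the sketch below.  The queue depth and write size are
arbitrary picks for illustration, not values from the tests above.

    #include <aio.h>
    #include <err.h>
    #include <stdlib.h>
    #include <string.h>

    #define DEPTH   4               /* concurrent writes kept queued */
    #define WRITESZ (512 * 1024)    /* larger writes fared better above */

    static void
    pump(int sock)
    {
            static struct aiocb cbs[DEPTH];
            struct aiocb *done;

            /* Queue the initial batch of writes. */
            for (int i = 0; i < DEPTH; i++) {
                    memset(&cbs[i], 0, sizeof(cbs[i]));
                    cbs[i].aio_fildes = sock;
                    cbs[i].aio_buf = calloc(1, WRITESZ);
                    cbs[i].aio_nbytes = WRITESZ;
                    if (aio_write(&cbs[i]) != 0)
                            err(1, "aio_write");
            }
            for (;;) {
                    /* FreeBSD-specific: wait for any completed job,
                       then requeue it to keep the pipeline full. */
                    if (aio_waitcomplete(&done, NULL) < 0)
                            err(1, "aio_waitcomplete");
                    if (aio_write(done) != 0)
                            err(1, "aio_write");
            }
    }

Note that with zero-copy enabled the kernel DMAs straight from these
buffers while a write is in flight, so a real application must not modify
a buffer until its aio_write(2) has completed.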