Date:      Wed, 27 Jul 2016 12:11:06 -0700
From:      John Baldwin <jhb@freebsd.org>
To:        src-committers@freebsd.org
Cc:        svn-src-all@freebsd.org, svn-src-head@freebsd.org
Subject:   Re: svn commit: r303405 - in head/sys/dev/cxgbe: . tom
Message-ID:  <3422795.rot3cCl2OH@ralph.baldwin.cx>
In-Reply-To: <201607271829.u6RITZlx041710@repo.freebsd.org>
References:  <201607271829.u6RITZlx041710@repo.freebsd.org>

On Wednesday, July 27, 2016 06:29:35 PM John Baldwin wrote:
> Author: jhb
> Date: Wed Jul 27 18:29:35 2016
> New Revision: 303405
> URL: https://svnweb.freebsd.org/changeset/base/303405
> 
> Log:
>   Add support for zero-copy aio_write() on TOE sockets.
>   
>   AIO write requests for a TOE socket on a Chelsio T4+ adapter can now
>   DMA directly from the user-supplied buffer.  This is implemented by
>   wiring the pages backing the user-supplied buffer and queueing special
>   mbufs backed by raw VM pages to the socket buffer.  The TOE code
>   recognizes these special mbufs and builds a sglist from the VM page
>   array associated with the mbuf when queueing a work request to the TOE.
>   
>   Because these mbufs do not have an associated virtual address, m_data
>   is not valid.  Thus, the AIO handler does not invoke sosend() directly
>   for these mbufs but instead inlines portions of sosend_generic() and
>   tcp_usr_send().
>   
>   An aiotx_buffer structure is used to describe the user buffer (e.g.
>   it holds the array of VM pages and a reference to the AIO job).  The
>   special mbufs reference this structure via m_ext.  Note that a single
>   job might be split across multiple mbufs (e.g. if it is larger than
>   the socket buffer size).  The 'ext_arg2' member of each mbuf gives an
>   offset relative to the backing aiotx_buffer.  The AIO job associated
>   with an aiotx_buffer structure is completed when the last reference to
>   the structure is released.
>   
>   Zero-copy aio_write()'s for connections associated with a given
>   adapter can be enabled/disabled at runtime via the
>   'dev.t[45]nex.N.toe.tx_zcopy' sysctl.
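
For anyone curious, the arrangement described in the log is roughly the
following shape (the field names below are illustrative, not the exact
definitions in the tree):

#include <sys/param.h>
#include <vm/vm.h>

struct kaiocb;				/* kernel-side AIO request */

/* Illustrative stand-in for the aiotx_buffer described above. */
struct aiotx_buffer_sketch {
	vm_page_t	*pages;		/* wired pages backing the user buffer */
	int		npages;
	int		pgoff;		/* offset of the data within pages[0] */
	size_t		len;		/* total length of the aio_write() */
	struct kaiocb	*job;		/* completed when the last ref is dropped */
	volatile u_int	refs;		/* one reference per mbuf carved from it */
};

/*
 * Each special mbuf is an M_EXT mbuf with no usable m_data.  Its m_ext
 * points back at the aiotx_buffer, with ext_arg2 holding that mbuf's byte
 * offset within the buffer, so the TOE code can build an sglist from
 * pages[] covering just the chunk the mbuf represents.
 */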

In theory, if our stack were able to safely cope with unmapped buffers, this
could be used for aio_write() on sockets in general rather than only with TOE
engines.  In particular, it is probably safe on adapters with checksum offload
enabled, assuming you aren't using some sort of packet filter that wants to
inspect packet payloads rather than just headers.

Compared to the changes for zero-copy receive with TOE, these are simpler and
also less dramatic in terms of performance.  For benchmarking I used
netperf's TCP_STREAM (write(2) and aio_write(2)) and TCP_SENDFILE
(sendfile(2)) tests, comparing CPU usage and throughput.
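
For the aio_write(2) case, each transmit boils down to roughly the following
(a hand-rolled illustration of the path being exercised, not the actual
benchmark code):

#include <aio.h>
#include <err.h>
#include <string.h>

/* Illustrative only: issue one aio_write() on a socket and wait for it. */
static void
send_one(int sock, char *buf, size_t len)
{
	struct aiocb cb, *done;

	memset(&cb, 0, sizeof(cb));
	cb.aio_fildes = sock;
	cb.aio_buf = buf;
	cb.aio_nbytes = len;
	if (aio_write(&cb) != 0)
		err(1, "aio_write");
	/* FreeBSD-specific: block until a queued request completes. */
	if (aio_waitcomplete(&done, NULL) < 0)
		err(1, "aio_waitcomplete");
}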

Without TOE, write(2) uses 138-158% of a single CPU, sendfile(2) uses
117-145%, and aio_write(2) uses 139-202% to send a single 40G stream.  At the
default write size (32k) for netperf, aio_write() and write() have the
largest divergence, but for other write sizes I tested (256k, 512k, 1m) they
were within a few percentage points of each other.

Enabling TOE reduced CPU usage for all three (in this case zero-copy
aio_write() is not enabled): write(2) - 81-87%, sendfile(2) - 58-69%,
aio_write(2) - 83-142%.  Again, aio_write() was comparable for larger write
sizes, but diverged from write(2) for 32k writes.  One other thing to note,
however, is that the sendfile(2) test did not achieve full 40G with 32k and
256k write sizes.  The larger test sizes that did achieve 40G with
sendfile(2) used 66-69% CPU.

The zero-copy aio_write(2) with TOE used 23-54% of a single core.  As with
sendfile(2), the 32k write size did not achieve 40G.  The remaining test
sizes that did reach 40G used 23-24%.

There are still some wrinkles I need to iron out with the zero-copy case.  In
certain edge cases I can force it to transmit a dismally low 1G or so instead
of 40G, but only if socket buffer autosizing is enabled.

The reason smaller write sizes don't get to line rate with sendfile/aio_write
is (I think) that non-zero-copy writes can accumulate data from multiple calls
into the socket buffer, allowing the socket buffer autosizing code to notice
and kick in sooner.  One could ameliorate this by queueing more writes, though
queueing larger writes has the same effect and probably less overhead.
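
By "queueing more writes" I just mean keeping several requests in flight at
once, e.g. something along these lines (buffer carving and error handling
simplified):

#include <aio.h>
#include <err.h>
#include <string.h>

#define	NREQ	4

/* Illustrative only: keep NREQ aio_write()s outstanding on one socket. */
static void
queue_writes(int sock, char *buf, size_t chunk)
{
	struct aiocb cb[NREQ], *list[NREQ];
	int i;

	memset(cb, 0, sizeof(cb));
	for (i = 0; i < NREQ; i++) {
		cb[i].aio_fildes = sock;
		cb[i].aio_buf = buf + (size_t)i * chunk;
		cb[i].aio_nbytes = chunk;
		cb[i].aio_lio_opcode = LIO_WRITE;
		list[i] = &cb[i];
	}
	/* LIO_WAIT: return once all of the writes have completed. */
	if (lio_listio(LIO_WAIT, list, NREQ, NULL) != 0)
		err(1, "lio_listio");
}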

-- 
John Baldwin


