Date:      Sat, 7 May 2016 10:56:58 -0700
From:      Navdeep Parhar <np@FreeBSD.org>
To:        John Baldwin <jhb@freebsd.org>
Cc:        src-committers@freebsd.org, svn-src-all@freebsd.org, svn-src-head@freebsd.org
Subject:   Re: svn commit: r299210 - in head/sys/dev/cxgbe: . tom
Message-ID:  <20160507175658.GA4513@ox>
In-Reply-To: <3138889.ZBJ52FyIMB@ralph.baldwin.cx>
References:  <201605070033.u470XZCs075568@repo.freebsd.org> <3138889.ZBJ52FyIMB@ralph.baldwin.cx>

On Fri, May 06, 2016 at 05:52:15PM -0700, John Baldwin wrote:
> On Saturday, May 07, 2016 12:33:35 AM John Baldwin wrote:
> > Author: jhb
> > Date: Sat May  7 00:33:35 2016
> > New Revision: 299210
> > URL: https://svnweb.freebsd.org/changeset/base/299210
> > 
> > Log:
> >   Use DDP to implement zerocopy TCP receive with aio_read().
> >   
> >   Chelsio's TCP offload engine supports direct DMA of received TCP payload
> >   into wired user buffers.  This feature is known as Direct-Data Placement.
> >   However, to scale well the adapter needs to prepare buffers for DDP
> >   before data arrives.  aio_read() is more amenable to this requirement than
> >   read() as applications often call read() only after data is available in
> >   the socket buffer.
> >   
> >   When DDP is enabled, TOE sockets use the recently added pru_aio_queue
> >   protocol hook to claim aio_read(2) requests instead of letting them use
> >   the default AIO socket logic.  The DDP feature supports scheduling DMA
> >   to two buffers at a time so that the second buffer is ready for use
> >   after the first buffer is filled.  The aio/DDP code optimizes the case
> >   of an application ping-ponging between two buffers (similar to the
> >   zero-copy bpf(4) code) by keeping the two most recently used AIO buffers
> >   wired.  If a buffer is reused, the aio/DDP code is able to reuse the
> >   vm_page_t array as well as page pod mappings (a kind of MMU mapping the
> >   Chelsio NIC uses to describe user buffers).  The generation of the
> >   vmspace of the calling process is used in conjunction with the user
> >   buffer's address and length to determine if a user buffer matches a
> >   previously used buffer.  If an application queues a buffer for AIO that
> >   does not match a previously used buffer then the least recently used
> >   buffer is unwired before the new buffer is wired.  This ensures that no
> >   more than two user buffers per socket are ever wired.
> >   
> >   Note that this feature is best suited to applications sending a steady
> >   stream of data vs short bursts of traffic.
> >   
> >   Discussed with:	np
> >   Relnotes:	yes
> >   Sponsored by:	Chelsio Communications
> 
> The primary tool I used for evaluating performance was netperf's TCP stream
> test.  It is a best case for this (constant stream of traffic), but that is
> also the intended use case for this feature.
> 
> Using two 64K buffers in a ping-pong via aio_read() to receive a 40Gbps stream
> used about two full CPUs (~190% CPU usage) on a single-package
> Intel E5-1620 v3 @ 3.50GHz with the stock TCP stack.  Enabling TOE brings the
> usage down to about 110% CPU.  With DDP, the usage is around 30% of a single
> CPU.  With two 1MB buffers the stock and TOE numbers are about the same,
> but the DDP usage is about 5% of a single CPU.
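
For readers skimming the log above, the buffer-matching rule it describes
(match on vmspace generation + address + length, keep at most two buffers
wired, evict the least recently used one on a miss) can be sketched roughly
as below.  This is only an illustration, not the cxgbe/tom code: every
struct and function name is hypothetical, and wiring/unwiring is reduced to
a flag.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical key: what the log says is compared to detect buffer reuse. */
struct buf_key {
	uint64_t  vm_gen;	/* generation of the caller's vmspace */
	uintptr_t addr;		/* user buffer start address */
	size_t    len;		/* user buffer length */
};

struct buf_slot {
	struct buf_key key;
	int	 wired;		/* nonzero while this slot holds a buffer */
	unsigned last_use;	/* logical clock used to pick the LRU slot */
};

struct buf_cache {
	struct buf_slot slot[2];	/* never more than two wired buffers */
	unsigned clock;
};

static int
key_equal(const struct buf_key *a, const struct buf_key *b)
{
	return (a->vm_gen == b->vm_gen && a->addr == b->addr &&
	    a->len == b->len);
}

/*
 * Look up a buffer.  On a hit, *reused is 1 and the slot's cached state
 * (page array, page pods in the real driver) could be reused.  On a miss,
 * *reused is 0 and the empty or least recently used slot is taken over.
 */
static int
cache_get(struct buf_cache *c, const struct buf_key *k, int *reused)
{
	int i, victim;

	assert(c != NULL && k != NULL && reused != NULL);
	for (i = 0; i < 2; i++) {
		if (c->slot[i].wired && key_equal(&c->slot[i].key, k)) {
			c->slot[i].last_use = ++c->clock;
			*reused = 1;
			return (i);
		}
	}
	if (!c->slot[0].wired)
		victim = 0;
	else if (!c->slot[1].wired)
		victim = 1;
	else
		victim = c->slot[0].last_use <= c->slot[1].last_use ? 0 : 1;
	/* A real driver would unwire the victim here, then wire the new one. */
	c->slot[victim].key = *k;
	c->slot[victim].wired = 1;
	c->slot[victim].last_use = ++c->clock;
	*reused = 0;
	return (victim);
}
```

An application that alternates between the same two buffers hits on every
request after the first two; a third distinct buffer, or the same address
after the address space generation changes, is treated as new.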

5% of a single core on modern systems (with 4+ cores) means top/vmstat
will report around 1% aggregate CPU use or less while receiving full
40Gbps line rate @ 1500 MTU.
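
As a quick check of that arithmetic, assuming the tool simply averages busy
time over all cores (the helper name is made up for this note):

```c
#include <assert.h>

/*
 * Aggregate CPU percentage when one core is busy single_core_pct percent
 * of the time and the remaining cores are idle.  Assumes a plain average
 * over ncores, which is an approximation of what top/vmstat display.
 */
static double
aggregate_cpu_pct(double single_core_pct, int ncores)
{
	assert(ncores > 0);
	return (single_core_pct / (double)ncores);
}
```

With 5% of one core this gives 1.25% on 4 cores and 0.625% on 8, which is
where "around 1% or less" comes from.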

The idea here is to let applications written against standard BSD
sockets and POSIX AIO APIs make full use of hardware TCP zero copy
features when available.  Zero copy on the transmit side will also be
implemented (it's simpler than the receive side) in time for FreeBSD 11.
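
To make the application side concrete, here is a minimal sketch of the
two-buffer ping-pong pattern using only POSIX AIO calls.  Nothing in it is
Chelsio specific; per the commit log, whether a request takes the DDP path
or the default AIO socket path is decided by the kernel.  The function
names, buffer sizes and the socketpair used as a stand-in for a TCP
connection are assumptions for illustration, and error handling is minimal.

```c
#include <sys/types.h>
#include <sys/socket.h>
#include <aio.h>
#include <assert.h>
#include <errno.h>
#include <signal.h>
#include <string.h>
#include <unistd.h>

#define NBUF 2	/* two requests in flight, as in the ping-pong case */

/* Block until one request finishes; returns its aio_error() status. */
static int
wait_for(struct aiocb *cb)
{
	const struct aiocb *list[1] = { cb };
	int err;

	while ((err = aio_error(cb)) == EINPROGRESS)
		(void)aio_suspend(list, 1, NULL);
	return (err);
}

/* Queue a read; the same address and length are reused on every pass. */
static int
submit(struct aiocb *cb, int fd, char *buf, size_t len)
{
	memset(cb, 0, sizeof(*cb));
	cb->aio_fildes = fd;
	cb->aio_buf = buf;
	cb->aio_nbytes = len;
	cb->aio_sigevent.sigev_notify = SIGEV_NONE;	/* we use aio_suspend */
	return (aio_read(cb));
}

/*
 * Read fd until EOF with NBUF aio_read() requests kept queued.  Returns the
 * byte count, or -1 on error.  Assumes requests on a stream socket complete
 * in submission order.
 */
static long long
pingpong_recv(int fd, char *bufs[NBUF], size_t len)
{
	struct aiocb cb[NBUF];
	int queued[NBUF] = { 0 };
	long long total = 0;
	int cur = 0, failed = 0, i;
	ssize_t n;

	assert(len > 0);	/* a zero-length read would look like EOF */
	for (i = 0; i < NBUF && !failed; i++) {
		queued[i] = (submit(&cb[i], fd, bufs[i], len) == 0);
		failed = !queued[i];
	}
	while (!failed) {
		failed = (wait_for(&cb[cur]) != 0);
		n = aio_return(&cb[cur]);
		queued[cur] = 0;
		if (failed || n <= 0) {
			failed = failed || n < 0;
			break;
		}
		total += n;	/* bufs[cur][0..n) would be consumed here */
		queued[cur] = (submit(&cb[cur], fd, bufs[cur], len) == 0);
		failed = !queued[cur];
		cur = (cur + 1) % NBUF;
	}
	/* Never return while a request still references cb[] on our stack. */
	for (i = 0; i < NBUF; i++) {
		if (!queued[i])
			continue;
		(void)aio_cancel(fd, &cb[i]);
		(void)wait_for(&cb[i]);
		(void)aio_return(&cb[i]);
	}
	return (failed ? -1 : total);
}

/* Local stand-in for a TCP connection: 21 bytes through a socketpair. */
static long long
pingpong_demo(void)
{
	static char b0[8], b1[8];
	char *bufs[NBUF] = { b0, b1 };
	const char msg[] = "steady stream of data";
	long long total = -1;
	int sv[2];

	if (socketpair(AF_UNIX, SOCK_STREAM, 0, sv) != 0)
		return (-1);
	if (write(sv[1], msg, sizeof(msg) - 1) == (ssize_t)(sizeof(msg) - 1)) {
		shutdown(sv[1], SHUT_WR);	/* reader sees EOF after the data */
		total = pingpong_recv(sv[0], bufs, sizeof(b0));
	}
	close(sv[0]);
	close(sv[1]);
	return (total);
}
```

A real receiver would use much larger buffers (64K or 1MB as in the numbers
quoted above) and keep them for the life of the connection, since reusing
the same address and length is what lets a previously wired buffer be
matched.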

Regards,
Navdeep


