Date: Mon, 9 May 2016 22:31:37 +0300
From: Slawa Olhovchenkov <slw@zxy.spb.ru>
To: John Baldwin
Cc: src-committers@freebsd.org, svn-src-head@freebsd.org, svn-src-all@freebsd.org
Subject: Re: svn commit: r299210 - in head/sys/dev/cxgbe: . tom
Message-ID: <20160509193137.GH1447@zxy.spb.ru>
References: <201605070033.u470XZCs075568@repo.freebsd.org> <3833131.rOKpC7i1Gu@ralph.baldwin.cx> <20160509185719.GG1447@zxy.spb.ru> <2354770.oBAqoHF8jb@ralph.baldwin.cx>
In-Reply-To: <2354770.oBAqoHF8jb@ralph.baldwin.cx>

On Mon, May 09, 2016 at 12:03:22PM -0700, John Baldwin wrote:

> On Monday, May 09, 2016 09:57:19 PM Slawa Olhovchenkov wrote:
> > On Mon, May 09, 2016 at 10:49:30AM -0700, John Baldwin wrote:
> > 
> > > On Saturday, May 07, 2016 04:44:51 PM Slawa Olhovchenkov wrote:
> > > > On Fri, May 06, 2016 at 05:52:15PM -0700, John Baldwin wrote:
> > > > 
> > > > > On Saturday, May 07, 2016 12:33:35 AM John Baldwin wrote:
> > > > > > Author: jhb
> > > > > > Date: Sat May  7 00:33:35 2016
> > > > > > New Revision: 299210
> > > > > > URL: https://svnweb.freebsd.org/changeset/base/299210
> > > > > > 
> > > > > > Log:
> > > > > >   Use DDP to implement zerocopy TCP receive with aio_read().
> > > > > > 
> > > > > >   Chelsio's TCP offload engine supports direct DMA of received TCP payload
> > > > > >   into wired user buffers.  This feature is known as Direct-Data Placement.
> > > > > >   However, to scale well the adapter needs to prepare buffers for DDP
> > > > > >   before data arrives.  aio_read() is more amenable to this requirement than
> > > > > >   read() as applications often call read() only after data is available in
> > > > > >   the socket buffer.
> > > > > > 
> > > > > >   When DDP is enabled, TOE sockets use the recently added pru_aio_queue
> > > > > >   protocol hook to claim aio_read(2) requests instead of letting them use
> > > > > >   the default AIO socket logic.  The DDP feature supports scheduling DMA
> > > > > >   to two buffers at a time so that the second buffer is ready for use
> > > > > >   after the first buffer is filled.  The aio/DDP code optimizes the case
> > > > > >   of an application ping-ponging between two buffers (similar to the
> > > > > >   zero-copy bpf(4) code) by keeping the two most recently used AIO buffers
> > > > > >   wired.  If a buffer is reused, the aio/DDP code is able to reuse the
> > > > > >   vm_page_t array as well as page pod mappings (a kind of MMU mapping the
> > > > > >   Chelsio NIC uses to describe user buffers).  The generation of the
> > > > > >   vmspace of the calling process is used in conjunction with the user
> > > > > >   buffer's address and length to determine if a user buffer matches a
> > > > > >   previously used buffer.  If an application queues a buffer for AIO that
> > > > > >   does not match a previously used buffer then the least recently used
> > > > > >   buffer is unwired before the new buffer is wired.  This ensures that no
> > > > > >   more than two user buffers per socket are ever wired.
> > > > > > 
> > > > > >   Note that this feature is best suited to applications sending a steady
> > > > > >   stream of data vs short bursts of traffic.
> > > > > > 
> > > > > >   Discussed with:	np
> > > > > >   Relnotes:	yes
> > > > > >   Sponsored by:	Chelsio Communications
> > > > > 
> > > > > The primary tool I used for evaluating performance was netperf's TCP stream
> > > > > test.  It is a best case for this (constant stream of traffic), but that is
> > > > > also the intended use case for this feature.
> > > > > 
> > > > > Using 2 64K buffers in a ping-pong via aio_read() to receive a 40Gbps stream
> > > > > used about two full CPUs (~190% CPU usage) on a single-package
> > > > > Intel E5-1620 v3 @ 3.50GHz with the stock TCP stack.  Enabling TOE brings the
> > > > > usage down to about 110% CPU.  With DDP, the usage is around 30% of a single
> > > > > CPU.  With two 1MB buffers the stock and TOE numbers are about the same,
> > > > > but the DDP usage is about 5% of a single CPU.
> > > > > 
> > > > > Note that these numbers are with aio_read().  read() fares a bit better (180%
> > > > > for stock and 70% for TOE).  Before the AIO rework, trying to use aio_read()
> > > > > with two buffers in a ping-pong used twice as much CPU as bare read(), but
> > > > > aio_read() in general is now fairly comparable to read() at least in terms of
> > > > > CPU overhead.
> > > > 
> > > > Could this improve the NFS client, etc.?
> > > 
> > > The NFS client is implemented in the kernel (and doesn't use the AIO
> > > interfaces), so that would be a bit trickier to manage.  OTOH, this could be
> > > useful for something like rsync if that had an option to use aio_read().
> > 
> > Maybe it is possible to create some additional, general API for use
> > inside the kernel, for nfsclient/nfsd/iscsi initiator/target/etc?
> > Ideally, used automatically.
> > 
> > As I see it, aio is required in userland for buffer pre-allocation and
> > pinning; please check me on this, but isn't that already true for all
> > in-kernel operations?
> 
> Not quite.  The NFS client just accepts whatever mbufs it gets with RPCs; it
> doesn't preallocate specific buffers that are copied into (OTOH, this means
> that when NFS handles things like metadata RPCs it is already "zero-copy").
> Also, because of the framing and the fact that you can't control the order in
> which RPCs are replied to, you can't queue buffers belonging to a file
> directly for DMA from the NIC (for example).

Oh, NFS/iSCSI have additional framing over TCP, so this can't be handled
simply.  Intel/Chelsio have NFS/iSCSI offload, but that is more complex, yes?
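As a rough illustration of the two-buffer aio_read() ping-pong described in
the commit log and measured with netperf above, a receive loop might look
like the sketch below.  This is not code from the commit or from the test
harness: the buffer size, the use of the FreeBSD-specific aio_waitcomplete(2)
to reap completions, and the minimal error handling are assumptions for
illustration only.  The key point is that the same two buffers are
resubmitted over and over, which is what lets the aio/DDP code keep them
wired and reuse their page pod mappings.

/*
 * Sketch of a two-buffer aio_read() ping-pong receive loop on a
 * connected TCP socket 's'.  BUFSZ is arbitrary (the numbers above
 * used 64K and 1MB buffers); error handling is abbreviated.
 */
#include <sys/types.h>
#include <aio.h>
#include <err.h>
#include <stdlib.h>
#include <string.h>

#define	BUFSZ	(64 * 1024)

static void
recv_loop(int s)
{
	struct aiocb cb[2], *done;
	char *buf[2];
	ssize_t n;
	int i;

	/* Queue reads into both buffers so the second is always pending. */
	for (i = 0; i < 2; i++) {
		if ((buf[i] = malloc(BUFSZ)) == NULL)
			err(1, "malloc");
		memset(&cb[i], 0, sizeof(cb[i]));
		cb[i].aio_fildes = s;
		cb[i].aio_buf = buf[i];
		cb[i].aio_nbytes = BUFSZ;
		if (aio_read(&cb[i]) != 0)
			err(1, "aio_read");
	}

	for (;;) {
		/* Reap the oldest completed request (blocks if none). */
		n = aio_waitcomplete(&done, NULL);
		if (n < 0)
			err(1, "aio_waitcomplete");
		if (n == 0)
			break;		/* connection closed */

		/* ... consume 'n' bytes at done->aio_buf here ... */

		/*
		 * Requeue the same aiocb/buffer; reusing the buffer is
		 * what allows its wiring and mappings to be reused.
		 */
		if (aio_read(done) != 0)
			err(1, "aio_read");
	}
}

With DDP enabled on a TOE socket, each requeued buffer can be handed to the
adapter for direct placement, which is presumably where the CPU savings
quoted above come from.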