From: John Baldwin <jhb@freebsd.org>
To: src-committers@freebsd.org
Cc: svn-src-all@freebsd.org, svn-src-head@freebsd.org
Subject: Re: svn commit: r299210 - in head/sys/dev/cxgbe: . tom
Date: Fri, 06 May 2016 17:52:15 -0700
Message-ID: <3138889.ZBJ52FyIMB@ralph.baldwin.cx>
In-Reply-To: <201605070033.u470XZCs075568@repo.freebsd.org>
References: <201605070033.u470XZCs075568@repo.freebsd.org>

On Saturday, May 07, 2016 12:33:35 AM John Baldwin wrote:
> Author: jhb
> Date: Sat May  7 00:33:35 2016
> New Revision: 299210
> URL: https://svnweb.freebsd.org/changeset/base/299210
>
> Log:
>   Use DDP to implement zerocopy TCP receive with aio_read().
>
>   Chelsio's TCP offload engine supports direct DMA of received TCP payload
>   into wired user buffers.  This feature is known as Direct-Data Placement.
>   However, to scale well the adapter needs to prepare buffers for DDP
>   before data arrives.  aio_read() is more amenable to this requirement than
>   read() as applications often call read() only after data is available in
>   the socket buffer.
>
>   When DDP is enabled, TOE sockets use the recently added pru_aio_queue
>   protocol hook to claim aio_read(2) requests instead of letting them use
>   the default AIO socket logic.  The DDP feature supports scheduling DMA
>   to two buffers at a time so that the second buffer is ready for use
>   after the first buffer is filled.  The aio/DDP code optimizes the case
>   of an application ping-ponging between two buffers (similar to the
>   zero-copy bpf(4) code) by keeping the two most recently used AIO buffers
>   wired.  If a buffer is reused, the aio/DDP code is able to reuse the
>   vm_page_t array as well as page pod mappings (a kind of MMU mapping the
>   Chelsio NIC uses to describe user buffers).  The generation of the
>   vmspace of the calling process is used in conjunction with the user
>   buffer's address and length to determine if a user buffer matches a
>   previously used buffer.  If an application queues a buffer for AIO that
>   does not match a previously used buffer then the least recently used
>   buffer is unwired before the new buffer is wired.
>   This ensures that no more than two user buffers per socket are ever wired.
>
>   Note that this feature is best suited to applications sending a steady
>   stream of data vs short bursts of traffic.
>
>   Discussed with:	np
>   Relnotes:	yes
>   Sponsored by:	Chelsio Communications

The primary tool I used for evaluating performance was netperf's TCP stream
test.  It is a best case for this feature (a constant stream of traffic), but
that is also the intended use case.

Using two 64K buffers in a ping-pong via aio_read() to receive a 40Gbps
stream used about two full CPUs (~190% CPU usage) on a single-package Intel
E5-1620 v3 @ 3.50GHz with the stock TCP stack.  Enabling TOE brings the usage
down to about 110% of a CPU.  With DDP, the usage is around 30% of a single
CPU.  With two 1MB buffers the stock and TOE numbers are about the same, but
the DDP usage drops to about 5% of a single CPU.

Note that these numbers are with aio_read().  read() fares a bit better (180%
for stock and 70% for TOE).  Before the AIO rework, trying to use aio_read()
with two buffers in a ping-pong used twice as much CPU as bare read(), but
aio_read() in general is now fairly comparable to read(), at least in terms
of CPU overhead.

-- 
John Baldwin
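
For illustration, a minimal userland sketch of the two-buffer ping-pong
receive pattern described above might look like the following.  It uses
portable POSIX AIO calls plus FreeBSD's aio_waitcomplete(2); the function and
buffer names are illustrative rather than taken from the commit, error
handling is abbreviated, and "sock" is assumed to be a connected TCP socket
on a TOE-capable cxgbe interface.

/*
 * Sketch of the ping-pong receive pattern: two fixed buffers are kept
 * queued with aio_read() so the kernel always has a second buffer ready
 * (and wired, with its page pods intact) while the first one fills.
 */
#include <aio.h>
#include <err.h>
#include <string.h>

#define	BUFSIZE	(64 * 1024)

static void
pingpong_receive(int sock)
{
	static char buf[2][BUFSIZE];	/* static: reused, never reallocated */
	struct aiocb cb[2], *done;
	ssize_t n;
	int i;

	/* Queue both buffers up front so the second is ready immediately. */
	memset(cb, 0, sizeof(cb));
	for (i = 0; i < 2; i++) {
		cb[i].aio_fildes = sock;
		cb[i].aio_buf = buf[i];
		cb[i].aio_nbytes = BUFSIZE;
		if (aio_read(&cb[i]) != 0)
			err(1, "aio_read");
	}

	for (;;) {
		/* Reap whichever request finishes first (FreeBSD-specific). */
		n = aio_waitcomplete(&done, NULL);
		if (n == -1)
			err(1, "aio_waitcomplete");
		if (n == 0)
			break;		/* connection closed */

		/* ... consume 'n' bytes from (char *)done->aio_buf here ... */

		/* Requeue the same buffer so its wiring can be reused. */
		done->aio_nbytes = BUFSIZE;
		if (aio_read(done) != 0)
			err(1, "aio_read requeue");
	}
}

Because the same two addresses are resubmitted over and over, each request
after the first two should hit the driver's cache of previously used buffers
rather than wiring new pages.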
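The buffer-reuse check the log describes keys a cached DDP buffer on the
calling process's vmspace generation together with the user address and
length of the buffer.  A rough sketch of that idea follows; the structure and
field names here are invented for illustration and are not the driver's
actual data structures.

#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/*
 * Illustrative only -- not the cxgbe/tom code.  A previously wired buffer
 * is identified by the vmspace generation at the time it was wired plus
 * its user address and length, so a repeat aio_read() with the same buffer
 * can reuse the wired page array and page pod mapping.
 */
struct cached_ddp_buf {
	uint64_t	vm_generation;	/* vmspace generation when wired */
	uintptr_t	uaddr;		/* user virtual address */
	size_t		len;		/* buffer length */
	/* ... wired page array and page pod mapping would live here ... */
};

static bool
cached_buf_matches(const struct cached_ddp_buf *cb, uint64_t vm_generation,
    uintptr_t uaddr, size_t len)
{
	return (cb->vm_generation == vm_generation &&
	    cb->uaddr == uaddr && cb->len == len);
}

If no cached buffer matches an incoming request, the least recently used of
the two cached buffers is unwired and replaced, as described in the log.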