From: Luigi Rizzo <luigi@onelab2.iet.unipi.it>
Date: Wed, 7 Dec 2011 19:08:07 +0100
To: Andre Oppermann
Cc: current@freebsd.org, Jack Vogel, Daniel Kalchev
Subject: quick summary results with ixgbe (was Re: datapoints on 10G throughput with TCP ?)
Message-ID: <20111207180807.GA71878@onelab2.iet.unipi.it>
In-Reply-To: <4EDF471F.1030202@freebsd.org>

On Wed, Dec 07, 2011 at 11:59:43AM +0100, Andre Oppermann wrote:
> On 06.12.2011 22:06, Luigi Rizzo wrote:
...
> >Even in my experiments there is a lot of instability in the results.
> >I don't know exactly where the problem is, but the high number of
> >read syscalls, and the huge impact of setting interrupt_rate=0
> >(defaults at 16us on the ixgbe) makes me think that there is something
> >that needs investigation in the protocol stack.
> >
> >Of course we don't want to optimize specifically for the one-flow-at-10G
> >case, but devising something that makes the system less affected
> >by short timing variations, and can pass upstream interrupt mitigation
> >delays would help.
>
> I'm not sure the variance is only coming from the network card and
> driver side of things. The TCP processing and interactions with
> scheduler and locking probably play a big role as well. There have
> been many changes to TCP recently and maybe an inefficiency that
> affects high-speed single-session throughput has crept in. That's
> difficult to debug though.

I ran a bunch of tests on the ixgbe (82599) using RELENG_8 (which seems
slightly faster than HEAD), with MTU=1500 and various combinations of
card capabilities (hwcsum, tso, lro), different window sizes and
interrupt mitigation configurations.

The default interrupt mitigation latency is 16us; l=0 means no interrupt
mitigation. "lro" is the software implementation of LRO (tcp_lro.c),
"hwlro" is the hardware one (on the 82599). Using a window of 100 Kbytes
seems to give the best results (some back-of-the-envelope numbers are at
the end of this message).

Summary:

- with default interrupt mitigation, the fastest configuration has
  checksums enabled on both sender and receiver and LRO enabled on the
  receiver. This gets about 8.0 Gbit/s.

- LRO is especially good because it packs data packets together,
  passing mitigation upstream and removing duplicate work in the IP
  and TCP stack.

- disabling LRO on the receiver brings performance down to 6.5 Gbit/s.
  It also increases the CPU load (also in userspace).

- disabling checksums on the sender reduces transmit speed to 5.5 Gbit/s.

- with checksums disabled on both sides (and no LRO on the receiver),
  throughput goes down to 4.8 Gbit/s.

- I could not try the receive side with checksums off but LRO on.

- with default interrupt mitigation, setting both HWCSUM and TSO on the
  sender is really disruptive. Depending on the LRO settings on the
  receiver I get 1.5 to 3.2 Gbit/s, with huge variance.

- using both hwcsum and tso seems to work fine if you disable interrupt
  mitigation (reaching a peak of 9.4 Gbit/s).

- enabling software LRO on the transmit side actually slows down the
  throughput (4-5 Gbit/s instead of 8.0). I am not sure why (perhaps
  acks are delayed too much?). Adding a couple of lines in tcp_lro to
  reject pure acks seems to have a much better effect.

The tcp_lro patch below might actually be useful also for other cards.

--- tcp_lro.c	(revision 228284)
+++ tcp_lro.c	(working copy)
@@ -245,6 +250,8 @@
 	ip_len = ntohs(ip->ip_len);
 	tcp_data_len = ip_len - (tcp->th_off << 2) - sizeof (*ip);
+	if (tcp_data_len == 0)
+		return -1;	/* not on ack */
 	/*

cheers
luigi
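
P.S.: a couple of back-of-the-envelope numbers behind the settings above
(rounded, and assuming 1500-byte frames plus the usual Ethernet framing
overhead): 10 Gbit/s is roughly 810k frames/s on the wire, while the
default 16us mitigation allows at most ~62.5k interrupts/s, so each
interrupt has to deliver on the order of a dozen frames; that
per-interrupt batch is what LRO can merge into one or two large segments
before the IP/TCP code sees them. As for the window, 100 Kbytes drain in
about 80us at 10 Gbit/s, so that window sustains line rate only while the
effective RTT (real RTT plus mitigation and stack latency) stays below
roughly 80us, which may be part of why the 16us mitigation delay is so
visible on a single flow.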
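
Also, in case someone wants to reproduce the setup without my exact
tools: below is a minimal sketch of a single-flow TCP sink in C. It is
not the program behind the numbers above, just an illustration of where
the knobs are -- the receive buffer set on the listening socket is what
bounds the advertised window (the "window" in the tests), and counting
read() calls shows how large the reads get with and without LRO. The
port number and default buffer size are arbitrary.

/*
 * Minimal single-flow TCP sink (sketch, not the actual test tool).
 * Usage: ./sink [port] [rcvbuf-bytes], e.g. ./sink 5001 102400
 * Build: cc -O2 -o sink sink.c
 */
#include <sys/types.h>
#include <sys/socket.h>
#include <netinet/in.h>
#include <arpa/inet.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <unistd.h>
#include <time.h>

int
main(int argc, char *argv[])
{
	int port = (argc > 1) ? atoi(argv[1]) : 5001;
	int rcvbuf = (argc > 2) ? atoi(argv[2]) : 100 * 1024;
	int ls, s, one = 1;
	struct sockaddr_in sin;
	char buf[65536];
	ssize_t n;
	unsigned long long total = 0, nreads = 0;
	struct timespec t0, t1;
	double dt;

	ls = socket(AF_INET, SOCK_STREAM, 0);
	setsockopt(ls, SOL_SOCKET, SO_REUSEADDR, &one, sizeof(one));
	/*
	 * Set the receive buffer on the listening socket: the accepted
	 * socket inherits it and the window scale is negotiated on the
	 * SYN, so this is the "window" knob of the tests.
	 */
	setsockopt(ls, SOL_SOCKET, SO_RCVBUF, &rcvbuf, sizeof(rcvbuf));
	memset(&sin, 0, sizeof(sin));
	sin.sin_family = AF_INET;
	sin.sin_addr.s_addr = htonl(INADDR_ANY);
	sin.sin_port = htons(port);
	if (bind(ls, (struct sockaddr *)&sin, sizeof(sin)) < 0 ||
	    listen(ls, 1) < 0) {
		perror("bind/listen");
		return (1);
	}
	if ((s = accept(ls, NULL, NULL)) < 0) {
		perror("accept");
		return (1);
	}
	clock_gettime(CLOCK_MONOTONIC, &t0);
	while ((n = read(s, buf, sizeof(buf))) > 0) {
		total += (unsigned long long)n;
		nreads++;	/* count syscalls: LRO should make reads larger */
	}
	clock_gettime(CLOCK_MONOTONIC, &t1);
	dt = (t1.tv_sec - t0.tv_sec) + (t1.tv_nsec - t0.tv_nsec) / 1e9;
	printf("%.3f Gbit/s, %llu reads, avg %.0f bytes/read\n",
	    dt > 0 ? total * 8 / dt / 1e9 : 0.0, nreads,
	    nreads ? (double)total / nreads : 0.0);
	close(s);
	close(ls);
	return (0);
}

A matching sender would just connect(), set SO_SNDBUF the same way
before the handshake, and write large buffers in a loop, so the syscall
rate stays low and the limit is the window and the NIC rather than the
copy path.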