From: Luigi Rizzo <luigi@onelab2.iet.unipi.it>
Date: Wed, 7 Dec 2011 19:08:07 +0100
To: Andre Oppermann
Cc: current@freebsd.org, Jack Vogel, Daniel Kalchev
Subject: quick summary results with ixgbe (was Re: datapoints on 10G throughput with TCP ?)
Message-ID: <20111207180807.GA71878@onelab2.iet.unipi.it>
In-Reply-To: <4EDF471F.1030202@freebsd.org>

On Wed, Dec 07, 2011 at 11:59:43AM +0100, Andre Oppermann wrote:
> On 06.12.2011 22:06, Luigi Rizzo wrote:
...
> >Even in my experiments there is a lot of instability in the results.
> >I don't know exactly where the problem is, but the high number of
> >read syscalls, and the huge impact of setting interrupt_rate=0
> >(defaults at 16us on the ixgbe) makes me think that there is something
> >that needs investigation in the protocol stack.
> >
> >Of course we don't want to optimize specifically for the one-flow-at-10G
> >case, but devising something that makes the system less affected
> >by short timing variations, and can pass upstream interrupt mitigation
> >delays would help.
>
> I'm not sure the variance is only coming from the network card and
> driver side of things. The TCP processing and interactions with
> scheduler and locking probably play a big role as well. There have
> been many changes to TCP recently and maybe an inefficiency that
> affects high-speed single-session throughput has crept in. That's
> difficult to debug though.

I ran a bunch of tests on the ixgbe (82599) using RELENG_8 (which seems
slightly faster than HEAD), with MTU=1500 and various combinations of
card capabilities (hwcsum, tso, lro), different window sizes and
interrupt mitigation configurations.

The default interrupt mitigation latency is 16us; l=0 means no interrupt
mitigation. "lro" is the software implementation of LRO (tcp_lro.c),
"hwlro" is the hardware one (on the 82599). Using a window of 100 Kbytes
seems to give the best results (some back-of-the-envelope numbers are at
the end of this message).

Summary:

- with default interrupt mitigation, the fastest configuration has
  checksums enabled on both sender and receiver and LRO enabled on the
  receiver. This gets about 8.0 Gbit/s.

- LRO is especially good because it packs data packets together,
  passing mitigation upstream and removing duplicate work in the IP
  and TCP stack.

- disabling LRO on the receiver brings performance down to 6.5 Gbit/s.
  It also increases the CPU load (also in userspace).

- disabling checksums on the sender reduces transmit speed to 5.5 Gbit/s.

- with checksums disabled on both sides (and no LRO on the receiver),
  throughput goes down to 4.8 Gbit/s.

- I could not try the receive side with checksums off but LRO on.

- with default interrupt mitigation, setting both HWCSUM and TSO on the
  sender is really disruptive. Depending on the LRO settings on the
  receiver I get 1.5 to 3.2 Gbit/s, with huge variance.

- using both hwcsum and tso seems to work fine if you disable interrupt
  mitigation (reaching a peak of 9.4 Gbit/s).

- enabling software LRO on the transmit side actually slows down the
  throughput (4-5 Gbit/s instead of 8.0). I am not sure why (perhaps
  acks are delayed too much?). Adding a couple of lines in tcp_lro to
  reject pure acks seems to have a much better effect.

The tcp_lro patch below might actually be useful also for other cards.

--- tcp_lro.c	(revision 228284)
+++ tcp_lro.c	(working copy)
@@ -245,6 +250,8 @@
 	ip_len = ntohs(ip->ip_len);
 	tcp_data_len = ip_len - (tcp->th_off << 2) - sizeof (*ip);
+	if (tcp_data_len == 0)
+		return -1;	/* not on ack */
 	/*

cheers
luigi
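
P.S.: a couple of back-of-the-envelope numbers behind the settings above
(rounded, and assuming 1500-byte frames plus the usual Ethernet framing
overhead): 10 Gbit/s is roughly 810k frames/s on the wire, while the
default 16us mitigation allows at most ~62.5k interrupts/s, so each
interrupt has to deliver on the order of a dozen frames; that
per-interrupt batch is what LRO can merge into one or two large segments
before the IP/TCP code sees them. As for the window, 100 Kbytes drain in
about 80us at 10 Gbit/s, so that window sustains line rate only while the
effective RTT (real RTT plus mitigation and stack latency) stays below
roughly 80us, which may be part of why the 16us mitigation delay is so
visible on a single flow.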
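
Also, in case someone wants to reproduce the setup without my exact
tools: below is a minimal sketch of a single-flow TCP sink in C. It is
not the program behind the numbers above, just an illustration of where
the knobs are -- the receive buffer set on the listening socket is what
bounds the advertised window (the "window" in the tests), and counting
read() calls shows how large the reads get with and without LRO. The
port number and default buffer size are arbitrary.

/*
 * Minimal single-flow TCP sink (sketch, not the actual test tool).
 * Usage: ./sink [port] [rcvbuf-bytes], e.g. ./sink 5001 102400
 * Build: cc -O2 -o sink sink.c
 */
#include <sys/types.h>
#include <sys/socket.h>
#include <netinet/in.h>
#include <arpa/inet.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <unistd.h>
#include <time.h>

int
main(int argc, char *argv[])
{
	int port = (argc > 1) ? atoi(argv[1]) : 5001;
	int rcvbuf = (argc > 2) ? atoi(argv[2]) : 100 * 1024;
	int ls, s, one = 1;
	struct sockaddr_in sin;
	char buf[65536];
	ssize_t n;
	unsigned long long total = 0, nreads = 0;
	struct timespec t0, t1;
	double dt;

	ls = socket(AF_INET, SOCK_STREAM, 0);
	setsockopt(ls, SOL_SOCKET, SO_REUSEADDR, &one, sizeof(one));
	/*
	 * Set the receive buffer on the listening socket: the accepted
	 * socket inherits it and the window scale is negotiated on the
	 * SYN, so this is the "window" knob of the tests.
	 */
	setsockopt(ls, SOL_SOCKET, SO_RCVBUF, &rcvbuf, sizeof(rcvbuf));
	memset(&sin, 0, sizeof(sin));
	sin.sin_family = AF_INET;
	sin.sin_addr.s_addr = htonl(INADDR_ANY);
	sin.sin_port = htons(port);
	if (bind(ls, (struct sockaddr *)&sin, sizeof(sin)) < 0 ||
	    listen(ls, 1) < 0) {
		perror("bind/listen");
		return (1);
	}
	if ((s = accept(ls, NULL, NULL)) < 0) {
		perror("accept");
		return (1);
	}
	clock_gettime(CLOCK_MONOTONIC, &t0);
	while ((n = read(s, buf, sizeof(buf))) > 0) {
		total += (unsigned long long)n;
		nreads++;	/* count syscalls: LRO should make reads larger */
	}
	clock_gettime(CLOCK_MONOTONIC, &t1);
	dt = (t1.tv_sec - t0.tv_sec) + (t1.tv_nsec - t0.tv_nsec) / 1e9;
	printf("%.3f Gbit/s, %llu reads, avg %.0f bytes/read\n",
	    dt > 0 ? total * 8 / dt / 1e9 : 0.0, nreads,
	    nreads ? (double)total / nreads : 0.0);
	close(s);
	close(ls);
	return (0);
}

A matching sender would just connect(), set SO_SNDBUF the same way
before the handshake, and write large buffers in a loop, so the syscall
rate stays low and the limit is the window and the NIC rather than the
copy path.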