From: Andre Oppermann <oppermann@networx.ch>
Date: Wed, 12 Sep 2012 08:06:58 +0200
To: Jeremiah Lott
Cc: freebsd-net@FreeBSD.org, freebsd-bugs@FreeBSD.org
Subject: Re: kern/167325: [netinet] [patch] sosend sometimes return EINVAL with TSO and VLAN on 82599 NIC
Message-ID: <50502682.4010103@networx.ch>
In-Reply-To: <8339513D-6727-4343-A86E-4F5BB1F9827D@averesystems.com>
References: <201204270607.q3R67TiO026862@freefall.freebsd.org> <8339513D-6727-4343-A86E-4F5BB1F9827D@averesystems.com>
List-Id: Networking and TCP/IP with FreeBSD

On 07.09.2012 23:44, Jeremiah Lott wrote:
> On Apr 27, 2012, at 2:07 AM, linimon@FreeBSD.org wrote:
>
>> Old Synopsis: sosend sometimes return EINVAL with TSO and VLAN on 82599 NIC
>> New Synopsis: [netinet] [patch] sosend sometimes return EINVAL with TSO and VLAN on 82599 NIC
>>
>> http://www.freebsd.org/cgi/query-pr.cgi?pr=167325
>
> I did an analysis of this pr
> a while back and I figured I'd share. It definitely looks like a real problem, but at least
> in 8.2 it is difficult to hit. First off, vlan tagging is not required to hit this. The code
> in question does not account for any amount of link-local header, so you can reproduce the
> bug even without vlans.
>
> In order to trigger it, the tcp stack must choose to send a tso "packet" with a total size
> (including the tcp+ip header and options, but not the link-local header) between 65522 and
> 65535 bytes, because adding the 14-byte link-local header will then exceed the 64K limit.
> In 8.1, the tcp stack only chooses to send tso bursts that will result in full mtu-size
> on-wire packets. To achieve this, it truncates the tso packet size to a multiple of the mss,
> not including the header and tcp options. The check has been relaxed a little in head, but
> the same basic check is still there. None of the "normal" mtus have multiples falling in
> this range. To reproduce it I used an mtu of 1445. When timestamps are in use, every packet
> has a 40-byte tcp/ip header + 10 bytes for the timestamp option + 2 bytes of pad. You can
> get a packet length of 65523 as follows:
>
>   65523 - (40 + 10 + 2) = 65471   (size of tso packet data)
>   65471 / 47            = 1393    (size of data per on-wire packet)
>   1393 + (40 + 10 + 2)  = 1445    (mtu is data + header + options + pad)
>
> Once you set your mtu to 1445, you need a program that can get the stack to send a
> maximum-sized tso packet. With the congestion window in play that can be more difficult than
> it seems. I used some python that sends enough data to open the window, sleeps long enough
> to drain all outstanding data (but not long enough for the congestion window to go stale and
> close again), then sends a bunch more data. It also helps to turn off delayed acks on the
> receiver. Sometimes you will not drain the entire send buffer because an ack for the final
> chunk is still delayed when you start the second transmit.
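[Editorial note: the size arithmetic quoted above can be checked with a short script. This is
a sketch using only constants quoted from the analysis (65535-byte IP length limit, 14-byte
Ethernet header, 52 bytes of tcp/ip header plus timestamp option and pad, 47 on-wire
segments); the names are illustrative, not from the FreeBSD source.]

```python
# Reconstruct the mtu-1445 case from the analysis above.
HDR = 40 + 10 + 2            # tcp/ip header + timestamp option + pad, per segment
ETHER_HDR = 14               # link-local (Ethernet) header, not counted in ip_len
IP_MAXPACKET = 65535         # 64K - 1 limit on ip_len

mtu = 1445
mss = mtu - HDR              # data per on-wire packet
segs = 47                    # on-wire packets in one tso burst
tso_data = mss * segs        # tso payload, a multiple of mss
tso_len = tso_data + HDR     # ip_len of the tso "packet"

print(mss)                   # 1393
print(tso_data)              # 65471
print(tso_len)               # 65523 -- inside the 65522..65535 danger range
print(tso_len + ETHER_HDR)   # 65537 -- on-wire frame exceeds the 64K dma limit

assert 65522 <= tso_len <= IP_MAXPACKET
assert tso_len + ETHER_HDR > IP_MAXPACKET
```

So the tso packet itself is legal (65523 <= 65535), but once the driver prepends the 14-byte
Ethernet header the frame no longer fits the 64K dma tag, and bus_dmamap_load_mbuf_sg fails.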
> When the problem described in the pr hits, the EINVAL from bus_dmamap_load_mbuf_sg bubbles
> right up to userspace.
>
> At first I thought this was a driver bug rather than a stack bug. The code in question does
> what it is commented to do (limit the tso packet so that ip->ip_len does not overflow).
> However, it also seems reasonable for the driver to limit its dma tag at 64K (do we really
> want it allocating another whole page just for the 14-byte link-local header?). Perhaps the
> tcp stack should ensure that the tso packet + max_linkhdr is < 64K. Comments?

Thank you for the analysis. I'm looking into it.

> As an aside, the patch attached to the pr is also slightly wrong. Taking max_linkhdr into
> account when rounding the packet to be a multiple of the mss does not make sense; it should
> only be taken into account when calculating the max tso length.

-- 
Andre
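[Editorial note: the stack-side fix discussed above (cap the tso burst so that data + tcp/ip
headers + max_linkhdr stays within 64K, and only then round down to a multiple of the mss,
per the aside) can be sketched numerically. This is a hypothetical helper, not the actual
tcp_output() code; names and structure are illustrative.]

```python
IP_MAXPACKET = 65535   # 64K - 1

def max_tso_data(mss, hdrlen, max_linkhdr):
    """Largest tso payload, rounded down to a multiple of mss, such that
    payload + tcp/ip headers + link-local header still fits in 64K.
    max_linkhdr is subtracted when computing the limit (as suggested),
    but is NOT folded into the mss rounding itself (per the aside)."""
    limit = IP_MAXPACKET - hdrlen - max_linkhdr
    return (limit // mss) * mss

# With the mtu-1445 / timestamps setup from the analysis:
mss, hdrlen, max_linkhdr = 1393, 52, 14
data = max_tso_data(mss, hdrlen, max_linkhdr)
print(data)            # 64078 = 46 * 1393, i.e. one on-wire segment fewer

assert data % mss == 0
assert data + hdrlen + max_linkhdr <= IP_MAXPACKET
```

Under these assumptions the stack would send a 46-segment burst (64078 + 52 = 64130 bytes of
ip_len) instead of the problematic 47-segment one, leaving room for the 14-byte link header.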