From owner-freebsd-net@FreeBSD.ORG Fri Sep 7 21:51:51 2012
From: Jeremiah Lott
Date: Fri, 7 Sep 2012 17:44:48 -0400
To: freebsd-net@FreeBSD.org
Cc: freebsd-bugs@FreeBSD.org
Subject: Re: kern/167325: [netinet] [patch] sosend sometimes return EINVAL with TSO and VLAN on 82599 NIC
Message-Id: <8339513D-6727-4343-A86E-4F5BB1F9827D@averesystems.com>
In-Reply-To: <201204270607.q3R67TiO026862@freefall.freebsd.org>
References: <201204270607.q3R67TiO026862@freefall.freebsd.org>
List-Id: Networking and TCP/IP with FreeBSD

On Apr 27, 2012, at 2:07 AM, linimon@FreeBSD.org wrote:

> Old Synopsis: sosend sometimes return EINVAL
> with TSO and VLAN on 82599 NIC
> New Synopsis: [netinet] [patch] sosend sometimes return EINVAL with TSO and VLAN on 82599 NIC

> http://www.freebsd.org/cgi/query-pr.cgi?pr=167325

I did an analysis of this pr a while back and I figured I'd share.
Definitely looks like a real problem here, but at least in 8.2 it is
difficult to hit it. First off, vlan tagging is not required to hit
this. The code in question does not account for any amount of
link-local header, so you can reproduce the bug even without vlans.

In order to trigger it, the tcp stack must choose to send a tso "packet"
with a total size (including tcp+ip header and options, but not
link-local header) between 65522 and 65535 bytes (because adding the
14-byte link-local header will then exceed the 64K limit). In 8.1, the
tcp stack only chooses to send tso bursts that will result in full
mtu-size on-wire packets. To achieve this, it will truncate the tso
packet size to be a multiple of mss, not including header and tcp
options. The check has been relaxed a little in head, but the same basic
check is still there. None of the "normal" mtus have multiples falling
in this range. To reproduce it I used an mtu of 1445. When timestamps
are in use, every packet has a 40-byte tcp/ip header + 10 bytes for the
timestamp option + 2 bytes pad. You can get a packet length of 65523 as
follows:

    65523 - (40 + 10 + 2) = 65471  (size of tso packet data)
    65471 / 47            = 1393   (size of data per on-wire packet)
    1393 + (40 + 10 + 2)  = 1445   (mtu is data + header + options + pad)

Once you set your mtu to 1445, you need a program that can get the stack
to send a maximum sized packet. With the congestion window that can be
more difficult than it seems. I used some python that sends enough data
to open the window, sleeps long enough to drain all outstanding data,
but not long enough for the congestion window to go stale and close
again, then sends a bunch more data.
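The original script is not reproduced here. The following is only a minimal sketch of the send, sleep, send pattern just described. The function names, byte counts and pause length are hypothetical and would have to be tuned by experiment; none of them come from the original program.

```python
import socket
import time


def send_exactly(sock, nbytes, chunk=64 * 1024):
    """Write nbytes of filler to a connected socket; return the count sent."""
    block = b"\0" * chunk
    remaining = nbytes
    while remaining > 0:
        piece = block[:min(chunk, remaining)]
        # On an affected system, the EINVAL from the pr would presumably
        # surface here as an OSError raised by sendall().
        sock.sendall(piece)
        remaining -= len(piece)
    return nbytes


def burst_pause_burst(sock, warmup_bytes, pause_s, burst_bytes):
    """Open the window, let outstanding data drain, then send a large burst.

    warmup_bytes, pause_s and burst_bytes are assumptions to be tuned:
    the pause must be long enough to drain the send buffer but short
    enough that the congestion window does not go stale.
    """
    sent = send_exactly(sock, warmup_bytes)
    time.sleep(pause_s)
    sent += send_exactly(sock, burst_bytes)
    return sent
```

Against a real receiver the socket would come from something like socket.create_connection((host, port)), with the interface mtu already set to 1445 and a peer that reads and discards everything it gets.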
It also helps to turn off delayed acks on the receiver. Sometimes you
will not drain the entire send buffer because an ack for the final chunk
is still delayed when you start the second transmit. When the problem
described in the pr hits, the EINVAL from bus_dmamap_load_mbuf_sg
bubbles right up to userspace.

At first I thought this was a driver bug rather than a stack bug. The
code in question does what it is commented to do (limit the tso packet
so that ip->ip_len does not overflow). However, it also seems reasonable
that the driver limit its dma tag at 64K (do we really want it
allocating another whole page just for the 14-byte link-local header?).
Perhaps the tcp stack should ensure that the tso packet + max_linkhdr is
< 64K. Comments?

As an aside, the patch attached to the pr is also slightly wrong. Taking
the max_linkhdr into account when rounding the packet to be a multiple
of mss does not make sense; it should only take it into account when
calculating the max tso length.

Jeremiah Lott
Avere Systems
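For anyone who wants to check the numbers above, here is a small model of the length calculation. It is a simplified sketch, not the kernel code: the function and constant names are made up for illustration, and it assumes the stack caps the burst at the ip_len limit and then rounds the data down to a multiple of the per-packet data size. The link_hdr argument is subtracted only from the maximum tso length and never enters the rounding, which is the suggestion made in the aside.

```python
IP_MAXPACKET = 65535   # largest value that fits in ip->ip_len
HDR_LEN = 40 + 10 + 2  # tcp/ip header + timestamp option + pad, as above
LINK_HDR = 14          # link-local header size quoted in the text


def tso_total_len(mtu, avail, link_hdr=0, hdr_len=HDR_LEN):
    """Model: cap the burst, then round the data down to a multiple of mss.

    link_hdr=0 models the behaviour described in the pr; a non-zero
    link_hdr models reserving room for the link header in the cap only.
    """
    per_packet = mtu - hdr_len                 # data per on-wire packet
    limit = IP_MAXPACKET - hdr_len - link_hdr  # max data in one tso burst
    data = min(avail, limit)
    data -= data % per_packet                  # rounding ignores link_hdr
    return data + hdr_len


def fits_dma_limit(total_len, link_hdr=LINK_HDR, limit=IP_MAXPACKET):
    """True if the tso packet plus link header stays under the 64K limit."""
    return total_len + link_hdr <= limit


if __name__ == "__main__":
    for mtu in (1445, 1500, 9000):
        for reserve in (0, LINK_HDR):
            total = tso_total_len(mtu, 10**6, link_hdr=reserve)
            print(mtu, reserve, total, fits_dma_limit(total))
```

With an mtu of 1445 and no reservation the model gives the 65523-byte packet from the arithmetic above, which no longer fits once the 14-byte header is added. With the reservation it drops to 46 segments and fits.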