From owner-freebsd-current@FreeBSD.ORG  Sun Nov 18 23:26:36 2007
Return-Path: <owner-freebsd-current@FreeBSD.ORG>
Delivered-To: freebsd-current@freebsd.org
Received: from mx1.freebsd.org (mx1.freebsd.org [IPv6:2001:4f8:fff6::34])
	by hub.freebsd.org (Postfix) with ESMTP id 69C1D16A417;
	Sun, 18 Nov 2007 23:26:36 +0000 (UTC)
	(envelope-from mandrews@bit0.com)
Received: from mindcrime.bit0.com (bit0.com [207.246.88.211])
	by mx1.freebsd.org (Postfix) with ESMTP id CDEF713C4AC;
	Sun, 18 Nov 2007 23:26:35 +0000 (UTC)
	(envelope-from mandrews@bit0.com)
Received: from localhost (localhost.bit0.com [127.0.0.1])
	by mindcrime.bit0.com (Postfix) with ESMTP id 8222A1E3379;
	Sun, 18 Nov 2007 18:26:25 -0500 (EST)
X-Virus-Scanned: amavisd-new at bit0.com
Received: from mindcrime.bit0.com ([127.0.0.1])
	by localhost (mindcrime.int.bit0.com [127.0.0.1]) (amavisd-new,
	port 10024)
	with ESMTP id eRYh6LHWaU2P; Sun, 18 Nov 2007 18:26:23 -0500 (EST)
Received: from localhost (localhost.bit0.com [127.0.0.1])
	(using TLSv1 with cipher DHE-RSA-AES256-SHA (256/256 bits))
	(No client certificate requested)
	by mindcrime.bit0.com (Postfix) with ESMTP;
	Sun, 18 Nov 2007 18:26:23 -0500 (EST)
Date: Sun, 18 Nov 2007 18:26:23 -0500 (EST)
From: Mike Andrews <mandrews@bit0.com>
X-X-Sender: mandrews@mindcrime.int.bit0.com
To: Jack Vogel <jfvogel@gmail.com>
In-Reply-To: <2a41acea0711181140w6707b85p18ac9a483ae367b7@mail.gmail.com>
Message-ID: <20071118181625.Y19404@mindcrime.int.bit0.com>
References: <20071117003504.R31357@mindcrime.int.bit0.com> 
	<20071117170537.F59492@mindcrime.int.bit0.com> 
	<b1fa29170711171519r65473426s1b9f3d9666ff6a92@mail.gmail.com> 
	<20071117182232.T59492@mindcrime.int.bit0.com> 
	<b1fa29170711171619x24233a3cw4361e0f3ca395e4c@mail.gmail.com> 
	<473F9552.50402@bit0.com>
	<b1fa29170711171804x36e4ae51ie03d01e4bc0220ac@mail.gmail.com>
	<473FBD1A.8010207@bit0.com>
	<20071118030305.N99375@mindcrime.int.bit0.com>
	<2a41acea0711181133n5f63f932m714a4a6b790937c0@mail.gmail.com>
	<2a41acea0711181140w6707b85p18ac9a483ae367b7@mail.gmail.com>
MIME-Version: 1.0
Content-Type: TEXT/PLAIN; charset=US-ASCII; format=flowed
Cc: Denis Shaposhnikov <dsh@vlink.ru>, Kip Macy <kip.macy@gmail.com>,
	Mike Silbersack <silby@freebsd.org>,
	Andre Oppermann <andre@freebsd.org>, freebsd-current@freebsd.org
Subject: Re: bizarre em + TSO + MSS issue in RELENG_7
X-BeenThere: freebsd-current@freebsd.org
X-Mailman-Version: 2.1.5
Precedence: list
List-Id: Discussions about the use of FreeBSD-current
	<freebsd-current.freebsd.org>
List-Unsubscribe: <http://lists.freebsd.org/mailman/listinfo/freebsd-current>, 
	<mailto:freebsd-current-request@freebsd.org?subject=unsubscribe>
List-Archive: <http://lists.freebsd.org/pipermail/freebsd-current>
List-Post: <mailto:freebsd-current@freebsd.org>
List-Help: <mailto:freebsd-current-request@freebsd.org?subject=help>
List-Subscribe: <http://lists.freebsd.org/mailman/listinfo/freebsd-current>,
	<mailto:freebsd-current-request@freebsd.org?subject=subscribe>
X-List-Received-Date: Sun, 18 Nov 2007 23:26:36 -0000

On Sun, 18 Nov 2007, Jack Vogel wrote:

> On Nov 18, 2007 11:33 AM, Jack Vogel <jfvogel@gmail.com> wrote:
>>
>> On Nov 18, 2007 12:58 AM, Mike Andrews <mandrews@bit0.com> wrote:
>>>
>>> On Sat, 17 Nov 2007, Mike Andrews wrote:
>>>
>>>> Kip Macy wrote:
>>>>> On Nov 17, 2007 5:28 PM, Mike Andrews <mandrews@bit0.com> wrote:
>>>>>> Kip Macy wrote:
>>>>>>> On Nov 17, 2007 3:23 PM, Mike Andrews <mandrews@bit0.com> wrote:
>>>>>>>> On Sat, 17 Nov 2007, Kip Macy wrote:
>>>>>>>>
>>>>>>>>> On Nov 17, 2007 2:33 PM, Mike Andrews <mandrews@bit0.com> wrote:
>>>>>>>>>> On Sat, 17 Nov 2007, Kip Macy wrote:
>>>>>>>>>>
>>>>>>>>>>> On Nov 17, 2007 10:33 AM, Denis Shaposhnikov <dsh@vlink.ru> wrote:
>>>>>>>>>>>> On Sat, 17 Nov 2007 00:42:54 -0500 (EST)
>>>>>>>>>>>> Mike Andrews <mandrews@bit0.com> wrote:
>>>>>>>>>>>>
>>>>>>>>>>>>> Has anyone run into problems with MSS not being respected when
>>>>>>>>>>>>> using
>>>>>>>>>>>>> TSO, specifically on em cards?
>>>>>>>>>>>> Yes, I wrote about this problem on the beginning of 2007, see
>>>>>>>>>>>>
>>>>>>>>>>>>     http://tinyurl.com/3e5ak5
>>>>>>>>>>>>
>>>>>>>>>>> if_em.c:3502
>>>>>>>>>>>        /*
>>>>>>>>>>>         * Payload size per packet w/o any headers.
>>>>>>>>>>>         * Length of all headers up to payload.
>>>>>>>>>>>         */
>>>>>>>>>>>        TXD->tcp_seg_setup.fields.mss =
>>>>>>>>>>> htole16(mp->m_pkthdr.tso_segsz);
>>>>>>>>>>>        TXD->tcp_seg_setup.fields.hdr_len = hdr_len;
>>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>>> Please print out the value of tso_segsz here. It appears to be being
>>>>>>>>>>> set correctly. The only thing I can think of is that t_maxopd is not
>>>>>>>>>>> correct. As tso_segsz is correct here:
>>>>>>>>>> It repeatedly prints 1368 during a 1 meg file transfer over a
>>>>>>>>>> connection
>>>>>>>>>> with a 1380 MSS.  Any other printf's I can add?  I'm working on a web
>>>>>>>>>> page
>>>>>>>>>> with tcpdump / firewall log output illustrating the issue...
>>>>>>>>> Mike -
>>>>>>>>> Denis' tcpdump output doesn't show oversized segments, something else
>>>>>>>>> appears to be happening there. Can you post your tcpdump output
>>>>>>>>> somewhere?
>>>>>>>> URL sent off-list.
>>>>>>>        if (tso) {
>>>>>>>                m->m_pkthdr.csum_flags = CSUM_TSO;
>>>>>>>                m->m_pkthdr.tso_segsz = tp->t_maxopd - optlen;
>>>>>>>        }
>>>>>>>
>>>>>>>
>>>>>>> Please print the value of maxopd and optlen under "if (tso)" in
>>>>>>> tcp_output. I think the calculated optlen may be too small.
>>>>>>
>>>>>> maxopt=1380 - optlen=12 = tso_segsz=1368
>>>>>>
>>>>>> Weird though, after this reboot, I had to re-copy a 4 meg file 5 times
>>>>>> to start getting the firewall to log any drops.  Transfer rate was
>>>>>> around 240KB/sec before the firewall started to drop, then it went down
>>>>>> to about 64KB/sec during the 5th copy, and stayed there for subsequent
>>>>>> copies.  The actual packet size the firewall said it was dropping was
>>>>>> varying all over the place still, yet the maxopt/optlen/tso_segsz values
>>>>>> stayed constant.  But it's interesting that it didn't start dropping
>>>>>> immediately after the reboot -- though the transfer rate was still
>>>>>> sub-optimal.
>>>>>
>>>>> Ok, next theory :D. You shouldn't be seeing "bad len" packets from
>>>>> tcpdump. I'm wondering if that means you're sending down more than
>>>>> 64k. Can you please print out the value of mp->m_pkthdr.len around the
>>>>> same place that you printed out tso_segsz? 64k is the generally
>>>>> accepted limit for TSO, I'm wondering if the card firmware does
>>>>> something weird if you give it more.
>>>>
>>>> OK.  In that last message, where I said it took 5 times to start reproducing
>>>> the problem... this time it took until I actually toggled TSO back off and
>>>> back on again, and then it started acting up again.  I don't know what the
>>>> actual trigger is... it's very weird.
>>>>
>>>> Initially, w/ TSO on and it wasn't dropping yet (but was still transferring
>>>> slow)...
>>>>
>>>> BIT0 DEBUG: tso_segsz=1368  hdr_len=66  mp->m_pkthdr.len=8306
>>>> BIT0 DEBUG: tso_segsz=1368  hdr_len=66  mp->m_pkthdr.len=8306
>>>> BIT0 DEBUG: tso_segsz=1368  hdr_len=66  mp->m_pkthdr.len=8306
>>>> BIT0 DEBUG: tso_segsz=1368  hdr_len=66  mp->m_pkthdr.len=8306
>>>> (etc, always 8306)
>>>>
>>>> After toggling off/on which caused the drops to start (and the speed to drop
>>>> even further):
>>>>
>>>> BIT0 DEBUG: tso_segsz=1368  hdr_len=66  mp->m_pkthdr.len=7507
>>>> BIT0 DEBUG: tso_segsz=1368  hdr_len=66  mp->m_pkthdr.len=3053
>>>> BIT0 DEBUG: tso_segsz=1368  hdr_len=66  mp->m_pkthdr.len=1677
>>>> BIT0 DEBUG: tso_segsz=1368  hdr_len=66  mp->m_pkthdr.len=3037
>>>> BIT0 DEBUG: tso_segsz=1368  hdr_len=66  mp->m_pkthdr.len=2264
>>>> BIT0 DEBUG: tso_segsz=1368  hdr_len=66  mp->m_pkthdr.len=1656
>>>> BIT0 DEBUG: tso_segsz=1368  hdr_len=66  mp->m_pkthdr.len=1902
>>>> BIT0 DEBUG: tso_segsz=1368  hdr_len=66  mp->m_pkthdr.len=1888
>>>> BIT0 DEBUG: tso_segsz=1368  hdr_len=66  mp->m_pkthdr.len=1640
>>>> BIT0 DEBUG: tso_segsz=1368  hdr_len=66  mp->m_pkthdr.len=1871
>>>> BIT0 DEBUG: tso_segsz=1368  hdr_len=66  mp->m_pkthdr.len=2461
>>>> BIT0 DEBUG: tso_segsz=1368  hdr_len=66  mp->m_pkthdr.len=1849
>>>> BIT0 DEBUG: tso_segsz=1368  hdr_len=66  mp->m_pkthdr.len=2092
>>>>
>>>> and so on, with more seemingly random lengths... but none of them ever over
>>>> 8306, much less 64K.
>>>
>>>
>>> Got a few more data points here.
>>>
>>> I can reproduce this on an i386 kernel, so it isn't amd64 specific.
>>>
>>> I can reproduce this on an 82541EI nic, so it isn't 82573 specific.
>>>
>>> I can't reproduce this on a Marvell Yukon II (msk) nic; it works fine
>>> whether TSO is on or off.
>>>
>>> I can't reproduce this on a bge nic because it doesn't support TSO :)
>>> That's the only other gigabit nic I've got easy access to.
>>>
>>> I can reproduce this with just a Cisco 877W IOS-based router and no Cisco
>>> PIX / ASA firewalls in the way, with the servers on the LAN interface with
>>> "ip tcp adjust-mss 1340" on it, and the downloading client on the Cisco's
>>> 802.11G interface.  This time, the client is a Macbook Pro running
>>> Leopard, and I'm running "tcpdump -i en1 -s 1500 -n -v length \> 1394" on
>>> the Macbook (not the server this time) to find oversize packets, which is
>>> actually handier because I can see how trashed they really get :)
>>>
>>> I can't reproduce this between two machines on the same subnet (though I
>>> can reproduce throughput problems alone).  I haven't tried lowering the
>>> system MSS on one end yet (is there a sysctl to lower the MSS for outbound
>>> connections without lowering the MTU as well?).  If I could do this it
>>> would greatly simplify testing for everyone as they wouldn't have to stick
>>> an MSS-clamping router in the middle.  It doesn't have to be Cisco.
>>>
>>> With this setup, copying to the Mac through the 877W from:
>>>
>>> msk-based server, TSO disabled: tcpdump reports no problems, file
>>> transfers are fast
>>>
>>> msk-based server, TSO enabled: tcpdump reports no problems, file
>>> transfers are fast
>>>
>>> em-based server, TSO disabled: tcpdump reports no problems, file
>>> transfers are fast
>>>
>>> em-based server, TSO enabled: tcpdump reports numerous oversize packets of
>>> varying sizes just as before, AND numerous packets with bad TCP checksums.
>>> The checksum problems aren't limited to only the large packets though.
>>> (That's probably what's causing the throughput problems.)  Toggling rxcsum
>>> and txcsum flags on the server made no difference.  What I haven't tried
>>> yet is hexdumping the packets to see what exactly is getting trashed.
>>>
>>> The problem still comes and goes; sometimes it'll work for a few minutes
>>> after boot, sometimes not; it might be dependent on what other traffic's
>>> going through the box.
>>
>> Hmmm, OK so the data is pointing to something in the em TSO  or encap
>> code. I will look into this tomorrow. So the necessary elements are systems
>> on two subnets and em doing the transmitting with TSO?

And a sub-1460 MSS on the client end OR the router doing MSS clamping, 
yes.  I can't yet reproduce it with 1500 byte MTUs or between two 
machines on the same subnet.  I definitely haven't done any tests with 
jumbo frames...
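
As a rough sanity check on the numbers quoted above, here is a minimal
sketch of the arithmetic a correct TSO split should satisfy.  It is not
code from if_em.c: the struct and function names are hypothetical, and
it assumes that hdr_len (66 in the debug output) counts the Ethernet, IP
and TCP headers together and that the NIC repeats exactly those headers
in front of every segment.

```c
#include <assert.h>
#include <stdint.h>

/*
 * Hypothetical helper (not part of the em driver): describes how one
 * large TSO packet is expected to be cut into frames on the wire.
 */
struct tso_split {
	uint32_t segments;	/* number of frames the NIC should emit */
	uint32_t max_frame;	/* largest expected frame, in bytes */
	uint32_t last_frame;	/* size of the final, possibly short, frame */
};

/*
 * pkt_len corresponds to mp->m_pkthdr.len, hdr_len to the driver's
 * hdr_len, and mss to mp->m_pkthdr.tso_segsz (payload bytes per frame).
 */
struct tso_split
tso_expected_split(uint32_t pkt_len, uint32_t hdr_len, uint32_t mss)
{
	struct tso_split s;
	uint32_t payload, tail;

	/* Assumption: the packet carries at least one payload byte. */
	assert(mss > 0 && pkt_len > hdr_len);

	payload = pkt_len - hdr_len;
	s.segments = (payload + mss - 1) / mss;		/* round up */
	tail = payload - (s.segments - 1) * mss;	/* bytes in last frame */
	s.max_frame = hdr_len + (s.segments > 1 ? mss : tail);
	s.last_frame = hdr_len + tail;
	return (s);
}
```

With the values logged earlier (m_pkthdr.len=8306, hdr_len=66,
tso_segsz=1368) this works out to 7 frames, none larger than 1434 bytes.
Under the assumptions above, anything bigger than that on the wire would
point at the segmentation step rather than at the value tcp_output
hands down.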

> BTW, not to dodge the problem, but this is a case where I'd say its absurd
> to be using TSO. Is the link at 1G or 100Mb?

It's reproducible at either speed, but I personally am perfectly happy 
leaving TSO disabled on my production boxes -- I've got my workaround, it 
performs, I'm cool.  At this point I'm pursuing a fix more for others' 
benefit because some other people are having at least throughput issues -- 
and for my own weirdo curiosity.

If a fix doesn't make 7.0-RELEASE (and I almost hate to say this) might it 
be worth disabling TSO by default in RELENG_7_0 but leaving it on in RELENG_7?