Date: Sun, 18 Nov 2007 03:58:58 -0500 (EST)
From: Mike Andrews <mandrews@bit0.com>
To: Kip Macy
Cc: Denis Shaposhnikov, Mike Silbersack, Andre Oppermann,
    freebsd-current@freebsd.org
Subject: Re: bizarre em + TSO + MSS issue in RELENG_7
Message-ID: <20071118030305.N99375@mindcrime.int.bit0.com>
In-Reply-To: <473FBD1A.8010207@bit0.com>

On Sat, 17 Nov 2007, Mike Andrews wrote:

> Kip Macy wrote:
>> On Nov 17, 2007 5:28 PM, Mike Andrews wrote:
>>> Kip Macy wrote:
>>>> On Nov 17, 2007 3:23 PM, Mike Andrews wrote:
>>>>> On Sat, 17 Nov 2007, Kip Macy wrote:
>>>>>
>>>>>> On Nov 17, 2007 2:33 PM, Mike Andrews wrote:
>>>>>>> On Sat, 17 Nov 2007, Kip Macy wrote:
>>>>>>>
>>>>>>>> On Nov 17, 2007 10:33 AM, Denis Shaposhnikov wrote:
>>>>>>>>> On Sat, 17 Nov 2007 00:42:54 -0500 (EST) Mike Andrews wrote:
>>>>>>>>>
>>>>>>>>>> Has anyone run into problems with MSS not being respected
>>>>>>>>>> when using TSO, specifically on em cards?
>>>>>>>>>
>>>>>>>>> Yes, I wrote about this problem on the beginning of 2007, see
>>>>>>>>>
>>>>>>>>> http://tinyurl.com/3e5ak5
>>>>>>>>>
>>>>>>>> if_em.c:3502
>>>>>>>>
>>>>>>>>         /*
>>>>>>>>          * Payload size per packet w/o any headers.
>>>>>>>>          * Length of all headers up to payload.
>>>>>>>>          */
>>>>>>>>         TXD->tcp_seg_setup.fields.mss =
>>>>>>>>             htole16(mp->m_pkthdr.tso_segsz);
>>>>>>>>         TXD->tcp_seg_setup.fields.hdr_len = hdr_len;
>>>>>>>>
>>>>>>>> Please print out the value of tso_segsz here.  It appears to be
>>>>>>>> being set correctly.  The only thing I can think of is that
>>>>>>>> t_maxopd is not correct.  As tso_segsz is correct here:
>>>>>>>
>>>>>>> It repeatedly prints 1368 during a 1 meg file transfer over a
>>>>>>> connection with a 1380 MSS.  Any other printf's I can add?
>>>>>>> I'm working on a web page with tcpdump / firewall log output
>>>>>>> illustrating the issue...
>>>>>>
>>>>>> Mike -
>>>>>> Denis' tcpdump output doesn't show oversized segments, something
>>>>>> else appears to be happening there.  Can you post your tcpdump
>>>>>> output somewhere?
>>>>>
>>>>> URL sent off-list.
>>>>
>>>>         if (tso) {
>>>>                 m->m_pkthdr.csum_flags = CSUM_TSO;
>>>>                 m->m_pkthdr.tso_segsz = tp->t_maxopd - optlen;
>>>>         }
>>>>
>>>> Please print the value of maxopd and optlen under "if (tso)" in
>>>> tcp_output.  I think the calculated optlen may be too small.
>>>
>>> maxopt=1380 - optlen=12 = tso_segsz=1368
>>>
>>> Weird though, after this reboot, I had to re-copy a 4 meg file 5
>>> times to start getting the firewall to log any drops.  Transfer rate
>>> was around 240KB/sec before the firewall started to drop, then it
>>> went down to about 64KB/sec during the 5th copy, and stayed there
>>> for subsequent copies.  The actual packet size the firewall said it
>>> was dropping was varying all over the place still, yet the
>>> maxopt/optlen/tso_segsz values stayed constant.  But it's
>>> interesting that it didn't start dropping immediately after the
>>> reboot -- though the transfer rate was still sub-optimal.
>>
>> Ok, next theory :D.  You shouldn't be seeing "bad len" packets from
>> tcpdump.  I'm wondering if that means you're sending down more than
>> 64k.  Can you please print out the value of mp->m_pkthdr.len around
>> the same place that you printed out tso_segsz?  64k is the generally
>> accepted limit for TSO, I'm wondering if the card firmware does
>> something weird if you give it more.
>
> OK.  In that last message, where I said it took 5 times to start
> reproducing the problem... this time it took until I actually toggled
> TSO back off and back on again, and then it started acting up again.
> I don't know what the actual trigger is... it's very weird.
>
> Initially, w/ TSO on and it wasn't dropping yet (but was still
> transferring slow)...
>
> BIT0 DEBUG: tso_segsz=1368 hdr_len=66 mp->m_pkthdr.len=8306
> BIT0 DEBUG: tso_segsz=1368 hdr_len=66 mp->m_pkthdr.len=8306
> BIT0 DEBUG: tso_segsz=1368 hdr_len=66 mp->m_pkthdr.len=8306
> BIT0 DEBUG: tso_segsz=1368 hdr_len=66 mp->m_pkthdr.len=8306
> (etc, always 8306)
>
> After toggling off/on which caused the drops to start (and the speed
> to drop even further):
>
> BIT0 DEBUG: tso_segsz=1368 hdr_len=66 mp->m_pkthdr.len=7507
> BIT0 DEBUG: tso_segsz=1368 hdr_len=66 mp->m_pkthdr.len=3053
> BIT0 DEBUG: tso_segsz=1368 hdr_len=66 mp->m_pkthdr.len=1677
> BIT0 DEBUG: tso_segsz=1368 hdr_len=66 mp->m_pkthdr.len=3037
> BIT0 DEBUG: tso_segsz=1368 hdr_len=66 mp->m_pkthdr.len=2264
> BIT0 DEBUG: tso_segsz=1368 hdr_len=66 mp->m_pkthdr.len=1656
> BIT0 DEBUG: tso_segsz=1368 hdr_len=66 mp->m_pkthdr.len=1902
> BIT0 DEBUG: tso_segsz=1368 hdr_len=66 mp->m_pkthdr.len=1888
> BIT0 DEBUG: tso_segsz=1368 hdr_len=66 mp->m_pkthdr.len=1640
> BIT0 DEBUG: tso_segsz=1368 hdr_len=66 mp->m_pkthdr.len=1871
> BIT0 DEBUG: tso_segsz=1368 hdr_len=66 mp->m_pkthdr.len=2461
> BIT0 DEBUG: tso_segsz=1368 hdr_len=66 mp->m_pkthdr.len=1849
> BIT0 DEBUG: tso_segsz=1368 hdr_len=66 mp->m_pkthdr.len=2092
>
> and so on, with more seemingly random lengths... but none of them
> ever over 8306, much less 64K.

Got a few more data points here.

I can reproduce this on an i386 kernel, so it isn't amd64 specific.

I can reproduce this on an 82541EI nic, so it isn't 82573 specific.

I can't reproduce this on a Marvell Yukon II (msk) nic; it works fine
whether TSO is on or off.
I can't reproduce this on a bge nic because it doesn't support TSO :)
That's the only other gigabit nic I've got easy access to.

I can reproduce this with just a Cisco 877W IOS-based router and no
Cisco PIX / ASA firewalls in the way: the servers sit on the LAN
interface, which has "ip tcp adjust-mss 1340" on it, and the
downloading client is on the Cisco's 802.11g interface.  This time the
client is a MacBook Pro running Leopard, and I'm running
"tcpdump -i en1 -s 1500 -n -v length \> 1394" on the MacBook (not the
server this time) to find oversize packets, which is actually handier
because I can see how trashed they really get :)

I can't reproduce this between two machines on the same subnet (though
I can still reproduce the throughput problem alone there).

I haven't tried lowering the system MSS on one end yet.  (Is there a
sysctl to lower the MSS for outbound connections without lowering the
MTU as well?)  If there is, it would greatly simplify testing for
everyone, since nobody would have to stick an MSS-clamping router in
the middle -- and it doesn't have to be a Cisco.

With this setup, copying to the Mac through the 877W from:

  msk-based server, TSO disabled:  tcpdump reports no problems, file
  transfers are fast.

  msk-based server, TSO enabled:   tcpdump reports no problems, file
  transfers are fast.

  em-based server, TSO disabled:   tcpdump reports no problems, file
  transfers are fast.

  em-based server, TSO enabled:    tcpdump reports numerous oversize
  packets of varying sizes just as before, AND numerous packets with
  bad TCP checksums.  The checksum problems aren't limited to the
  large packets, though; that's probably what's causing the throughput
  problems.  Toggling the rxcsum and txcsum flags on the server made
  no difference.

What I haven't tried yet is hexdumping the packets to see exactly what
is getting trashed.  The problem still comes and goes: sometimes it'll
work for a few minutes after boot, sometimes not.  It might depend on
what other traffic is going through the box.
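For anyone who wants to add the same instrumentation, the debug output
quoted above comes from a couple of printfs roughly like the ones
below.  This is a sketch rather than a verbatim diff -- the variable
names follow the RELENG_7 sources quoted earlier in the thread, and
the if_em.c line number is approximate:

    /*
     * sys/netinet/tcp_output.c, inside the "if (tso)" block quoted
     * above -- roughly the printf behind the
     * "maxopt=1380 - optlen=12 = tso_segsz=1368" line:
     */
    if (tso) {
            m->m_pkthdr.csum_flags = CSUM_TSO;
            m->m_pkthdr.tso_segsz = tp->t_maxopd - optlen;
            printf("BIT0 DEBUG: maxopd=%lu optlen=%u tso_segsz=%u\n",
                (u_long)tp->t_maxopd, (u_int)optlen,
                (u_int)m->m_pkthdr.tso_segsz);
    }

    /*
     * sys/dev/em/if_em.c, in the TSO setup path where the context
     * descriptor is filled in (around line 3502 in RELENG_7) --
     * roughly the printf behind the "BIT0 DEBUG: tso_segsz=...
     * hdr_len=... mp->m_pkthdr.len=..." lines:
     */
    TXD->tcp_seg_setup.fields.mss = htole16(mp->m_pkthdr.tso_segsz);
    TXD->tcp_seg_setup.fields.hdr_len = hdr_len;
    printf("BIT0 DEBUG: tso_segsz=%u hdr_len=%d mp->m_pkthdr.len=%d\n",
        (u_int)mp->m_pkthdr.tso_segsz, (int)hdr_len, mp->m_pkthdr.len);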
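And just to spell out the arithmetic behind the tcpdump filter: with
tso_segsz=1368 and hdr_len=66, a card that honors the MSS it's handed
should never emit a frame larger than 1368 + 66 = 1434 bytes, no
matter how big the TSO send is.  On the 1340-clamped path the same
math gives 1340 + 40 + 14 = 1394, which is where the "length \> 1394"
filter on the Mac comes from.  A trivial userland sketch of the
expected segmentation, using the numbers from the debug output above
(purely illustrative, not driver code):

    #include <stdio.h>

    int
    main(void)
    {
            int pkthdr_len = 8306;  /* mp->m_pkthdr.len from the debug output */
            int hdr_len = 66;       /* ethernet + IP + TCP incl. options */
            int mss = 1368;         /* tso_segsz = t_maxopd - optlen */
            int payload = pkthdr_len - hdr_len;
            int off, seg;

            /*
             * What a well-behaved card should put on the wire for one
             * TSO send: segments of at most mss bytes of payload, each
             * with a copy of the headers in front.
             */
            for (off = 0; off < payload; off += seg) {
                    seg = payload - off < mss ? payload - off : mss;
                    printf("segment: %4d bytes payload + %d header = %4d byte frame\n",
                        seg, hdr_len, seg + hdr_len);
            }
            return (0);
    }

That works out to six 1434-byte frames plus one 98-byte runt for the
8306-byte send above; anything bigger than 1434 on the wire means the
card blew past the MSS it was given in the context descriptor.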