From owner-freebsd-current@FreeBSD.ORG Sun Nov 18 23:26:36 2007
Date: Sun, 18 Nov 2007 18:26:23 -0500 (EST)
From: Mike Andrews
To: Jack Vogel
Cc: Denis Shaposhnikov, Kip Macy, Mike Silbersack, Andre Oppermann,
    freebsd-current@freebsd.org
In-Reply-To: <2a41acea0711181140w6707b85p18ac9a483ae367b7@mail.gmail.com>
Message-ID: <20071118181625.Y19404@mindcrime.int.bit0.com>
References: <20071117003504.R31357@mindcrime.int.bit0.com>
    <20071117170537.F59492@mindcrime.int.bit0.com>
    <20071117182232.T59492@mindcrime.int.bit0.com>
    <473F9552.50402@bit0.com>
    <473FBD1A.8010207@bit0.com>
    <20071118030305.N99375@mindcrime.int.bit0.com>
    <2a41acea0711181133n5f63f932m714a4a6b790937c0@mail.gmail.com>
    <2a41acea0711181140w6707b85p18ac9a483ae367b7@mail.gmail.com>
Subject: Re: bizarre em + TSO + MSS issue in RELENG_7

On Sun, 18 Nov 2007, Jack Vogel wrote:

> On Nov 18, 2007 11:33 AM, Jack Vogel wrote:
>>
>> On Nov 18, 2007 12:58 AM, Mike Andrews wrote:
>>>
>>> On Sat, 17 Nov 2007, Mike Andrews wrote:
>>>
>>>> Kip Macy wrote:
>>>>> On Nov 17, 2007 5:28 PM, Mike Andrews wrote:
>>>>>> Kip Macy wrote:
>>>>>>> On Nov 17, 2007 3:23 PM, Mike Andrews wrote:
>>>>>>>> On Sat, 17 Nov 2007, Kip Macy wrote:
>>>>>>>>
>>>>>>>>> On Nov 17, 2007 2:33 PM, Mike Andrews wrote:
>>>>>>>>>> On Sat, 17 Nov 2007, Kip Macy wrote:
>>>>>>>>>>
>>>>>>>>>>> On Nov 17, 2007 10:33 AM, Denis Shaposhnikov wrote:
>>>>>>>>>>>> On Sat, 17 Nov 2007 00:42:54 -0500 (EST)
>>>>>>>>>>>> Mike Andrews wrote:
>>>>>>>>>>>>
>>>>>>>>>>>>> Has anyone run into problems with MSS not being respected when
>>>>>>>>>>>>> using TSO, specifically on em cards?
>>>>>>>>>>>> Yes, I wrote about this problem on the beginning of 2007, see
>>>>>>>>>>>>
>>>>>>>>>>>> http://tinyurl.com/3e5ak5
>>>>>>>>>>>>
>>>>>>>>>>> if_em.c:3502
>>>>>>>>>>>         /*
>>>>>>>>>>>          * Payload size per packet w/o any headers.
>>>>>>>>>>>          * Length of all headers up to payload.
>>>>>>>>>>>          */
>>>>>>>>>>>         TXD->tcp_seg_setup.fields.mss =
>>>>>>>>>>>             htole16(mp->m_pkthdr.tso_segsz);
>>>>>>>>>>>         TXD->tcp_seg_setup.fields.hdr_len = hdr_len;
>>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>>> Please print out the value of tso_segsz here. It appears to be being
>>>>>>>>>>> set correctly. The only thing I can think of is that t_maxopd is not
>>>>>>>>>>> correct. As tso_segsz is correct here:
>>>>>>>>>> It repeatedly prints 1368 during a 1 meg file transfer over a
>>>>>>>>>> connection with a 1380 MSS. Any other printf's I can add?
>>>>>>>>>> I'm working on a web page
>>>>>>>>>> with tcpdump / firewall log output illustrating the issue...
>>>>>>>>> Mike -
>>>>>>>>> Denis' tcpdump output doesn't show oversized segments, something else
>>>>>>>>> appears to be happening there. Can you post your tcpdump output
>>>>>>>>> somewhere?
>>>>>>>> URL sent off-list.
>>>>>>>         if (tso) {
>>>>>>>                 m->m_pkthdr.csum_flags = CSUM_TSO;
>>>>>>>                 m->m_pkthdr.tso_segsz = tp->t_maxopd - optlen;
>>>>>>>         }
>>>>>>>
>>>>>>>
>>>>>>> Please print the value of maxopd and optlen under "if (tso)" in
>>>>>>> tcp_output. I think the calculated optlen may be too small.
>>>>>>
>>>>>> maxopt=1380 - optlen=12 = tso_segsz=1368
>>>>>>
>>>>>> Weird though, after this reboot, I had to re-copy a 4 meg file 5 times
>>>>>> to start getting the firewall to log any drops. Transfer rate was
>>>>>> around 240KB/sec before the firewall started to drop, then it went down
>>>>>> to about 64KB/sec during the 5th copy, and stayed there for subsequent
>>>>>> copies. The actual packet size the firewall said it was dropping was
>>>>>> varying all over the place still, yet the maxopt/optlen/tso_segsz values
>>>>>> stayed constant. But it's interesting that it didn't start dropping
>>>>>> immediately after the reboot -- though the transfer rate was still
>>>>>> sub-optimal.
>>>>>
>>>>> Ok, next theory :D. You shouldn't be seeing "bad len" packets from
>>>>> tcpdump. I'm wondering if that means you're sending down more than
>>>>> 64k. Can you please print out the value of mp->m_pkthdr.len around the
>>>>> same place that you printed out tso_segsz? 64k is the generally
>>>>> accepted limit for TSO, I'm wondering if the card firmware does
>>>>> something weird if you give it more.
>>>>
>>>> OK. In that last message, where I said it took 5 times to start reproducing
>>>> the problem... this time it took until I actually toggled TSO back off and
>>>> back on again, and then it started acting up again. I don't know what the
>>>> actual trigger is...
>>>> it's very weird.
>>>>
>>>> Initially, w/ TSO on and it wasn't dropping yet (but was still transferring
>>>> slow)...
>>>>
>>>> BIT0 DEBUG: tso_segsz=1368 hdr_len=66 mp->m_pkthdr.len=8306
>>>> BIT0 DEBUG: tso_segsz=1368 hdr_len=66 mp->m_pkthdr.len=8306
>>>> BIT0 DEBUG: tso_segsz=1368 hdr_len=66 mp->m_pkthdr.len=8306
>>>> BIT0 DEBUG: tso_segsz=1368 hdr_len=66 mp->m_pkthdr.len=8306
>>>> (etc, always 8306)
>>>>
>>>> After toggling off/on which caused the drops to start (and the speed to drop
>>>> even further):
>>>>
>>>> BIT0 DEBUG: tso_segsz=1368 hdr_len=66 mp->m_pkthdr.len=7507
>>>> BIT0 DEBUG: tso_segsz=1368 hdr_len=66 mp->m_pkthdr.len=3053
>>>> BIT0 DEBUG: tso_segsz=1368 hdr_len=66 mp->m_pkthdr.len=1677
>>>> BIT0 DEBUG: tso_segsz=1368 hdr_len=66 mp->m_pkthdr.len=3037
>>>> BIT0 DEBUG: tso_segsz=1368 hdr_len=66 mp->m_pkthdr.len=2264
>>>> BIT0 DEBUG: tso_segsz=1368 hdr_len=66 mp->m_pkthdr.len=1656
>>>> BIT0 DEBUG: tso_segsz=1368 hdr_len=66 mp->m_pkthdr.len=1902
>>>> BIT0 DEBUG: tso_segsz=1368 hdr_len=66 mp->m_pkthdr.len=1888
>>>> BIT0 DEBUG: tso_segsz=1368 hdr_len=66 mp->m_pkthdr.len=1640
>>>> BIT0 DEBUG: tso_segsz=1368 hdr_len=66 mp->m_pkthdr.len=1871
>>>> BIT0 DEBUG: tso_segsz=1368 hdr_len=66 mp->m_pkthdr.len=2461
>>>> BIT0 DEBUG: tso_segsz=1368 hdr_len=66 mp->m_pkthdr.len=1849
>>>> BIT0 DEBUG: tso_segsz=1368 hdr_len=66 mp->m_pkthdr.len=2092
>>>>
>>>> and so on, with more seemingly random lengths... but none of them ever over
>>>> 8306, much less 64K.
>>>
>>>
>>> Got a few more data points here.
>>>
>>> I can reproduce this on an i386 kernel, so it isn't amd64 specific.
>>>
>>> I can reproduce this on an 82541EI nic, so it isn't 82573 specific.
>>>
>>> I can't reproduce this on a Marvell Yukon II (msk) nic; it works fine
>>> whether TSO is on or off.
>>>
>>> I can't reproduce this on a bge nic because it doesn't support TSO :)
>>> That's the only other gigabit nic I've got easy access to.
>>>
>>> I can reproduce this with just a Cisco 877W IOS-based router and no Cisco
>>> PIX / ASA firewalls in the way, with the servers on the LAN interface with
>>> "ip tcp adjust-mss 1340" on it, and the downloading client on the Cisco's
>>> 802.11G interface. This time, the client is a Macbook Pro running
>>> Leopard, and I'm running "tcpdump -i en1 -s 1500 -n -v length \> 1394" on
>>> the Macbook (not the server this time) to find oversize packets, which is
>>> actually handier because I can see how trashed they really get :)
>>>
>>> I can't reproduce this between two machines on the same subnet (though I
>>> can reproduce throughput problems alone). I haven't tried lowering the
>>> system MSS on one end yet (is there a sysctl to lower the MSS for outbound
>>> connections without lowering the MTU as well?). If I could do this it
>>> would greatly simplify testing for everyone as they wouldn't have to stick
>>> an MSS-clamping router in the middle. It doesn't have to be Cisco.
>>>
>>> With this setup, copying to the Mac through the 877W from:
>>>
>>> msk-based server, TSO disabled: tcpdump reports no problems, file
>>> transfers are fast
>>>
>>> msk-based server, TSO enabled: tcpdump reports no problems, file
>>> transfers are fast
>>>
>>> em-based server, TSO disabled: tcpdump reports no problems, file
>>> transfers are fast
>>>
>>> em-based server, TSO enabled: tcpdump reports numerous oversize packets of
>>> varying sizes just as before, AND numerous packets with bad TCP checksums.
>>> The checksum problems aren't limited to only the large packets though.
>>> (That's probably what's causing the throughput problems.) Toggling rxcsum
>>> and txcsum flags on the server made no difference. What I haven't tried
>>> yet is hexdumping the packets to see what exactly is getting trashed.
>>>
>>> The problem still comes and goes; sometimes it'll work for a few minutes
>>> after boot, sometimes not; it might be dependent on what other traffic's
>>> going through the box.
>>
>> Hmmm, OK so the data is pointing to something in the em TSO or encap
>> code. I will look into this tomorrow. So the necessary elements are systems
>> on two subnets and em doing the transmitting with TSO?

And a sub-1460 MSS on the client end OR the router doing MSS clamping, yes.
I can't yet reproduce it with 1500-byte MTUs or between two machines on the
same subnet. I definitely haven't done any tests with jumbos...

> BTW, not to dodge the problem, but this is a case where I'd say it's absurd
> to be using TSO. Is the link at 1G or 100Mb?

It's reproducible at either speed, but I personally am perfectly happy
leaving TSO disabled on my production boxes -- I've got my workaround, it
performs, I'm cool. At this point I'm pursuing a fix more for others'
benefit because some other people are having at least throughput issues --
and for my own weirdo curiosity.

If a fix doesn't make 7.0-RELEASE (and I almost hate to say this), might it
be worth disabling TSO by default in RELENG_7_0 but turning it back on for
RELENG_7?
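
The figures quoted in this thread can be cross-checked with a little
arithmetic. What follows is a minimal sketch in C, not code from if_em.c or
tcp_output.c. The function names are hypothetical, and the header sizes
assume untagged Ethernet, IPv4 without options, and a 20-byte base TCP
header. It only computes what a correctly working TSO engine would be
expected to put on the wire for a given m_pkthdr.len, hdr_len and tso_segsz;
it says nothing about why the em hardware misbehaves.

```c
/* Assumed header sizes (illustration only): untagged Ethernet,
 * IPv4 without options, TCP without options. */
enum { ETHER_HDR = 14, IPV4_HDR = 20, TCP_HDR = 20 };

/* Number of frames expected from one TSO send.
 * pkt_len: total length handed to the driver (as in m_pkthdr.len)
 * hdr_len: bytes of headers in front of the payload
 * segsz:   payload bytes per frame (as in tso_segsz) */
static unsigned
tso_segment_count(unsigned pkt_len, unsigned hdr_len, unsigned segsz)
{
	unsigned payload;

	if (segsz == 0 || pkt_len <= hdr_len)
		return (0);
	payload = pkt_len - hdr_len;
	return ((payload + segsz - 1) / segsz);	/* round up */
}

/* Largest frame a correct segmentation should ever produce. */
static unsigned
tso_max_frame(unsigned hdr_len, unsigned segsz)
{
	return (hdr_len + segsz);
}

/* Size of the final (possibly short) frame of one TSO send. */
static unsigned
tso_last_frame(unsigned pkt_len, unsigned hdr_len, unsigned segsz)
{
	unsigned rem;

	if (segsz == 0 || pkt_len <= hdr_len)
		return (0);
	rem = (pkt_len - hdr_len) % segsz;
	return (hdr_len + (rem != 0 ? rem : segsz));
}

/* Largest Ethernet frame that still honours a given MSS. TCP options
 * eat into the MSS, so they do not change this upper bound. */
static unsigned
max_frame_for_mss(unsigned mss)
{
	return (ETHER_HDR + IPV4_HDR + TCP_HDR + mss);
}
```

With the values from the debug output (m_pkthdr.len=8306, hdr_len=66,
tso_segsz=1368), tso_segment_count() gives 7 frames, tso_max_frame() gives
1434 bytes and tso_last_frame() gives 98 bytes. The hdr_len of 66 is
consistent with 14 + 20 + 20 plus the 12 bytes of TCP options reported as
optlen. For the router test, max_frame_for_mss(1340) gives 1394, which
matches the "length \> 1394" threshold used in the tcpdump filter.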