From owner-freebsd-net@FreeBSD.ORG Sat Mar 13 05:57:47 2010
Return-Path:
Delivered-To: freebsd-net@freebsd.org
Received: from mx1.freebsd.org (mx1.freebsd.org [IPv6:2001:4f8:fff6::34])
    by hub.freebsd.org (Postfix) with ESMTP id 37898106564A;
    Sat, 13 Mar 2010 05:57:47 +0000 (UTC)
    (envelope-from steven@uplinklabs.net)
Received: from mail-iw0-f185.google.com (mail-iw0-f185.google.com [209.85.223.185])
    by mx1.freebsd.org (Postfix) with ESMTP id F404A8FC0C;
    Sat, 13 Mar 2010 05:57:46 +0000 (UTC)
Received: by iwn15 with SMTP id 15so1823438iwn.7
    for ; Fri, 12 Mar 2010 21:57:46 -0800 (PST)
MIME-Version: 1.0
Received: by 10.231.173.130 with SMTP id p2mr662683ibz.48.1268459865793;
    Fri, 12 Mar 2010 21:57:45 -0800 (PST)
In-Reply-To:
References: <201003121754.o2CHsH7V065932@freefall.freebsd.org>
Date: Fri, 12 Mar 2010 21:57:45 -0800
Message-ID:
From: Steven Noonan
To: yongari@freebsd.org
Content-Type: text/plain; charset=UTF-8
Content-Transfer-Encoding: quoted-printable
Cc: freebsd-net@freebsd.org
Subject: Re: kern/144689: [re] TCP transfer corruption using if_re
X-BeenThere: freebsd-net@freebsd.org
X-Mailman-Version: 2.1.5
Precedence: list
List-Id: Networking and TCP/IP with FreeBSD
List-Unsubscribe: ,
List-Archive:
List-Post:
List-Help:
List-Subscribe: ,
X-List-Received-Date: Sat, 13 Mar 2010 05:57:47 -0000

On Fri, Mar 12, 2010 at 4:24 PM, Steven Noonan wrote:
> On Fri, Mar 12, 2010 at 4:19 PM, Steven Noonan wrote:
>> On Fri, Mar 12, 2010 at 9:54 AM, wrote:
>>> Synopsis: [re] TCP transfer corruption using if_re
>>>
>>> State-Changed-From-To: open->feedback
>>> State-Changed-By: yongari
>>> State-Changed-When: Fri Mar 12 17:53:37 UTC 2010
>>> State-Changed-Why:
>>> This looks like Rx checksum offloading issue. Would you try
>>> disabling Rx checksum offloading and test it again?
>>> #ifconfig re0 -rxcsum
>>> Also show me dmesg output(re(4) related part).
>>>
>>>
>>> Responsible-Changed-From-To: freebsd-net->yongari
>>> Responsible-Changed-By: yongari
>>> Responsible-Changed-When: Fri Mar 12 17:53:37 UTC 2010
>>> Responsible-Changed-Why:
>>> Mine.
>>>
>>> http://www.freebsd.org/cgi/query-pr.cgi?pr=144689
>>>
>>
>> Hmm. Disabling Rx checksum offloading helped for one clone process,
>> but then this showed up in dmesg during my second test (it seems to be
>> doing this regularly for some reason):
>> re0: link state changed to DOWN
>> re0: link state changed to UP
>>
>> And no, the cable isn't loose or something. It just decides to take
>> the interface down and put it back up.
>>
>> Here's the rest of 'dmesg | grep re0':
>>
>> firewire0: on fwohci0
>> dcons_crom0: on firewire0
>> fwe0: on firewire0
>> fwip0: on firewire0
>> firewire0: 1 nodes, maxhop <= 0 cable IRM irm(0) (me)
>> firewire0: bus manager 0
>> re0:
>> port 0x1200-0x12ff mem 0x88000000-0x880001ff irq 18 at device 0.0 on
>> cardbus0
>> re0: Chip rev. 0x10000000
>> re0: MAC rev. 0x00000000
>> miibus1: on re0
>> re0: Ethernet address: 00:18:4d:6e:c0:29
>> re0: [FILTER]
>> re0: link state changed to UP
>> re0: link state changed to DOWN
>> re0: link state changed to UP
>> re0: link state changed to DOWN
>> re0: link state changed to UP
>> re0: PHY read failed
>> re0: link state changed to DOWN
>> re0: link state changed to UP
>> re0: link state changed to DOWN
>> re0: link state changed to UP
>> re0: link state changed to DOWN
>> re0: link state changed to UP
>> re0: PHY read failed
>> re0: PHY read failed
>> re0: PHY read failed
>> re0: link state changed to DOWN
>> re0: link state changed to UP
>> re0: PHY read failed
>> re0: PHY read failed
>> re0: PHY read failed
>> re0: link state changed to DOWN
>> re0: link state changed to UP
>> re0: PHY read failed
>> re0: PHY read failed
>> re0: link state changed to DOWN
>> re0: link state changed to UP
>> re0: PHY read failed
>> re0: PHY read failed
>> re0: PHY read failed
>> re0: link state changed to DOWN
>> re0: link state changed to UP
>> re0: PHY read failed
>> re0: PHY read failed
>> re0: PHY read failed
>> re0: link state changed to DOWN
>> re0: link state changed to UP
>> re0: detached
>> re0:
>> port 0x1200-0x12ff mem 0x88000000-0x880001ff irq 18 at device 0.0 on
>> cardbus0
>> re0: Chip rev. 0x10000000
>> re0: MAC rev. 0x00000000
>> miibus1: on re0
>> re0: Ethernet address: 00:18:4d:6e:c0:29
>> re0: [FILTER]
>> re0: link state changed to DOWN
>> re0: link state changed to UP
>> re0: link state changed to DOWN
>> re0: link state changed to UP
>> re0: link state changed to DOWN
>> re0: link state changed to UP
>> re0: link state changed to DOWN
>> re0: link state changed to UP
>> re0: link state changed to DOWN
>> re0: link state changed to UP
>> re0: link state changed to DOWN
>> re0: link state changed to UP
>> re0: PHY read failed
>> re0: PHY read failed
>> re0: link state changed to DOWN
>> re0: link state changed to UP
>> re0: PHY read failed
>> re0: PHY read failed
>> re0: PHY read failed
>> re0: PHY read failed
>> re0: link state changed to DOWN
>> re0: link state changed to UP
>> re0: PHY read failed
>> re0: link state changed to DOWN
>> re0: link state changed to UP
>> re0: PHY read failed
>> re0: PHY read failed
>> re0: PHY read failed
>> re0: PHY read failed
>> re0: link state changed to DOWN
>> re0: link state changed to UP
>> re0: PHY read failed
>> re0: PHY read failed
>> re0: PHY read failed
>> re0: PHY read failed
>> re0: link state changed to DOWN
>> re0: link state changed to UP
>> re0: PHY read failed
>>
>> - Steven
>>
>
> I should note that the connection was _lost_ during the second test above.
>
> I also tested again, and it looks like it added another "re0: PHY read
> failed" before silently dropping the connection.
>
> - Steven
>

I did a couple captures with Wireshark on the client end. One is with
rxcsum enabled on the machine running git-daemon, one is without
rxcsum.

http://www.uplinklabs.net/~tycho/files/git-cap-norxcsum.bz2
http://www.uplinklabs.net/~tycho/files/git-cap.bz2

Obviously, you can look at the data yourself and make more sense of
it, but here are things I noticed in the captures:

With rxcsum:
- There are some silent problems that occur in the middle of the capture.
  Client-to-server: 'TCP ACKed lost segment' a few times, then 'TCP
  previous segment lost'. This happens multiple times during the
  capture (before 'git-upload-pack' starts sending data).
- Occasional 'TCP window update's. These are highlighted in black for
  reasons unknown to me. It seems like this would be normal.
- The server calls 'git-upload-pack' and we start seeing a large
  number of client-to-server TCP RST flags being sent, and then the
  connection gets closed due to some detected data corruption in the
  transfer.

Without rxcsum:
- About the same number of client-to-server 'TCP ACKed lost segment's.
- 'git-upload-pack' kicks in and things get _really_ hairy. 'TCP Dup
  ACK' is detected by the client many, many times.
- Finally, a series of 'TCP retransmission's from server to client
  happens (which is where the connection hangs).
- I closed the connection, which triggered the final two 'TCP RST's.

Also, I forgot to note in my original report that I checked for
packet loss using a ping flood, and one packet out of the 1.5
million sent was lost. But I'm not sure whether ping checksums the
data of these packets, so they could be coming back with perfectly
valid ICMP headers but corrupted data.

- Steven
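
On the ping question: per RFC 792, the ICMP checksum is computed over
the whole ICMP message (header plus data), using the one's-complement
sum described in RFC 1071. The following is only a minimal sketch of
that arithmetic, not code taken from ping(8) or from the re(4) driver;
the helper names (internet_checksum, build_echo_request, is_valid) and
the sample payload are made up for illustration.

```python
import struct


def internet_checksum(data: bytes) -> int:
    """RFC 1071 one's-complement checksum over 16-bit big-endian words."""
    if len(data) % 2:
        data += b"\x00"  # pad odd-length input with a trailing zero byte
    total = 0
    for (word,) in struct.iter_unpack("!H", data):
        total += word
    # Fold any carries above bit 15 back into the low 16 bits.
    while total >> 16:
        total = (total & 0xFFFF) + (total >> 16)
    return ~total & 0xFFFF


def build_echo_request(ident: int, seq: int, payload: bytes) -> bytes:
    """Hypothetical helper: ICMP echo request (type 8, code 0)."""
    # The checksum field is zero while the checksum is being computed,
    # and the sum covers the header AND the payload.
    header = struct.pack("!BBHHH", 8, 0, 0, ident, seq)
    csum = internet_checksum(header + payload)
    return struct.pack("!BBHHH", 8, 0, csum, ident, seq) + payload


def is_valid(message: bytes) -> bool:
    """A message carrying a correct checksum sums to zero."""
    return internet_checksum(message) == 0


packet = build_echo_request(0x1234, 1, b"abcdefgh" * 7)

# Flip a single bit in the payload (offset 20 is past the 8-byte header).
corrupted = bytearray(packet)
corrupted[20] ^= 0x01

print(is_valid(packet))            # intact message
print(is_valid(bytes(corrupted)))  # payload damaged in transit
```

If the replies were verified this way, a single flipped payload bit
would fail the check rather than pass as a good reply. The 16-bit sum
is weak, though: it cannot see reordered 16-bit words or errors that
cancel each other out, so a clean ping flood does not rule out
corruption on the TCP path.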