From owner-freebsd-net@FreeBSD.ORG Sat Mar 13 12:18:56 2010
From: Steven Noonan
To: yongari@freebsd.org
Cc: freebsd-net@freebsd.org
Date: Sat, 13 Mar 2010 04:18:30 -0800
Subject: Re: kern/144689: [re] TCP transfer corruption using if_re
References: <201003121754.o2CHsH7V065932@freefall.freebsd.org>
List-Id: Networking and TCP/IP with FreeBSD

On Fri, Mar 12, 2010 at 9:57 PM, Steven Noonan wrote:
> On Fri, Mar 12, 2010 at 4:24 PM, Steven Noonan wrote:
>> On Fri, Mar 12, 2010 at 4:19 PM, Steven Noonan wrote:
>>> On Fri, Mar 12, 2010 at 9:54 AM, wrote:
>>>> Synopsis: [re] TCP transfer corruption using if_re
>>>>
>>>> State-Changed-From-To: open->feedback
>>>> State-Changed-By: yongari
>>>> State-Changed-When: Fri Mar 12 17:53:37 UTC 2010
>>>> State-Changed-Why:
>>>> This looks like Rx checksum offloading issue. Would you try
>>>> disabling Rx checksum offloading and test it again?
>>>> #ifconfig re0 -rxcsum
>>>> Also show me dmesg output(re(4) related part).
>>>>
>>>>
>>>> Responsible-Changed-From-To: freebsd-net->yongari
>>>> Responsible-Changed-By: yongari
>>>> Responsible-Changed-When: Fri Mar 12 17:53:37 UTC 2010
>>>> Responsible-Changed-Why:
>>>> Mine.
>>>>
>>>> http://www.freebsd.org/cgi/query-pr.cgi?pr=144689
>>>>
>>>
>>> Hmm. Disabling Rx checksum offloading helped for one clone process,
>>> but then this showed up in dmesg during my second test (it seems to be
>>> doing this regularly for some reason):
>>> re0: link state changed to DOWN
>>> re0: link state changed to UP
>>>
>>> And no, the cable isn't loose or something. It just decides to take
>>> the interface down and put it back up.
>>>
>>> Here's the rest of 'dmesg | grep re0':
>>>
>>> firewire0: on fwohci0
>>> dcons_crom0: on firewire0
>>> fwe0: on firewire0
>>> fwip0: on firewire0
>>> firewire0: 1 nodes, maxhop <= 0 cable IRM irm(0)  (me)
>>> firewire0: bus manager 0
>>> re0:
>>> port 0x1200-0x12ff mem 0x88000000-0x880001ff irq 18 at device 0.0 on
>>> cardbus0
>>> re0: Chip rev. 0x10000000
>>> re0: MAC rev. 0x00000000
>>> miibus1: on re0
>>> re0: Ethernet address: 00:18:4d:6e:c0:29
>>> re0: [FILTER]
>>> re0: link state changed to UP
>>> re0: link state changed to DOWN
>>> re0: link state changed to UP
>>> re0: link state changed to DOWN
>>> re0: link state changed to UP
>>> re0: PHY read failed
>>> re0: link state changed to DOWN
>>> re0: link state changed to UP
>>> re0: link state changed to DOWN
>>> re0: link state changed to UP
>>> re0: link state changed to DOWN
>>> re0: link state changed to UP
>>> re0: PHY read failed
>>> re0: PHY read failed
>>> re0: PHY read failed
>>> re0: link state changed to DOWN
>>> re0: link state changed to UP
>>> re0: PHY read failed
>>> re0: PHY read failed
>>> re0: PHY read failed
>>> re0: link state changed to DOWN
>>> re0: link state changed to UP
>>> re0: PHY read failed
>>> re0: PHY read failed
>>> re0: link state changed to DOWN
>>> re0: link state changed to UP
>>> re0: PHY read failed
>>> re0: PHY read failed
>>> re0: PHY read failed
>>> re0: link state changed to DOWN
>>> re0: link state changed to UP
>>> re0: PHY read failed
>>> re0: PHY read failed
>>> re0: PHY read failed
>>> re0: link state changed to DOWN
>>> re0: link state changed to UP
>>> re0: detached
>>> re0:
>>> port 0x1200-0x12ff mem 0x88000000-0x880001ff irq 18 at device 0.0 on
>>> cardbus0
>>> re0: Chip rev. 0x10000000
>>> re0: MAC rev. 0x00000000
>>> miibus1: on re0
>>> re0: Ethernet address: 00:18:4d:6e:c0:29
>>> re0: [FILTER]
>>> re0: link state changed to DOWN
>>> re0: link state changed to UP
>>> re0: link state changed to DOWN
>>> re0: link state changed to UP
>>> re0: link state changed to DOWN
>>> re0: link state changed to UP
>>> re0: link state changed to DOWN
>>> re0: link state changed to UP
>>> re0: link state changed to DOWN
>>> re0: link state changed to UP
>>> re0: link state changed to DOWN
>>> re0: link state changed to UP
>>> re0: PHY read failed
>>> re0: PHY read failed
>>> re0: link state changed to DOWN
>>> re0: link state changed to UP
>>> re0: PHY read failed
>>> re0: PHY read failed
>>> re0: PHY read failed
>>> re0: PHY read failed
>>> re0: link state changed to DOWN
>>> re0: link state changed to UP
>>> re0: PHY read failed
>>> re0: link state changed to DOWN
>>> re0: link state changed to UP
>>> re0: PHY read failed
>>> re0: PHY read failed
>>> re0: PHY read failed
>>> re0: PHY read failed
>>> re0: link state changed to DOWN
>>> re0: link state changed to UP
>>> re0: PHY read failed
>>> re0: PHY read failed
>>> re0: PHY read failed
>>> re0: PHY read failed
>>> re0: link state changed to DOWN
>>> re0: link state changed to UP
>>> re0: PHY read failed
>>>
>>> - Steven
>>>
>>
>> I should note that the connection was _lost_ during the second test above.
>>
>> I also tested again, and it looks like it added another "re0: PHY read
>> failed" before silently dropping the connection.
>>
>> - Steven
>>
>
> I did a couple captures with Wireshark on the client end. One is with
> rxcsum enabled on the machine running git-daemon, one is without
> rxcsum.
>
> http://www.uplinklabs.net/~tycho/files/git-cap-norxcsum.bz2
> http://www.uplinklabs.net/~tycho/files/git-cap.bz2
>
> Obviously, you can look at the data yourself and make more sense of
> it, but here are things I noticed in the captures:
>
> With rxcsum:
> - There are some silent problems that occur in the middle of the
> capture. Client-to-server: 'TCP ACKed lost segment' a few times, then
> 'TCP previous segment lost'. This happens multiple times during the
> capture (before 'git-upload-pack' starts sending data).
> - Occasional 'TCP window update's. These are highlighted in black for
> reasons unknown to me. It seems like this would be normal.
> - The server calls 'git-upload-pack' and we start seeing a large
> number of client-to-server TCP RST flags being sent and then the
> connection gets closed due to some detected data corruption in the
> transfer.
>
> Without rxcsum:
> - About the same amount of client-to-server 'TCP ACKed lost segment's.
> - 'git-upload-pack' kicks in and things get _really_ hairy. 'TCP Dup
> ACK' detected by the client many many times.
> - Finally, a series of 'TCP retransmission's from server to client
> happen (which is where the connection hangs).
> - I closed the connection which triggered the final two 'TCP RST's.
>
> Also, I forgot to note in my original report that I checked if there
> was packet loss by using a ping flood, and one packet in the 1.5
> million packets sent was lost. But I'm not sure whether it's
> checksumming the data of these packets, so they could be coming back
> with perfectly valid ICMP headers but corrupted data.
>

Also, hilariously horrible hack:

- On the server machine, start git-daemon listening on 127.0.0.1:9418.
- On the server machine, run 'ssh -L :9418:127.0.0.1:9418 user@localhost'.

Then remote git clones work as expected. Very strange. It will have to do
until I get a less insane solution. I don't understand why it makes a
difference. Is git-daemon using TCP socket options that cause this
network interface driver to malfunction?

- Steven
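
A note on the ping-flood question quoted above: per RFC 792, the ICMP
checksum is computed over the whole ICMP message (header plus data), so an
echo reply whose payload was corrupted in transit should normally fail
verification and show up as a lost packet, not as a "valid" one. Whether
the stack or the NIC actually verified the checksum in this setup is not
something the thread establishes. Below is a minimal Python sketch of the
RFC 1071 checksum to illustrate the point. It is not code from this
thread, and the function names (internet_checksum, build_echo_request,
verify) are made up for illustration.

```python
import struct


def internet_checksum(data: bytes) -> int:
    """RFC 1071 checksum: one's-complement of the one's-complement sum
    of the data taken as big-endian 16-bit words."""
    if len(data) % 2:
        data += b"\x00"  # pad odd-length input with a zero byte
    total = 0
    for (word,) in struct.iter_unpack("!H", data):
        total += word
    # Fold any carries back into the low 16 bits.
    while total >> 16:
        total = (total & 0xFFFF) + (total >> 16)
    return ~total & 0xFFFF


def build_echo_request(ident: int, seq: int, payload: bytes) -> bytes:
    """Build an ICMP echo request (type 8, code 0). The checksum is
    computed over header AND payload, with the checksum field zeroed."""
    header = struct.pack("!BBHHH", 8, 0, 0, ident, seq)
    csum = internet_checksum(header + payload)
    return struct.pack("!BBHHH", 8, 0, csum, ident, seq) + payload


def verify(message: bytes) -> bool:
    """A message with a correct checksum field sums to zero."""
    return internet_checksum(message) == 0


if __name__ == "__main__":
    pkt = build_echo_request(0x1234, 1, b"abcdefgh")
    print(verify(pkt))  # True

    # Flip one bit in the payload only; the header is untouched.
    damaged = bytearray(pkt)
    damaged[-1] ^= 0x01
    print(verify(bytes(damaged)))  # False
```

So a single flipped bit in the payload is enough to make verification
fail, even though the header bytes are intact. The checksum is weak,
though: it cannot detect, for example, two 16-bit words being swapped, so
a clean ping flood does not by itself rule out data corruption.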