Date: Tue, 16 Mar 2010 11:23:22 -0700
From: Pyun YongHyeon <pyunyh@gmail.com>
To: Steven Noonan <steven@uplinklabs.net>
Cc: freebsd-net@freebsd.org, bug-followup@FreeBSD.org, yongari@freebsd.org
Subject: Re: kern/144689: [re] TCP transfer corruption using if_re
Message-ID: <20100316182322.GF2001@michelle.cdnetworks.com>
In-Reply-To: <f488382f1003130418s116e9c1frfd210db4127b4a9@mail.gmail.com>
References: <201003121754.o2CHsH7V065932@freefall.freebsd.org> <f488382f1003121619y17780ed9x52765b9a9133fb2@mail.gmail.com> <f488382f1003121624j34a8aee8kc127e82c08c3fe37@mail.gmail.com> <f488382f1003122157i12968043h31c8020007f7e8a1@mail.gmail.com> <f488382f1003130418s116e9c1frfd210db4127b4a9@mail.gmail.com>
On Sat, Mar 13, 2010 at 04:18:30AM -0800, Steven Noonan wrote:
> On Fri, Mar 12, 2010 at 9:57 PM, Steven Noonan <steven@uplinklabs.net> wrote:
> > On Fri, Mar 12, 2010 at 4:24 PM, Steven Noonan <steven@uplinklabs.net> wrote:
> >> On Fri, Mar 12, 2010 at 4:19 PM, Steven Noonan <steven@uplinklabs.net> wrote:
> >>> On Fri, Mar 12, 2010 at 9:54 AM, ??<yongari@freebsd.org> wrote:
> >>>> Synopsis: [re] TCP transfer corruption using if_re
> >>>>
> >>>> State-Changed-From-To: open->feedback
> >>>> State-Changed-By: yongari
> >>>> State-Changed-When: Fri Mar 12 17:53:37 UTC 2010
> >>>> State-Changed-Why:
> >>>> This looks like Rx checksum offloading issue. Would you try
> >>>> disabling Rx checksum offloading and test it again?
> >>>> #ifconfig re0 -rxcsum
> >>>> Also show me dmesg output(re(4) related part).
> >>>>
> >>>>
> >>>> Responsible-Changed-From-To: freebsd-net->yongari
> >>>> Responsible-Changed-By: yongari
> >>>> Responsible-Changed-When: Fri Mar 12 17:53:37 UTC 2010
> >>>> Responsible-Changed-Why:
> >>>> Mine.
> >>>>
> >>>> http://www.freebsd.org/cgi/query-pr.cgi?pr=144689
> >>>>
> >>>
> >>> Hmm. Disabling Rx checksum offloading helped for one clone process,
> >>> but then this showed up in dmesg during my second test (it seems to be
> >>> doing this regularly for some reason):
> >>> re0: link state changed to DOWN
> >>> re0: link state changed to UP
> >>>
> >>> And no, the cable isn't loose or something. It just decides to take
> >>> the interface down and put it back up.
> >>>
> >>> Here's the rest of 'dmesg | grep re0':
> >>>
> >>> firewire0: <IEEE1394(FireWire) bus> on fwohci0
> >>> dcons_crom0: <dcons configuration ROM> on firewire0
> >>> fwe0: <Ethernet over FireWire> on firewire0
> >>> fwip0: <IP over FireWire> on firewire0
> >>> firewire0: 1 nodes, maxhop <= 0 cable IRM irm(0) ??(me)
> >>> firewire0: bus manager 0
> >>> re0: <RealTek 8169/8169S/8169SB(L)/8110S/8110SB(L) Gigabit Ethernet>
> >>> port 0x1200-0x12ff mem 0x88000000-0x880001ff irq 18 at device 0.0 on
> >>> cardbus0
> >>> re0: Chip rev. 0x10000000
> >>> re0: MAC rev. 0x00000000
> >>> miibus1: <MII bus> on re0
> >>> re0: Ethernet address: 00:18:4d:6e:c0:29
> >>> re0: [FILTER]
> >>> re0: link state changed to UP
> >>> re0: link state changed to DOWN
> >>> re0: link state changed to UP
> >>> re0: link state changed to DOWN
> >>> re0: link state changed to UP
> >>> re0: PHY read failed
> >>> re0: link state changed to DOWN
> >>> re0: link state changed to UP
> >>> re0: link state changed to DOWN
> >>> re0: link state changed to UP
> >>> re0: link state changed to DOWN
> >>> re0: link state changed to UP
> >>> re0: PHY read failed
> >>> re0: PHY read failed
> >>> re0: PHY read failed
> >>> re0: link state changed to DOWN
> >>> re0: link state changed to UP
> >>> re0: PHY read failed
> >>> re0: PHY read failed
> >>> re0: PHY read failed
> >>> re0: link state changed to DOWN
> >>> re0: link state changed to UP
> >>> re0: PHY read failed
> >>> re0: PHY read failed
> >>> re0: link state changed to DOWN
> >>> re0: link state changed to UP
> >>> re0: PHY read failed
> >>> re0: PHY read failed
> >>> re0: PHY read failed
> >>> re0: link state changed to DOWN
> >>> re0: link state changed to UP
> >>> re0: PHY read failed
> >>> re0: PHY read failed
> >>> re0: PHY read failed
> >>> re0: link state changed to DOWN
> >>> re0: link state changed to UP
> >>> re0: detached
> >>> re0: <RealTek 8169/8169S/8169SB(L)/8110S/8110SB(L) Gigabit Ethernet>
> >>> port 0x1200-0x12ff mem 0x88000000-0x880001ff irq 18 at device 0.0 on
> >>> cardbus0
> >>> re0: Chip rev. 0x10000000
> >>> re0: MAC rev. 0x00000000
> >>> miibus1: <MII bus> on re0
> >>> re0: Ethernet address: 00:18:4d:6e:c0:29
> >>> re0: [FILTER]
> >>> re0: link state changed to DOWN
> >>> re0: link state changed to UP
> >>> re0: link state changed to DOWN
> >>> re0: link state changed to UP
> >>> re0: link state changed to DOWN
> >>> re0: link state changed to UP
> >>> re0: link state changed to DOWN
> >>> re0: link state changed to UP
> >>> re0: link state changed to DOWN
> >>> re0: link state changed to UP
> >>> re0: link state changed to DOWN
> >>> re0: link state changed to UP
> >>> re0: PHY read failed
> >>> re0: PHY read failed
> >>> re0: link state changed to DOWN
> >>> re0: link state changed to UP
> >>> re0: PHY read failed
> >>> re0: PHY read failed
> >>> re0: PHY read failed
> >>> re0: PHY read failed
> >>> re0: link state changed to DOWN
> >>> re0: link state changed to UP
> >>> re0: PHY read failed
> >>> re0: link state changed to DOWN
> >>> re0: link state changed to UP
> >>> re0: PHY read failed
> >>> re0: PHY read failed
> >>> re0: PHY read failed
> >>> re0: PHY read failed
> >>> re0: link state changed to DOWN
> >>> re0: link state changed to UP
> >>> re0: PHY read failed
> >>> re0: PHY read failed
> >>> re0: PHY read failed
> >>> re0: PHY read failed
> >>> re0: link state changed to DOWN
> >>> re0: link state changed to UP
> >>> re0: PHY read failed
> >>>
> >>> - Steven
> >>>
> >>
> >> I should note that the connection was _lost_ during the second test above.
> >>
> >> I also tested again, and it looks like it added another "re0: PHY read
> >> failed" before silently dropping the connection.
> >>
> >> - Steven
> >>
> >
> > I did a couple captures with Wireshark on the client end. One is with
> > rxcsum enabled on the machine running git-daemon, one is without
> > rxcsum.
> >
> > http://www.uplinklabs.net/~tycho/files/git-cap-norxcsum.bz2
> > http://www.uplinklabs.net/~tycho/files/git-cap.bz2
> >
> > Obviously, you can look at the data yourself and make more sense of
> > it, but here are things I noticed in the captures:
> >
> > With rxcsum:
> > - There are some silent problems that occur in the middle of the
> > capture. Client-to-server: 'TCP ACKed lost segment' a few times, then
> > 'TCP previous segment lost'. This happens multiple times during the
> > capture (before 'git-upload-pack' starts sending data).
> > - Occasional 'TCP window update's. These are highlighted in black for
> > reasons unknown to me. It seems like this would be normal.
> > - The server calls 'git-upload-pack' and we start seeing a large
> > number of client-to-server TCP RST flags being sent and then the
> > connection gets closed due to some detected data corruption in the
> > transfer.
> >
> > Without rxcsum:
> > - About the same amount of client-to-server 'TCP ACKed lost segment's.
> > - 'git-upload-pack' kicks in and things get _really_ hairy. 'TCP Dup
> > ACK' detected by the client many many times.
> > - Finally, a series of 'TCP retransmission's from server to client
> > happen (which is where the connection hangs).
> > - I closed the connection which triggered the final two 'TCP RST's.
> >
> > Also, I forgot to note in my original report that I checked if there
> > was packet loss by using a ping flood, and one packet in the 1.5
> > million packets sent was lost. But I'm not sure whether it's
> > checksumming the data of these packets, so they could be coming back
> > with perfectly valid ICMP headers but corrupted data.
> >
>
> Also, hilariously horrible hack:
>
> - On the server machine, start git-daemon listening on 127.0.0.1:9418.
> - On the server machine, run 'ssh -L <public IP>:9418:127.0.0.1:9418
> user@localhost'.
>
> Then remote git clones work as expected. Very strange. It will have to
> do until I get a less insane solution.
>

The real issue looks like a PHY read failure, which can result in
unexpected behavior. I don't see any rgephy(4)-related message here;
would you show me the output of "devinfo -rv | grep phy"?
By any chance, are you using a PCMCIA Ethernet controller?

> I don't understand why it makes a difference. Is git-daemon using TCP
> socket options that causes this network interface driver to
> malfunction?
>

No, I don't think so. It would be a bug in the driver.

> - Steven