Date: Fri, 15 Oct 2010 13:25:08 +0100
From: Melissa Jenkins <melissa-freebsd@littlebluecar.co.uk>
To: pyunyh@gmail.com
Cc: freebsd-net@freebsd.org
Subject: Re: NFE adapter 'hangs'
Message-ID: <9BBD5E0C-06D3-4FA5-B85C-5256DA3AD483@littlebluecar.co.uk>
In-Reply-To: <20100904005349.GP21940@michelle.cdnetworks.com>
References: <5C261F16-6530-47EE-B1C1-BA38CD6D8B01@littlebluecar.co.uk> <20100902194940.GH21940@michelle.cdnetworks.com> <C536F7C7-A1EA-4BDA-8F5F-E6A5919F6D9A@littlebluecar.co.uk> <20100904005349.GP21940@michelle.cdnetworks.com>

On 4 Sep 2010, at 01:53, Pyun YongHyeon wrote:

> On Fri, Sep 03, 2010 at 07:59:26AM +0100, Melissa Jenkins wrote:
>>
>> Thank you for your very quick response :)
>>
>
> [...]
>
>>> Also I'd like to know whether both RX and TX are dead or only one
>>> RX/TX path is hung. Can you see incoming traffic with tcpdump when
>>> you think the controller is in stuck?
>>
>> Yes, though not very much. The traffic to 4800 is every second so you can see in the following trace when it stops
>>
>> 07:10:42.287163 IP 192.168.1.203 > 224.0.0.240: pfsync 108
>> 07:10:42.911995
>> 07:10:43.112073 STP 802.1d, Config, Flags [Topology change], bridge-id 8000.c4:7d:4f:a9:ac:30.8008, length 43
>> 07:10:43.148659 IP 192.168.1.203.57026 > 192.168.1.255.4800: UDP, length 60
>> 07:10:43.148684 IP 172.31.1.203 > 172.31.1.129: GREv0, length 92: IP 192.168.1.203.57026 > 192.168.1.129.4800: UDP, length 60
>> 07:10:43.148689 IP 172.31.1.203 > 172.31.1.129: GREv0, length 92: IP 192.168.1.203.57026 > 192.168.1.1.4800: UDP, length 60
>> 07:10:43.148918 IP 192.168.1.213.40677 > 192.168.1.255.4800: UDP, length 48
>
> [...]
>
>> a bit later on, still broken, a slight odd message:
>> 07:11:43.079720 IP 172.31.1.129 > 172.31.1.213: GREv0, length 52: IP 192.168.1.129.60446 > 192.168.1.213.179: tcp 12 [bad hdr length 16 - too short, < 20]
>> 07:11:44.210794 IP 172.31.1.129 > 172.31.1.203: GREv0, length 84: IP 192.168.1.129.64744 > 192.168.1.203.4800: UDP, length 52
>> 07:11:44.210831 IP 172.31.1.129 > 172.31.1.213: GREv0, length 84: IP 192.168.1.129.64744 > 192.168.1.213.4800: UDP, length 52
>>
>> Now this really is odd, I don't recognise either of those MAC addresses, though the SQL shown is used on this machine (
>> 07:12:13.054393 45:43:54:20:41:63 > 00:00:03:53:45:4c, ethertype Unknown (0x6374), length 60:
>> 0x0000: 556e 6971 7565 4964 2046 524f 4d20 7261 UniqueId.FROM.ra
>> 0x0010: 6461 6363 7420 2057 4845 5245 2043 616c dacct..WHERE.Cal
>> 0x0020: 6c69 6e67 5374 6174 696f 6e49 6420 lingStationId.
>
> Hmm, it seems you're using really complex setup. It's very hard to
> narrow down guilty ones under these environments. Could you setup
> simple network configuration that reproduces the issue? One of
> possible cause would be wrong(garbled) data might be passed up to
> upper stack. But I have no idea why you see GRE packets with
> truncated TCP header(172.31.1.129 > 172.31.1.213).
> How about disabling TX/RX checksum offloading as well as TSO?
>
> [...]
>
>>
>> I then restarted the interface (nfe down/up, route restart)
>>
>> From dmesg at the time (slight obfuscated)
>> Sep 3 07:10:19 manch2 bgpd[89612]: neighbor XX: received notification: HoldTimer expired, unknown subcode 0
>> Sep 3 07:10:49 manch2 bgpd[89612]: neighbor XX connect: Host is down
>> # at this point I took the interface down & up and reloaded the routing tables
>> Sep 3 07:12:07 manch2 kernel: carp0: link state changed to DOWN
>> Sep 3 07:12:07 manch2 kernel: carp0: link state changed to DOWN
>> Sep 3 07:12:07 manch2 kernel: nfe0: link state changed to DOWN
>> Sep 3 07:12:07 manch2 kernel: carp0: link state changed to DOWN
>> Sep 3 07:12:11 manch2 kernel: nfe0: link state changed to UP
>> Sep 3 07:12:11 manch2 kernel: carp0: link state changed to DOWN
>> Sep 3 07:12:14 manch2 kernel: carp0: link state changed to UP
>
> Hmm, it does not look right, carp0 showed link DOWN message four
> times in a row.
> By the way, are you using IPMI on MCP55? nfe(4) is not ready to
> handle MAC operation with IPMI.

Turning off TX & RX checksum offloading seems to have resolved the problem:

ifconfig nfe0 -txcsum -rxcsum

This seems to have stopped both the corruption and the interface hanging. I ran it for about 16 hours on the FreeBSD 8 box. It also appears to have fixed the problem on my FreeBSD 7 machine.

I didn't try turning off TSO.

Thank you for your suggestion & help!

Mel
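
One way to confirm that the ifconfig change above actually took effect is to look at the options= line that ifconfig prints for the interface. The following is a minimal sketch only: the helper name offload_enabled is hypothetical, and it assumes the usual FreeBSD output format in which enabled capabilities appear inside angle brackets after options=.

```shell
# Hypothetical helper (not from the thread): succeeds if the given
# ifconfig output still lists RX or TX checksum offload as enabled.
# Assumes a line of the form: options=<hex><CAP1,CAP2,...>
offload_enabled() {
    # $1: text captured from `ifconfig <interface>`
    printf '%s\n' "$1" | grep -Eq 'options=[^<]*<[^>]*(RXCSUM|TXCSUM)'
}
```

It could be used as `offload_enabled "$(ifconfig nfe0)" && echo "checksum offload still on"`. Note that the ifconfig command itself does not persist across reboots; the same flags would normally also be added to the interface's ifconfig_nfe0 line in /etc/rc.conf.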