Date:      Fri, 15 Oct 2010 13:25:08 +0100
From:      Melissa Jenkins <melissa-freebsd@littlebluecar.co.uk>
To:        pyunyh@gmail.com
Cc:        freebsd-net@freebsd.org
Subject:   Re: NFE adapter 'hangs'
Message-ID:  <9BBD5E0C-06D3-4FA5-B85C-5256DA3AD483@littlebluecar.co.uk>
In-Reply-To: <20100904005349.GP21940@michelle.cdnetworks.com>
References:  <5C261F16-6530-47EE-B1C1-BA38CD6D8B01@littlebluecar.co.uk> <20100902194940.GH21940@michelle.cdnetworks.com> <C536F7C7-A1EA-4BDA-8F5F-E6A5919F6D9A@littlebluecar.co.uk> <20100904005349.GP21940@michelle.cdnetworks.com>

On 4 Sep 2010, at 01:53, Pyun YongHyeon wrote:

> On Fri, Sep 03, 2010 at 07:59:26AM +0100, Melissa Jenkins wrote:
>>
>> Thank you for your very quick response :)
>>
>
> [...]
>
>>> Also I'd like to know whether both RX and TX are dead or only one
>>> RX/TX path is hung. Can you see incoming traffic with tcpdump when
>>> you think the controller is in stuck?
>>
>> Yes, though not very much. The traffic to port 4800 is sent every second, so you can see in the following trace when it stops:
>>
>> 07:10:42.287163 IP 192.168.1.203 > 224.0.0.240:  pfsync 108
>> 07:10:42.911995
>> 07:10:43.112073 STP 802.1d, Config, Flags [Topology change], bridge-id 8000.c4:7d:4f:a9:ac:30.8008, length 43
>> 07:10:43.148659 IP 192.168.1.203.57026 > 192.168.1.255.4800: UDP, length 60
>> 07:10:43.148684 IP 172.31.1.203 > 172.31.1.129: GREv0, length 92: IP 192.168.1.203.57026 > 192.168.1.129.4800: UDP, length 60
>> 07:10:43.148689 IP 172.31.1.203 > 172.31.1.129: GREv0, length 92: IP 192.168.1.203.57026 > 192.168.1.1.4800: UDP, length 60
>> 07:10:43.148918 IP 192.168.1.213.40677 > 192.168.1.255.4800: UDP, length 48
>
> [...]
>
>> A bit later on, still broken, a slightly odd message:
>> 07:11:43.079720 IP 172.31.1.129 > 172.31.1.213: GREv0, length 52: IP 192.168.1.129.60446 > 192.168.1.213.179:  tcp 12 [bad hdr length 16 - too short, < 20]
>> 07:11:44.210794 IP 172.31.1.129 > 172.31.1.203: GREv0, length 84: IP 192.168.1.129.64744 > 192.168.1.203.4800: UDP, length 52
>> 07:11:44.210831 IP 172.31.1.129 > 172.31.1.213: GREv0, length 84: IP 192.168.1.129.64744 > 192.168.1.213.4800: UDP, length 52
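tcpdump's "bad hdr length" complaint comes from the TCP Data Offset field. The high 4 bits of byte 12 of the TCP header give the header length in 32-bit words, and anything under 5 words (20 bytes) cannot be a valid TCP header. A reported length of 16 therefore means that field held 4.

The following minimal Python sketch illustrates the check. The function names are hypothetical, the message text only mimics tcpdump's wording, and the test buffer is synthetic rather than bytes taken from this capture:

```python
MIN_TCP_HEADER_LEN = 20  # 5 x 32-bit words: the fixed TCP header with no options


def tcp_header_length(segment: bytes) -> int:
    """Header length in bytes, read from the 4-bit Data Offset field.

    Data Offset is the high nibble of byte 12 of the TCP header and counts
    32-bit words, so the length in bytes is that value times 4.
    """
    if len(segment) < 13:
        raise ValueError("need at least 13 bytes to read the Data Offset field")
    return (segment[12] >> 4) * 4


def describe(segment: bytes) -> str:
    """Return a tcpdump-style verdict on the header length."""
    hlen = tcp_header_length(segment)
    if hlen < MIN_TCP_HEADER_LEN:
        return f"bad hdr length {hlen} - too short, < {MIN_TCP_HEADER_LEN}"
    return f"hdr length {hlen} ok"


# Synthetic 20-byte buffer (not from the capture) whose Data Offset
# nibble is 4, i.e. 4 words = 16 bytes.
bogus = bytes(12) + bytes([0x40]) + bytes(7)
print(describe(bogus))  # bad hdr length 16 - too short, < 20
```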
>>
>> Now this really is odd: I don't recognise either of those MAC addresses, though the SQL shown is used on this machine.
>> 07:12:13.054393 45:43:54:20:41:63 > 00:00:03:53:45:4c, ethertype Unknown (0x6374), length 60:
>>        0x0000:  556e 6971 7565 4964 2046 524f 4d20 7261  UniqueId.FROM.ra
>>        0x0010:  6461 6363 7420 2057 4845 5245 2043 616c  dacct..WHERE.Cal
>>        0x0020:  6c69 6e67 5374 6174 696f 6e49 6420       lingStationId.
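The bytes in this frame can be checked by hand. An Ethernet II header is the destination MAC, then the source MAC, then a 2-byte EtherType, while tcpdump's `-e` output prints the source first.

The following Python sketch rebuilds the 14-byte header from the two "MAC addresses" and the EtherType 0x6374, appends the hex dump, and decodes the result as ASCII. It is a hypothetical helper, not part of the original thread:

```python
# Values copied from the capture quoted above. tcpdump -e prints
# "source > destination", but an Ethernet II header on the wire is
# destination MAC, then source MAC, then the 2-byte EtherType.
SRC_MAC = "45:43:54:20:41:63"
DST_MAC = "00:00:03:53:45:4c"
ETHERTYPE = 0x6374
PAYLOAD_HEX = (
    "556e 6971 7565 4964 2046 524f 4d20 7261 "
    "6461 6363 7420 2057 4845 5245 2043 616c "
    "6c69 6e67 5374 6174 696f 6e49 6420"
)


def mac_to_bytes(mac: str) -> bytes:
    """Convert a colon-separated MAC address string to 6 raw bytes."""
    return bytes(int(octet, 16) for octet in mac.split(":"))


frame = (
    mac_to_bytes(DST_MAC)
    + mac_to_bytes(SRC_MAC)
    + ETHERTYPE.to_bytes(2, "big")
    + bytes.fromhex(PAYLOAD_HEX)
)
text = frame.decode("ascii")
# Replace non-printable bytes with "." so the text is readable.
print(len(frame), "".join(c if c.isprintable() else "." for c in text))
```

Run as-is, it reports a 60-byte frame, which matches tcpdump's "length 60". After three non-printable leading bytes, the text reads `SELECT AcctUniqueId FROM radacct  WHERE CallingStationId`. So the "addresses" and the EtherType appear to be part of the SQL string itself. That fits the idea of payload data reaching the stack where an Ethernet header was expected.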
>
> Hmm, it seems you're using a really complex setup. It's very hard to
> narrow down the guilty component in this environment. Could you set up
> a simple network configuration that reproduces the issue? One
> possible cause would be wrong (garbled) data being passed up to the
> upper stack. But I have no idea why you see GRE packets with a
> truncated TCP header (172.31.1.129 > 172.31.1.213).
> How about disabling TX/RX checksum offloading as well as TSO?
>
> [...]
>
>>
>> I then restarted the interface (nfe down/up, route restart).
>>
>> From dmesg at the time (slightly obfuscated):
>> Sep  3 07:10:19 manch2 bgpd[89612]: neighbor XX: received notification: HoldTimer expired, unknown subcode 0
>> Sep  3 07:10:49 manch2 bgpd[89612]: neighbor XX connect: Host is down
>> # at this point I took the interface down & up and reloaded the routing tables
>> Sep  3 07:12:07 manch2 kernel: carp0: link state changed to DOWN
>> Sep  3 07:12:07 manch2 kernel: carp0: link state changed to DOWN
>> Sep  3 07:12:07 manch2 kernel: nfe0: link state changed to DOWN
>> Sep  3 07:12:07 manch2 kernel: carp0: link state changed to DOWN
>> Sep  3 07:12:11 manch2 kernel: nfe0: link state changed to UP
>> Sep  3 07:12:11 manch2 kernel: carp0: link state changed to DOWN
>> Sep  3 07:12:14 manch2 kernel: carp0: link state changed to UP
>
> Hmm, that does not look right: carp0 showed the link DOWN message four
> times in a row.
> By the way, are you using IPMI on the MCP55? nfe(4) is not ready to
> handle MAC operation with IPMI.


Turning off TX and RX checksum offloading seems to have resolved the problem:

ifconfig nfe0 -txcsum -rxcsum

It seems to have stopped both the corruption and the interface hanging. I ran it for about 16 hours on the FreeBSD 8 box, and it also appears to have fixed the problem on my FreeBSD 7 machine.

I didn't try turning off TSO.
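This is an editorial sketch added for context, not part of the thread. With `-txcsum -rxcsum` set, checksums are computed and verified by the kernel in software rather than by the nfe(4) hardware. The core of the TCP/UDP calculation is the RFC 1071 Internet checksum, a 16-bit one's-complement sum. The real TCP/UDP checksum also covers a pseudo-header, which this minimal Python version leaves out. The function name is hypothetical, and this is not FreeBSD's code:

```python
def internet_checksum(data: bytes) -> int:
    """RFC 1071 Internet checksum: the one's complement of the
    one's-complement sum of the data as big-endian 16-bit words."""
    if len(data) % 2:
        data += b"\x00"  # odd length: pad with a zero byte
    total = 0
    for i in range(0, len(data), 2):
        total += (data[i] << 8) | data[i + 1]
    while total >> 16:  # fold carries back into the low 16 bits
        total = (total & 0xFFFF) + (total >> 16)
    return ~total & 0xFFFF


# Worked example bytes from RFC 1071.
sample = bytes.fromhex("0001f203f4f5f6f7")
csum = internet_checksum(sample)
print(hex(csum))  # 0x220d
# A receiver sums the data together with the transmitted checksum;
# a result of 0 means the check passes.
print(internet_checksum(sample + csum.to_bytes(2, "big")))  # 0
```

Running the function again over the data plus its checksum returns 0. That is the software check the stack performs on received segments once RX checksum offload is turned off.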

Thank you for your suggestion & help!
Mel




