FreeBSD Mail Archives

Date:      Fri, 3 Sep 2010 07:59:26 +0100
From:      Melissa Jenkins <melissa-freebsd@littlebluecar.co.uk>
To:        freebsd-net@freebsd.org
Subject:   Re: NFE adapter 'hangs'
Message-ID:  <C536F7C7-A1EA-4BDA-8F5F-E6A5919F6D9A@littlebluecar.co.uk>
In-Reply-To: <20100902194940.GH21940@michelle.cdnetworks.com>
References:  <5C261F16-6530-47EE-B1C1-BA38CD6D8B01@littlebluecar.co.uk> <20100902194940.GH21940@michelle.cdnetworks.com>


Thank you for your very quick response :)

>First of all, if you have 'route-to' rules in pf please disable TSO
>on nfe(4). This is a known pf(4) issue for handling TSO frames.

There is route-to in the PF ruleset but it is not evaluated, and I've =
confirmed the rules is not matched

@0 pass in on nfe0 route-to (gif1000 X) inet from X/26 to ! =
<internalnets> flags S/SA keep state
  [ Evaluations: 126538    Packets: 0         Bytes: 0           States: =
0     ]
  [ Inserted: uid 0 pid 79327 ]


>If this is not the case, would you show me the output of
>"sysctl dev.nfe.0.stats"? Can you see these statistics are still
>updated at the time of hang?

Before:

M [root@X:~]# sysctl dev.nfe.0.stats
dev.nfe.0.stats.rx.frame_errors: 0
dev.nfe.0.stats.rx.extra_bytes: 0
dev.nfe.0.stats.rx.late_cols: 0
dev.nfe.0.stats.rx.runts: 0
dev.nfe.0.stats.rx.jumbos: 0
dev.nfe.0.stats.rx.fifo_overuns: 0
dev.nfe.0.stats.rx.crc_errors: 0
dev.nfe.0.stats.rx.fae: 0
dev.nfe.0.stats.rx.len_errors: 0
dev.nfe.0.stats.rx.unicast: 0
dev.nfe.0.stats.rx.multicast: 0
dev.nfe.0.stats.rx.broadcast: 0
dev.nfe.0.stats.rx.octets: 53488472383
dev.nfe.0.stats.rx.pause: 0
dev.nfe.0.stats.rx.drops: 0
dev.nfe.0.stats.tx.octets: 7480691065
dev.nfe.0.stats.tx.zero_rexmits: 68040095
dev.nfe.0.stats.tx.one_rexmits: 0
dev.nfe.0.stats.tx.multi_rexmits: 0
dev.nfe.0.stats.tx.late_cols: 0
dev.nfe.0.stats.tx.fifo_underuns: 0
dev.nfe.0.stats.tx.carrier_losts: 0
dev.nfe.0.stats.tx.excess_deferrals: 0
dev.nfe.0.stats.tx.retry_errors: 0
dev.nfe.0.stats.tx.deferrals: 0
dev.nfe.0.stats.tx.frames: 68040095
dev.nfe.0.stats.tx.pause: 0

After:
M [root@X:~]# sysctl dev.nfe.0.stats
dev.nfe.0.stats.rx.frame_errors: 0
dev.nfe.0.stats.rx.extra_bytes: 0
dev.nfe.0.stats.rx.late_cols: 0
dev.nfe.0.stats.rx.runts: 0
dev.nfe.0.stats.rx.jumbos: 0
dev.nfe.0.stats.rx.fifo_overuns: 0
dev.nfe.0.stats.rx.crc_errors: 0
dev.nfe.0.stats.rx.fae: 0
dev.nfe.0.stats.rx.len_errors: 0
dev.nfe.0.stats.rx.unicast: 0
dev.nfe.0.stats.rx.multicast: 0
dev.nfe.0.stats.rx.broadcast: 0
dev.nfe.0.stats.rx.octets: 53488985199
dev.nfe.0.stats.rx.pause: 0
dev.nfe.0.stats.rx.drops: 0
dev.nfe.0.stats.tx.octets: 7480798032
dev.nfe.0.stats.tx.zero_rexmits: 68041016
dev.nfe.0.stats.tx.one_rexmits: 0
dev.nfe.0.stats.tx.multi_rexmits: 0
dev.nfe.0.stats.tx.late_cols: 0
dev.nfe.0.stats.tx.fifo_underuns: 0
dev.nfe.0.stats.tx.carrier_losts: 0
dev.nfe.0.stats.tx.excess_deferrals: 0
dev.nfe.0.stats.tx.retry_errors: 0
dev.nfe.0.stats.tx.deferrals: 0
dev.nfe.0.stats.tx.frames: 68041016
dev.nfe.0.stats.tx.pause: 0

M [root@X:~]# sysctl dev.nfe.0.stats
dev.nfe.0.stats.rx.frame_errors: 0
dev.nfe.0.stats.rx.extra_bytes: 0
dev.nfe.0.stats.rx.late_cols: 0
dev.nfe.0.stats.rx.runts: 0
dev.nfe.0.stats.rx.jumbos: 0
dev.nfe.0.stats.rx.fifo_overuns: 0
dev.nfe.0.stats.rx.crc_errors: 0
dev.nfe.0.stats.rx.fae: 0
dev.nfe.0.stats.rx.len_errors: 0
dev.nfe.0.stats.rx.unicast: 0
dev.nfe.0.stats.rx.multicast: 0
dev.nfe.0.stats.rx.broadcast: 0
dev.nfe.0.stats.rx.octets: 53489024128
dev.nfe.0.stats.rx.pause: 0
dev.nfe.0.stats.rx.drops: 0
dev.nfe.0.stats.tx.octets: 7480798032
dev.nfe.0.stats.tx.zero_rexmits: 68041016
dev.nfe.0.stats.tx.one_rexmits: 0
dev.nfe.0.stats.tx.multi_rexmits: 0
dev.nfe.0.stats.tx.late_cols: 0
dev.nfe.0.stats.tx.fifo_underuns: 0
dev.nfe.0.stats.tx.carrier_losts: 0
dev.nfe.0.stats.tx.excess_deferrals: 0
dev.nfe.0.stats.tx.retry_errors: 0
dev.nfe.0.stats.tx.deferrals: 0
dev.nfe.0.stats.tx.frames: 68041016
dev.nfe.0.stats.tx.pause: 0


>Also I'd like to know whether both RX and TX are dead or only one
>RX/TX path is hung. Can you see incoming traffic with tcpdump when
>you think the controller is in stuck?

Yes, though not very much. The traffic to 4800 is every second so you =
can see in the following trace when it stops

07:10:42.287163 IP 192.168.1.203 > 224.0.0.240:  pfsync 108
07:10:42.911995
07:10:43.112073 STP 802.1d, Config, Flags [Topology change], bridge-id =
8000.c4:7d:4f:a9:ac:30.8008, length 43
07:10:43.148659 IP 192.168.1.203.57026 > 192.168.1.255.4800: UDP, length =
60
07:10:43.148684 IP 172.31.1.203 > 172.31.1.129: GREv0, length 92: IP =
192.168.1.203.57026 > 192.168.1.129.4800: UDP, length 60
07:10:43.148689 IP 172.31.1.203 > 172.31.1.129: GREv0, length 92: IP =
192.168.1.203.57026 > 192.168.1.1.4800: UDP, length 60
07:10:43.148918 IP 192.168.1.213.40677 > 192.168.1.255.4800: UDP, length =
48
07:10:43.329272 IP 192.168.1.203 > 224.0.0.240:  pfsync 108
07:10:44.151762 IP 192.168.1.203.57026 > 192.168.1.255.4800: UDP, length =
60
07:10:44.151787 IP 172.31.1.203 > 172.31.1.129: GREv0, length 92: IP =
192.168.1.203.57026 > 192.168.1.129.4800: UDP, length 60
07:10:44.151791 IP 172.31.1.203 > 172.31.1.129: GREv0, length 92: IP =
192.168.1.203.57026 > 192.168.1.1.4800: UDP, length 60
07:10:44.152043 IP 192.168.1.213.40677 > 192.168.1.255.4800: UDP, length =
48
07:10:44.371387 IP 192.168.1.203 > 224.0.0.240:  pfsync 108
07:10:45.154868 IP 192.168.1.203.57026 > 192.168.1.255.4800: UDP, length =
60
07:10:45.154892 IP 172.31.1.203 > 172.31.1.129: GREv0, length 92: IP =
192.168.1.203.57026 > 192.168.1.129.4800: UDP, length 60
07:10:45.154896 IP 172.31.1.203 > 172.31.1.129: GREv0, length 92: IP =
192.168.1.203.57026 > 192.168.1.1.4800: UDP, length 60
07:10:45.155146 IP 192.168.1.213.40677 > 192.168.1.255.4800: UDP, length =
48
07:10:45.413495 IP 192.168.1.203 > 224.0.0.240:  pfsync 108
07:10:45.513130 IP 192.168.1.213 > 224.0.0.240:  pfsync 196
07:10:46.157973 IP 192.168.1.203.57026 > 192.168.1.255.4800: UDP, length =
60
07:10:46.157997 IP 172.31.1.203 > 172.31.1.129: GREv0, length 92: IP =
192.168.1.203.57026 > 192.168.1.129.4800: UDP, length 60
07:10:46.158002 IP 172.31.1.203 > 172.31.1.129: GREv0, length 92: IP =
192.168.1.203.57026 > 192.168.1.1.4800: UDP, length 60
07:10:46.158222 IP 192.168.1.213.40677 > 192.168.1.255.4800: UDP, length =
48
07:10:46.455622 IP 192.168.1.203 > 224.0.0.240:  pfsync 108
07:10:47.497702 IP 192.168.1.203 > 224.0.0.240:  pfsync 108
07:10:48.108787 STP 802.1d, Config, Flags [Topology change], bridge-id =
8000.c4:7d:4f:a9:ac:30.8008, length 43
07:10:48.150038 IP 192.168.1.213 > 224.0.0.240:  pfsync 108
07:10:48.539811 IP 192.168.1.203 > 224.0.0.240:  pfsync 108
07:10:49.581924 IP 192.168.1.203 > 224.0.0.240:  pfsync 108
07:10:49.869585
07:10:50.154224 IP 192.168.1.213 > 224.0.0.240:  pfsync 108
07:10:50.624031 IP 192.168.1.203 > 224.0.0.240:  pfsync 108
07:10:51.666137 IP 192.168.1.203 > 224.0.0.240:  pfsync 108
07:10:52.156425 IP 192.168.1.213 > 224.0.0.240:  pfsync 108
07:10:52.708257 IP 192.168.1.203 > 224.0.0.240:  pfsync 108
07:10:53.105667 STP 802.1d, Config, Flags [Topology change], bridge-id =
8000.c4:7d:4f:a9:ac:30.8008, length 43
07:10:53.712615 IP 192.168.1.213 > 224.0.0.240:  pfsync 284
07:10:53.750351 IP 192.168.1.203 > 224.0.0.240:  pfsync 108
07:10:54.792460 IP 192.168.1.203 > 224.0.0.240:  pfsync 108
07:10:54.996715 IP 192.168.1.213 > 224.0.0.240:  pfsync 36
07:10:55.160134 IP 192.168.1.213 > 224.0.0.240:  pfsync 68
07:10:55.834569 IP 192.168.1.203 > 224.0.0.240:  pfsync 108
07:10:56.160794 IP 192.168.1.213 > 224.0.0.240:  pfsync 108
07:10:56.876680 IP 192.168.1.203 > 224.0.0.240:  pfsync 108
07:10:57.918791 IP 192.168.1.203 > 224.0.0.240:  pfsync 108

a bit later on, still broken, a slight odd message:
07:11:43.079720 IP 172.31.1.129 > 172.31.1.213: GREv0, length 52: IP =
192.168.1.129.60446 > 192.168.1.213.179:  tcp 12 [bad hdr length 16 - =
too short, < 20]
07:11:44.210794 IP 172.31.1.129 > 172.31.1.203: GREv0, length 84: IP =
192.168.1.129.64744 > 192.168.1.203.4800: UDP, length 52
07:11:44.210831 IP 172.31.1.129 > 172.31.1.213: GREv0, length 84: IP =
192.168.1.129.64744 > 192.168.1.213.4800: UDP, length 52

Now this really is odd, I don't recognise either of those MAC addresses, =
though the SQL shown is used on this machine (
07:12:13.054393 45:43:54:20:41:63 > 00:00:03:53:45:4c, ethertype Unknown =
(0x6374), length 60:
        0x0000:  556e 6971 7565 4964 2046 524f 4d20 7261  =
UniqueId.FROM.ra
        0x0010:  6461 6363 7420 2057 4845 5245 2043 616c  =
dacct..WHERE.Cal
        0x0020:  6c69 6e67 5374 6174 696f 6e49 6420       lingStationId.
07:12:13.229942 ARP, Request who-has 172.31.1.129 tell 172.31.1.213, =
length 28
07:12:13.229969 ARP, Request who-has 172.31.1.129 tell 172.31.1.213, =
length 28
07:12:14.600212 IP 172.31.1.213 > 224.0.0.18: VRRPv2, Advertisement, =
vrid 1, prio 240, authtype none, intvl 1s, length 36
07:12:14.600234 ARP, Request who-has 172.31.1.193 tell 172.31.1.193, =
length 28
07:12:15.643450 IP 172.31.1.213 > 224.0.0.18: VRRPv2, Advertisement, =
vrid 1, prio 240, authtype none, intvl 1s, length 36
07:12:16.261215 ARP, Request who-has 172.31.1.129 tell 172.31.1.213, =
length 28
07:12:17.583475 IP 172.31.1.213 > 224.0.0.18: VRRPv2, Advertisement, =
vrid 1, prio 240, authtype none, intvl 1s, length 36
07:12:18.051111 41:54:45:20:74:72 > 00:00:03:55:50:44, ethertype Unknown =
(0x616e), length 60:
        0x0000:  7369 656e 742e 7261 6469 7070 6f6f 6c20  =
sient.radippool.
        0x0010:  5345 5420 6e61 7369 7061 6464 7265 7373  =
SET.nasipaddress
        0x0020:  203d 2027 3137 322e 3331 2e33 2e31       .=3D.'172.30.0.1=



I then restarted the interface (nfe down/up, route restart)

=46rom dmesg at the time (slight obfuscated)
Sep  3 07:10:19 manch2 bgpd[89612]: neighbor XX: received notification: =
HoldTimer expired, unknown subcode 0
Sep  3 07:10:49 manch2 bgpd[89612]: neighbor XX connect: Host is down
# at this point I took the interface down & up and reloaded the routing =
tables
Sep  3 07:12:07 manch2 kernel: carp0: link state changed to DOWN
Sep  3 07:12:07 manch2 kernel: carp0: link state changed to DOWN
Sep  3 07:12:07 manch2 kernel: nfe0: link state changed to DOWN
Sep  3 07:12:07 manch2 kernel: carp0: link state changed to DOWN
Sep  3 07:12:11 manch2 kernel: nfe0: link state changed to UP  =20
Sep  3 07:12:11 manch2 kernel: carp0: link state changed to DOWN
Sep  3 07:12:14 manch2 kernel: carp0: link state changed to UP
Sep  3 07:13:09 manch2 bgpd[89612]: neighbor YYY: socket error: =
Operation timed out

And a while later

M [root@manch2:~]# sysctl dev.nfe.0.stats
dev.nfe.0.stats.rx.frame_errors: 0
dev.nfe.0.stats.rx.extra_bytes: 0
dev.nfe.0.stats.rx.late_cols: 0
dev.nfe.0.stats.rx.runts: 0
dev.nfe.0.stats.rx.jumbos: 0
dev.nfe.0.stats.rx.fifo_overuns: 0
dev.nfe.0.stats.rx.crc_errors: 0
dev.nfe.0.stats.rx.fae: 0
dev.nfe.0.stats.rx.len_errors: 0
dev.nfe.0.stats.rx.unicast: 0
dev.nfe.0.stats.rx.multicast: 0
dev.nfe.0.stats.rx.broadcast: 0
dev.nfe.0.stats.rx.octets: 53502234479
dev.nfe.0.stats.rx.pause: 0
dev.nfe.0.stats.rx.drops: 0
dev.nfe.0.stats.tx.octets: 7482784573
dev.nfe.0.stats.tx.zero_rexmits: 68060063
dev.nfe.0.stats.tx.one_rexmits: 0
dev.nfe.0.stats.tx.multi_rexmits: 0
dev.nfe.0.stats.tx.late_cols: 0
dev.nfe.0.stats.tx.fifo_underuns: 0
dev.nfe.0.stats.tx.carrier_losts: 0
dev.nfe.0.stats.tx.excess_deferrals: 0
dev.nfe.0.stats.tx.retry_errors: 0
dev.nfe.0.stats.tx.deferrals: 0
dev.nfe.0.stats.tx.frames: 68060063
dev.nfe.0.stats.tx.pause: 0


>By chance, are you using polling(4) or VLAN on nfe(4)?

No

Also, I tried turning TSO off and the same behaviour happened though =
over a slightly longer period of time (could be co-incidence)

(with TSO off)
nfe0: flags=3D8943<UP,BROADCAST,RUNNING,PROMISC,SIMPLEX,MULTICAST> =
metric 0 mtu 1500
        options=3D9b<RXCSUM,TXCSUM,VLAN_MTU,VLAN_HWTAGGING,VLAN_HWCSUM>
        ether 00:26:9e:9b:a6:14
        inet 172.31.1.213 netmask 0xffffff80 broadcast 172.31.1.255
        media: Ethernet autoselect (1000baseT <full-duplex>)
        status: active


Mel

Want to link to this message? Use this URL: <https://mail-archive.FreeBSD.org/cgi/mid.cgi?C536F7C7-A1EA-4BDA-8F5F-E6A5919F6D9A>

Header And Logo

Peripheral Links

Site Navigation

Header And Logo

Peripheral Links

Search

Site Navigation