Date: Fri, 3 Sep 2010 07:59:26 +0100 From: Melissa Jenkins <melissa-freebsd@littlebluecar.co.uk> To: freebsd-net@freebsd.org Subject: Re: NFE adapter 'hangs' Message-ID: <C536F7C7-A1EA-4BDA-8F5F-E6A5919F6D9A@littlebluecar.co.uk> In-Reply-To: <20100902194940.GH21940@michelle.cdnetworks.com> References: <5C261F16-6530-47EE-B1C1-BA38CD6D8B01@littlebluecar.co.uk> <20100902194940.GH21940@michelle.cdnetworks.com>
next in thread | previous in thread | raw e-mail | index | archive | help
Thank you for your very quick response :) >First of all, if you have 'route-to' rules in pf please disable TSO >on nfe(4). This is a known pf(4) issue for handling TSO frames. There is route-to in the PF ruleset but it is not evaluated, and I've = confirmed the rules is not matched @0 pass in on nfe0 route-to (gif1000 X) inet from X/26 to ! = <internalnets> flags S/SA keep state [ Evaluations: 126538 Packets: 0 Bytes: 0 States: = 0 ] [ Inserted: uid 0 pid 79327 ] >If this is not the case, would you show me the output of >"sysctl dev.nfe.0.stats"? Can you see these statistics are still >updated at the time of hang? Before: M [root@X:~]# sysctl dev.nfe.0.stats dev.nfe.0.stats.rx.frame_errors: 0 dev.nfe.0.stats.rx.extra_bytes: 0 dev.nfe.0.stats.rx.late_cols: 0 dev.nfe.0.stats.rx.runts: 0 dev.nfe.0.stats.rx.jumbos: 0 dev.nfe.0.stats.rx.fifo_overuns: 0 dev.nfe.0.stats.rx.crc_errors: 0 dev.nfe.0.stats.rx.fae: 0 dev.nfe.0.stats.rx.len_errors: 0 dev.nfe.0.stats.rx.unicast: 0 dev.nfe.0.stats.rx.multicast: 0 dev.nfe.0.stats.rx.broadcast: 0 dev.nfe.0.stats.rx.octets: 53488472383 dev.nfe.0.stats.rx.pause: 0 dev.nfe.0.stats.rx.drops: 0 dev.nfe.0.stats.tx.octets: 7480691065 dev.nfe.0.stats.tx.zero_rexmits: 68040095 dev.nfe.0.stats.tx.one_rexmits: 0 dev.nfe.0.stats.tx.multi_rexmits: 0 dev.nfe.0.stats.tx.late_cols: 0 dev.nfe.0.stats.tx.fifo_underuns: 0 dev.nfe.0.stats.tx.carrier_losts: 0 dev.nfe.0.stats.tx.excess_deferrals: 0 dev.nfe.0.stats.tx.retry_errors: 0 dev.nfe.0.stats.tx.deferrals: 0 dev.nfe.0.stats.tx.frames: 68040095 dev.nfe.0.stats.tx.pause: 0 After: M [root@X:~]# sysctl dev.nfe.0.stats dev.nfe.0.stats.rx.frame_errors: 0 dev.nfe.0.stats.rx.extra_bytes: 0 dev.nfe.0.stats.rx.late_cols: 0 dev.nfe.0.stats.rx.runts: 0 dev.nfe.0.stats.rx.jumbos: 0 dev.nfe.0.stats.rx.fifo_overuns: 0 dev.nfe.0.stats.rx.crc_errors: 0 dev.nfe.0.stats.rx.fae: 0 dev.nfe.0.stats.rx.len_errors: 0 dev.nfe.0.stats.rx.unicast: 0 dev.nfe.0.stats.rx.multicast: 0 dev.nfe.0.stats.rx.broadcast: 0 dev.nfe.0.stats.rx.octets: 53488985199 dev.nfe.0.stats.rx.pause: 0 dev.nfe.0.stats.rx.drops: 0 dev.nfe.0.stats.tx.octets: 7480798032 dev.nfe.0.stats.tx.zero_rexmits: 68041016 dev.nfe.0.stats.tx.one_rexmits: 0 dev.nfe.0.stats.tx.multi_rexmits: 0 dev.nfe.0.stats.tx.late_cols: 0 dev.nfe.0.stats.tx.fifo_underuns: 0 dev.nfe.0.stats.tx.carrier_losts: 0 dev.nfe.0.stats.tx.excess_deferrals: 0 dev.nfe.0.stats.tx.retry_errors: 0 dev.nfe.0.stats.tx.deferrals: 0 dev.nfe.0.stats.tx.frames: 68041016 dev.nfe.0.stats.tx.pause: 0 M [root@X:~]# sysctl dev.nfe.0.stats dev.nfe.0.stats.rx.frame_errors: 0 dev.nfe.0.stats.rx.extra_bytes: 0 dev.nfe.0.stats.rx.late_cols: 0 dev.nfe.0.stats.rx.runts: 0 dev.nfe.0.stats.rx.jumbos: 0 dev.nfe.0.stats.rx.fifo_overuns: 0 dev.nfe.0.stats.rx.crc_errors: 0 dev.nfe.0.stats.rx.fae: 0 dev.nfe.0.stats.rx.len_errors: 0 dev.nfe.0.stats.rx.unicast: 0 dev.nfe.0.stats.rx.multicast: 0 dev.nfe.0.stats.rx.broadcast: 0 dev.nfe.0.stats.rx.octets: 53489024128 dev.nfe.0.stats.rx.pause: 0 dev.nfe.0.stats.rx.drops: 0 dev.nfe.0.stats.tx.octets: 7480798032 dev.nfe.0.stats.tx.zero_rexmits: 68041016 dev.nfe.0.stats.tx.one_rexmits: 0 dev.nfe.0.stats.tx.multi_rexmits: 0 dev.nfe.0.stats.tx.late_cols: 0 dev.nfe.0.stats.tx.fifo_underuns: 0 dev.nfe.0.stats.tx.carrier_losts: 0 dev.nfe.0.stats.tx.excess_deferrals: 0 dev.nfe.0.stats.tx.retry_errors: 0 dev.nfe.0.stats.tx.deferrals: 0 dev.nfe.0.stats.tx.frames: 68041016 dev.nfe.0.stats.tx.pause: 0 >Also I'd like to know whether both RX and TX are dead or only one >RX/TX path is hung. Can you see incoming traffic with tcpdump when >you think the controller is in stuck? Yes, though not very much. The traffic to 4800 is every second so you = can see in the following trace when it stops 07:10:42.287163 IP 192.168.1.203 > 224.0.0.240: pfsync 108 07:10:42.911995 07:10:43.112073 STP 802.1d, Config, Flags [Topology change], bridge-id = 8000.c4:7d:4f:a9:ac:30.8008, length 43 07:10:43.148659 IP 192.168.1.203.57026 > 192.168.1.255.4800: UDP, length = 60 07:10:43.148684 IP 172.31.1.203 > 172.31.1.129: GREv0, length 92: IP = 192.168.1.203.57026 > 192.168.1.129.4800: UDP, length 60 07:10:43.148689 IP 172.31.1.203 > 172.31.1.129: GREv0, length 92: IP = 192.168.1.203.57026 > 192.168.1.1.4800: UDP, length 60 07:10:43.148918 IP 192.168.1.213.40677 > 192.168.1.255.4800: UDP, length = 48 07:10:43.329272 IP 192.168.1.203 > 224.0.0.240: pfsync 108 07:10:44.151762 IP 192.168.1.203.57026 > 192.168.1.255.4800: UDP, length = 60 07:10:44.151787 IP 172.31.1.203 > 172.31.1.129: GREv0, length 92: IP = 192.168.1.203.57026 > 192.168.1.129.4800: UDP, length 60 07:10:44.151791 IP 172.31.1.203 > 172.31.1.129: GREv0, length 92: IP = 192.168.1.203.57026 > 192.168.1.1.4800: UDP, length 60 07:10:44.152043 IP 192.168.1.213.40677 > 192.168.1.255.4800: UDP, length = 48 07:10:44.371387 IP 192.168.1.203 > 224.0.0.240: pfsync 108 07:10:45.154868 IP 192.168.1.203.57026 > 192.168.1.255.4800: UDP, length = 60 07:10:45.154892 IP 172.31.1.203 > 172.31.1.129: GREv0, length 92: IP = 192.168.1.203.57026 > 192.168.1.129.4800: UDP, length 60 07:10:45.154896 IP 172.31.1.203 > 172.31.1.129: GREv0, length 92: IP = 192.168.1.203.57026 > 192.168.1.1.4800: UDP, length 60 07:10:45.155146 IP 192.168.1.213.40677 > 192.168.1.255.4800: UDP, length = 48 07:10:45.413495 IP 192.168.1.203 > 224.0.0.240: pfsync 108 07:10:45.513130 IP 192.168.1.213 > 224.0.0.240: pfsync 196 07:10:46.157973 IP 192.168.1.203.57026 > 192.168.1.255.4800: UDP, length = 60 07:10:46.157997 IP 172.31.1.203 > 172.31.1.129: GREv0, length 92: IP = 192.168.1.203.57026 > 192.168.1.129.4800: UDP, length 60 07:10:46.158002 IP 172.31.1.203 > 172.31.1.129: GREv0, length 92: IP = 192.168.1.203.57026 > 192.168.1.1.4800: UDP, length 60 07:10:46.158222 IP 192.168.1.213.40677 > 192.168.1.255.4800: UDP, length = 48 07:10:46.455622 IP 192.168.1.203 > 224.0.0.240: pfsync 108 07:10:47.497702 IP 192.168.1.203 > 224.0.0.240: pfsync 108 07:10:48.108787 STP 802.1d, Config, Flags [Topology change], bridge-id = 8000.c4:7d:4f:a9:ac:30.8008, length 43 07:10:48.150038 IP 192.168.1.213 > 224.0.0.240: pfsync 108 07:10:48.539811 IP 192.168.1.203 > 224.0.0.240: pfsync 108 07:10:49.581924 IP 192.168.1.203 > 224.0.0.240: pfsync 108 07:10:49.869585 07:10:50.154224 IP 192.168.1.213 > 224.0.0.240: pfsync 108 07:10:50.624031 IP 192.168.1.203 > 224.0.0.240: pfsync 108 07:10:51.666137 IP 192.168.1.203 > 224.0.0.240: pfsync 108 07:10:52.156425 IP 192.168.1.213 > 224.0.0.240: pfsync 108 07:10:52.708257 IP 192.168.1.203 > 224.0.0.240: pfsync 108 07:10:53.105667 STP 802.1d, Config, Flags [Topology change], bridge-id = 8000.c4:7d:4f:a9:ac:30.8008, length 43 07:10:53.712615 IP 192.168.1.213 > 224.0.0.240: pfsync 284 07:10:53.750351 IP 192.168.1.203 > 224.0.0.240: pfsync 108 07:10:54.792460 IP 192.168.1.203 > 224.0.0.240: pfsync 108 07:10:54.996715 IP 192.168.1.213 > 224.0.0.240: pfsync 36 07:10:55.160134 IP 192.168.1.213 > 224.0.0.240: pfsync 68 07:10:55.834569 IP 192.168.1.203 > 224.0.0.240: pfsync 108 07:10:56.160794 IP 192.168.1.213 > 224.0.0.240: pfsync 108 07:10:56.876680 IP 192.168.1.203 > 224.0.0.240: pfsync 108 07:10:57.918791 IP 192.168.1.203 > 224.0.0.240: pfsync 108 a bit later on, still broken, a slight odd message: 07:11:43.079720 IP 172.31.1.129 > 172.31.1.213: GREv0, length 52: IP = 192.168.1.129.60446 > 192.168.1.213.179: tcp 12 [bad hdr length 16 - = too short, < 20] 07:11:44.210794 IP 172.31.1.129 > 172.31.1.203: GREv0, length 84: IP = 192.168.1.129.64744 > 192.168.1.203.4800: UDP, length 52 07:11:44.210831 IP 172.31.1.129 > 172.31.1.213: GREv0, length 84: IP = 192.168.1.129.64744 > 192.168.1.213.4800: UDP, length 52 Now this really is odd, I don't recognise either of those MAC addresses, = though the SQL shown is used on this machine ( 07:12:13.054393 45:43:54:20:41:63 > 00:00:03:53:45:4c, ethertype Unknown = (0x6374), length 60: 0x0000: 556e 6971 7565 4964 2046 524f 4d20 7261 = UniqueId.FROM.ra 0x0010: 6461 6363 7420 2057 4845 5245 2043 616c = dacct..WHERE.Cal 0x0020: 6c69 6e67 5374 6174 696f 6e49 6420 lingStationId. 07:12:13.229942 ARP, Request who-has 172.31.1.129 tell 172.31.1.213, = length 28 07:12:13.229969 ARP, Request who-has 172.31.1.129 tell 172.31.1.213, = length 28 07:12:14.600212 IP 172.31.1.213 > 224.0.0.18: VRRPv2, Advertisement, = vrid 1, prio 240, authtype none, intvl 1s, length 36 07:12:14.600234 ARP, Request who-has 172.31.1.193 tell 172.31.1.193, = length 28 07:12:15.643450 IP 172.31.1.213 > 224.0.0.18: VRRPv2, Advertisement, = vrid 1, prio 240, authtype none, intvl 1s, length 36 07:12:16.261215 ARP, Request who-has 172.31.1.129 tell 172.31.1.213, = length 28 07:12:17.583475 IP 172.31.1.213 > 224.0.0.18: VRRPv2, Advertisement, = vrid 1, prio 240, authtype none, intvl 1s, length 36 07:12:18.051111 41:54:45:20:74:72 > 00:00:03:55:50:44, ethertype Unknown = (0x616e), length 60: 0x0000: 7369 656e 742e 7261 6469 7070 6f6f 6c20 = sient.radippool. 0x0010: 5345 5420 6e61 7369 7061 6464 7265 7373 = SET.nasipaddress 0x0020: 203d 2027 3137 322e 3331 2e33 2e31 .=3D.'172.30.0.1= I then restarted the interface (nfe down/up, route restart) =46rom dmesg at the time (slight obfuscated) Sep 3 07:10:19 manch2 bgpd[89612]: neighbor XX: received notification: = HoldTimer expired, unknown subcode 0 Sep 3 07:10:49 manch2 bgpd[89612]: neighbor XX connect: Host is down # at this point I took the interface down & up and reloaded the routing = tables Sep 3 07:12:07 manch2 kernel: carp0: link state changed to DOWN Sep 3 07:12:07 manch2 kernel: carp0: link state changed to DOWN Sep 3 07:12:07 manch2 kernel: nfe0: link state changed to DOWN Sep 3 07:12:07 manch2 kernel: carp0: link state changed to DOWN Sep 3 07:12:11 manch2 kernel: nfe0: link state changed to UP =20 Sep 3 07:12:11 manch2 kernel: carp0: link state changed to DOWN Sep 3 07:12:14 manch2 kernel: carp0: link state changed to UP Sep 3 07:13:09 manch2 bgpd[89612]: neighbor YYY: socket error: = Operation timed out And a while later M [root@manch2:~]# sysctl dev.nfe.0.stats dev.nfe.0.stats.rx.frame_errors: 0 dev.nfe.0.stats.rx.extra_bytes: 0 dev.nfe.0.stats.rx.late_cols: 0 dev.nfe.0.stats.rx.runts: 0 dev.nfe.0.stats.rx.jumbos: 0 dev.nfe.0.stats.rx.fifo_overuns: 0 dev.nfe.0.stats.rx.crc_errors: 0 dev.nfe.0.stats.rx.fae: 0 dev.nfe.0.stats.rx.len_errors: 0 dev.nfe.0.stats.rx.unicast: 0 dev.nfe.0.stats.rx.multicast: 0 dev.nfe.0.stats.rx.broadcast: 0 dev.nfe.0.stats.rx.octets: 53502234479 dev.nfe.0.stats.rx.pause: 0 dev.nfe.0.stats.rx.drops: 0 dev.nfe.0.stats.tx.octets: 7482784573 dev.nfe.0.stats.tx.zero_rexmits: 68060063 dev.nfe.0.stats.tx.one_rexmits: 0 dev.nfe.0.stats.tx.multi_rexmits: 0 dev.nfe.0.stats.tx.late_cols: 0 dev.nfe.0.stats.tx.fifo_underuns: 0 dev.nfe.0.stats.tx.carrier_losts: 0 dev.nfe.0.stats.tx.excess_deferrals: 0 dev.nfe.0.stats.tx.retry_errors: 0 dev.nfe.0.stats.tx.deferrals: 0 dev.nfe.0.stats.tx.frames: 68060063 dev.nfe.0.stats.tx.pause: 0 >By chance, are you using polling(4) or VLAN on nfe(4)? No Also, I tried turning TSO off and the same behaviour happened though = over a slightly longer period of time (could be co-incidence) (with TSO off) nfe0: flags=3D8943<UP,BROADCAST,RUNNING,PROMISC,SIMPLEX,MULTICAST> = metric 0 mtu 1500 options=3D9b<RXCSUM,TXCSUM,VLAN_MTU,VLAN_HWTAGGING,VLAN_HWCSUM> ether 00:26:9e:9b:a6:14 inet 172.31.1.213 netmask 0xffffff80 broadcast 172.31.1.255 media: Ethernet autoselect (1000baseT <full-duplex>) status: active Mel
Want to link to this message? Use this URL: <https://mail-archive.FreeBSD.org/cgi/mid.cgi?C536F7C7-A1EA-4BDA-8F5F-E6A5919F6D9A>