From owner-freebsd-net@FreeBSD.ORG Tue Feb 1 21:19:10 2011 Return-Path: Delivered-To: freebsd-net@freebsd.org Received: from mx1.freebsd.org (mx1.freebsd.org [IPv6:2001:4f8:fff6::34]) by hub.freebsd.org (Postfix) with ESMTP id A299E106566B for ; Tue, 1 Feb 2011 21:19:10 +0000 (UTC) (envelope-from carlson39@llnl.gov) Received: from smtp.llnl.gov (nspiron-3.llnl.gov [128.115.41.83]) by mx1.freebsd.org (Postfix) with ESMTP id 912D48FC18 for ; Tue, 1 Feb 2011 21:19:10 +0000 (UTC) X-Attachments: None Received: from bagua.llnl.gov (HELO [134.9.197.135]) ([134.9.197.135]) by smtp.llnl.gov with ESMTP; 01 Feb 2011 12:50:36 -0800 Message-ID: <4D48721A.5040906@llnl.gov> Date: Tue, 01 Feb 2011 12:50:34 -0800 From: Mike Carlson User-Agent: Mozilla/5.0 (X11; U; Linux x86_64; en-US; rv:1.9.2.13) Gecko/20101208 Lightning/1.0b2 Thunderbird/3.1.7 MIME-Version: 1.0 To: freebsd-net@freebsd.org Content-Type: text/plain; charset=ISO-8859-1; format=flowed Content-Transfer-Encoding: 7bit Subject: A flood of bacula traffic causes igb interface to go offline. X-BeenThere: freebsd-net@freebsd.org X-Mailman-Version: 2.1.5 Precedence: list List-Id: Networking and TCP/IP with FreeBSD List-Unsubscribe: , List-Archive: List-Post: List-Help: List-Subscribe: , X-List-Received-Date: Tue, 01 Feb 2011 21:19:10 -0000 Hey net@, I have a FreeBSD 8.2-RC2 system running on a HP DL180 G6, using the onboard Intel controller, and it is our primary Bacula storage node and director node. We have 96 clients that are scheduled to run at 8:30pm. After about 9 - 10 minutes of activity (mrtg graphs show about 50-60MB/sec incoming traffic), the igb1 interface is no longer able to communicate with the Cisco switch. The interesting part is, the interface is still "up", there is nothing in the kernel message buffer, and nothing relevant in the log file (just syslogd and ldap errors because they cannot reach their respective network servers). The system only responds to the network until I either reboot, or run 'ifconfig igb1 down ; ifconfig igb1 up'. There is no firewall loaded/configured. Thankfully, I have a KVM over IP, so when this happens I can at least run script(1) and capture some useful information. ifconfig igb1 igb1: flags=8843 metric 0 mtu 1500 options=1bb ether 1c:c1:de:e9:fb:af inet 128.15.136.105 netmask 0xffffff00 broadcast 128.15.136.255 inet 128.15.136.108 netmask 0xffffff00 broadcast 128.15.136.255 inet 128.15.136.102 netmask 0xffffff00 broadcast 128.15.136.255 media: Ethernet autoselect (1000baseT ) status: active I can ping the internal IP (but I realize that is probably a useless test...) root@write /etc]> ping 128.15.136.105 PING 128.15.136.105 (128.15.136.105): 56 data bytes 64 bytes from 128.15.136.105: icmp_seq=0 ttl=64 time=0.024 ms 64 bytes from 128.15.136.105: icmp_seq=1 ttl=64 time=0.015 ms ^C --- 128.15.136.105 ping statistics --- 2 packets transmitted, 2 packets received, 0.0% packet loss round-trip min/avg/max/stddev = 0.015/0.019/0.024/0.005 ms Attempting to ping the router: root@write /etc]> ping 128.15.136.254 PING 128.15.136.254 (128.15.136.254): 56 data bytes ping: sendto: Host is down ping: sendto: Host is down ping: sendto: Host is down ping: sendto: Host is down ^C --- 128.15.136.254 ping statistics --- 9 packets transmitted, 0 packets received, 100.0% packet loss The only thing that seems to solve this problem is to either reboot, or do an "ifconfig down/up": root@write /etc]> ifconfig igb1 down root@write /etc]> ifconfig igb1 root@write /etc]> ping 128.15.136.254 PING 128.15.136.254 (128.15.136.254): 56 data bytes 64 bytes from 128.15.136.254: icmp_seq=1 ttl=255 time=1.015 ms 64 bytes from 128.15.136.254: icmp_seq=2 ttl=255 time=0.217 ms 64 bytes from 128.15.136.254: icmp_seq=3 ttl=255 time=0.278 ms 64 bytes from 128.15.136.254: icmp_seq=4 ttl=255 time=0.238 ms ^C --- 128.15.136.254 ping statistics --- 5 packets transmitted, 4 packets received, 20.0% packet loss round-trip min/avg/max/stddev = 0.217/0.437/1.015/0.334 ms I was able to run tcpdump during all of this, and it *nothing* between the system and the switch until I run ifconfig igb1 down/up, and then you see the CDP and Tree Spanning traffic. The networking team here has told me there are no errors on the switch, or the port I am on, and they even moved me from one port to another, but this is still happening on a fairly regular basis now that I've added more backup clients. Is this a possible bug with my hardware and the intel driver? I have a pcap file and more system information that might provide a lot more information, but I don't want to send that out to a mailing list.