From owner-freebsd-net@FreeBSD.ORG  Tue Feb  1 21:19:10 2011
Message-ID: <4D48721A.5040906@llnl.gov>
Date: Tue, 01 Feb 2011 12:50:34 -0800
From: Mike Carlson <carlson39@llnl.gov>
User-Agent: Mozilla/5.0 (X11; U; Linux x86_64; en-US;
	rv:1.9.2.13) Gecko/20101208 Lightning/1.0b2 Thunderbird/3.1.7
MIME-Version: 1.0
To: freebsd-net@freebsd.org
Content-Type: text/plain; charset=ISO-8859-1; format=flowed
Content-Transfer-Encoding: 7bit
Subject: A flood of bacula traffic causes igb interface to go offline.

Hey net@,

I have a FreeBSD 8.2-RC2 system running on an HP DL180 G6, using the
onboard Intel controller. It is our primary Bacula storage node and
director node.

We have 96 clients that are scheduled to run at 8:30pm. After about
9-10 minutes of activity (MRTG graphs show about 50-60 MB/sec of
incoming traffic), the igb1 interface can no longer communicate with
the Cisco switch.

The interesting part is that the interface is still "up", there is
nothing in the kernel message buffer, and nothing relevant in the log
files (just syslogd and LDAP errors because they cannot reach their
respective network servers). The system does not respond on the network
again until I either reboot or run 'ifconfig igb1 down ; ifconfig igb1
up'. There is no firewall loaded or configured.

Thankfully, I have a KVM over IP, so when this happens I can at least 
run script(1) and capture some useful information.
ifconfig igb1
igb1: flags=8843<UP,BROADCAST,RUNNING,SIMPLEX,MULTICAST> metric 0 mtu 1500
     options=1bb<RXCSUM,TXCSUM,VLAN_MTU,VLAN_HWTAGGING,JUMBO_MTU,VLAN_HWCSUM,TSO4>
     ether 1c:c1:de:e9:fb:af
     inet 128.15.136.105 netmask 0xffffff00 broadcast 128.15.136.255
     inet 128.15.136.108 netmask 0xffffff00 broadcast 128.15.136.255
     inet 128.15.136.102 netmask 0xffffff00 broadcast 128.15.136.255
     media: Ethernet autoselect (1000baseT <full-duplex>)
     status: active

I can ping the internal IP (but I realize that is probably a useless
test...)
root@write /etc]> ping 128.15.136.105
PING 128.15.136.105 (128.15.136.105): 56 data bytes
64 bytes from 128.15.136.105: icmp_seq=0 ttl=64 time=0.024 ms
64 bytes from 128.15.136.105: icmp_seq=1 ttl=64 time=0.015 ms
^C
--- 128.15.136.105 ping statistics ---
2 packets transmitted, 2 packets received, 0.0% packet loss
round-trip min/avg/max/stddev = 0.015/0.019/0.024/0.005 ms

Attempting to ping the router:
root@write /etc]> ping 128.15.136.254
PING 128.15.136.254 (128.15.136.254): 56 data bytes
ping: sendto: Host is down
ping: sendto: Host is down
ping: sendto: Host is down
ping: sendto: Host is down
^C
--- 128.15.136.254 ping statistics ---
9 packets transmitted, 0 packets received, 100.0% packet loss


The only thing that seems to solve this problem is to either reboot, or
do an "ifconfig down/up":

root@write /etc]> ifconfig igb1 down
root@write /etc]> ifconfig igb1 up
root@write /etc]> ping 128.15.136.254
PING 128.15.136.254 (128.15.136.254): 56 data bytes
64 bytes from 128.15.136.254: icmp_seq=1 ttl=255 time=1.015 ms
64 bytes from 128.15.136.254: icmp_seq=2 ttl=255 time=0.217 ms
64 bytes from 128.15.136.254: icmp_seq=3 ttl=255 time=0.278 ms
64 bytes from 128.15.136.254: icmp_seq=4 ttl=255 time=0.238 ms
^C
--- 128.15.136.254 ping statistics ---
5 packets transmitted, 4 packets received, 20.0% packet loss
round-trip min/avg/max/stddev = 0.217/0.437/1.015/0.334 ms
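
The manual recovery above could be scripted as a stopgap while the root
cause is tracked down. What follows is only a minimal sketch under
stated assumptions: the function names, the LOGDIR default, and the
choice of diagnostics to save are illustrative and not part of the
setup described here. It probes the router, and if there is no reply it
saves some state for later analysis before bouncing the interface.

```shell
#!/bin/sh
# Hypothetical watchdog sketch for the hang described above.
# IFACE and GATEWAY default to the values from this report;
# LOGDIR is an assumed location for the captured diagnostics.
IFACE="${IFACE:-igb1}"
GATEWAY="${GATEWAY:-128.15.136.254}"
LOGDIR="${LOGDIR:-/var/tmp}"

# Succeeds if the router answers at least one of three probes.
gateway_reachable() {
    ping -c 3 "$GATEWAY" >/dev/null 2>&1
}

# Save interface, mbuf, interrupt and driver state, then print the
# name of the file it was written to.
capture_state() {
    out="$LOGDIR/${IFACE}-hang-$(date +%Y%m%d-%H%M%S).log"
    {
        ifconfig "$IFACE"
        netstat -m                      # mbuf usage
        netstat -I "$IFACE" -d          # per-interface counters and drops
        vmstat -i                       # interrupt counts
        sysctl "dev.igb.${IFACE#igb}"   # driver sysctl tree, e.g. dev.igb.1
    } >"$out" 2>&1
    echo "$out"
}

# Same recovery as the manual 'ifconfig down/up'.
bounce_iface() {
    ifconfig "$IFACE" down
    ifconfig "$IFACE" up
}

# Returns 0 if the router is reachable, 1 if the interface was bounced.
check_once() {
    if gateway_reachable; then
        return 0
    fi
    log=$(capture_state)
    echo "$IFACE unresponsive, state saved to $log" >&2
    bounce_iface
    return 1
}
```

Calling check_once periodically (for example from a cron job that
sources this file) would keep backups from stalling all night, at the
cost of hiding the hang, so the saved logs are the part worth keeping.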

I was able to run tcpdump during all of this, and it showed *nothing*
between the system and the switch until I ran ifconfig igb1 down/up,
at which point the CDP and Spanning Tree traffic reappears.

The networking team here has told me there are no errors on the switch, 
or the port I am on, and they even moved me from one port to another, 
but this is still happening on a fairly regular basis now that I've 
added more backup clients.

Is this possibly a bug involving my hardware and the Intel driver? I
have a pcap file and more system information that might help with
diagnosis, but I don't want to send that out to a mailing list.