From owner-freebsd-net@FreeBSD.ORG  Thu Aug 31 17:14:41 2006
Return-Path: <owner-freebsd-net@FreeBSD.ORG>
X-Original-To: freebsd-net@freebsd.org
Delivered-To: freebsd-net@freebsd.org
Received: from mx1.FreeBSD.org (mx1.freebsd.org [216.136.204.125])
	by hub.freebsd.org (Postfix) with ESMTP id 410D416A4DA
	for <freebsd-net@freebsd.org>; Thu, 31 Aug 2006 17:14:41 +0000 (UTC)
	(envelope-from rob@hudson-trading.com)
Received: from ms-smtp-01.rdc-nyc.rr.com (ms-smtp-01.rdc-nyc.rr.com
	[24.29.109.5]) by mx1.FreeBSD.org (Postfix) with ESMTP id C043243D45
	for <freebsd-net@freebsd.org>; Thu, 31 Aug 2006 17:14:40 +0000 (GMT)
	(envelope-from rob@hudson-trading.com)
Received: from cpe-72-229-120-238.nyc.res.rr.com
	(cpe-72-229-120-238.nyc.res.rr.com [72.229.120.238])
	by ms-smtp-01.rdc-nyc.rr.com (8.13.6/8.13.6) with ESMTP id
	k7VHEdvQ000639
	for <freebsd-net@freebsd.org>; Thu, 31 Aug 2006 13:14:39 -0400 (EDT)
Date: Thu, 31 Aug 2006 13:15:24 -0400 (EDT)
From: Rob Watt <rob@hudson-trading.com>
X-X-Sender: rob@cpe-72-229-120-238.nyc.res.rr.com
To: freebsd-net@freebsd.org
Message-ID: <Pine.OSX.4.64.0608311124590.8120@cpe-72-229-120-238.nyc.res.rr.com>
MIME-Version: 1.0
Content-Type: TEXT/PLAIN; charset=US-ASCII; format=flowed
X-Virus-Scanned: Symantec AntiVirus Scan Engine
Subject: Intel em receive hang and possible pr #72970
X-BeenThere: freebsd-net@freebsd.org
X-Mailman-Version: 2.1.5
Precedence: list
List-Id: Networking and TCP/IP with FreeBSD <freebsd-net.freebsd.org>
List-Unsubscribe: <http://lists.freebsd.org/mailman/listinfo/freebsd-net>,
	<mailto:freebsd-net-request@freebsd.org?subject=unsubscribe>
List-Archive: <http://lists.freebsd.org/pipermail/freebsd-net>
List-Post: <mailto:freebsd-net@freebsd.org>
List-Help: <mailto:freebsd-net-request@freebsd.org?subject=help>
List-Subscribe: <http://lists.freebsd.org/mailman/listinfo/freebsd-net>,
	<mailto:freebsd-net-request@freebsd.org?subject=subscribe>
X-List-Received-Date: Thu, 31 Aug 2006 17:14:41 -0000

Hi,

We have experienced a very sporadic problem on 2 amd64 machines running 
FreeBSD 6.0-RELEASE.

The hardware:

  Tyan K8SR motherboard
  2 AMD 275 dual-core processors
  Intel Pro 1000 MT dual-port copper server card
  Intel Pro 1000 MF dual-port fiber server card
  Adaptec 2230S Raid controller

These machines receive multicast and TCP data on multiple interfaces, 
process it, record it to disk, and then rebroadcast it on one interface.

Twice now (once on each machine after a recent upgrade to 6.0-RELEASE) 
the 2 fiber em interfaces seemed to stop receiving. Transmits seemed to 
still be happening, and the machine itself was not hung. We could 
console into it and do anything not network related.

The first time this happened we opted to quickly disconnect the machine 
from the network and move its processes to a backup machine. We did not 
see anything interesting with netstat, vmstat, logs, etc. (however, I do not 
remember exactly which tests I ran at the time). Everything seemed 
normal except that it was not receiving on the 2 fiber interfaces (we did 
not actually test the other interfaces, but one of our apps that uses the 
copper interfaces was still receiving data). We rebooted the machine and 
ran Intel's NIC diagnostics. The card passed all of the tests over roughly 
100 iterations.

We eventually put the machine back into production. The second machine had 
the same problem. Unfortunately I was on vacation when it happened and did 
not get to do any diagnostics. The developers just put the backup machine 
into production and rebooted the one with the problem.

After poking around in various group/pr postings the most similar problem 
that we found was PR #72970.
  http://www.freebsd.org/cgi/query-pr.cgi?pr=72970

Does it seem that we are encountering that bug? Is that bug fixed in 
6.1-RELEASE, or is there an easy patch to 6.0-RELEASE (i.e. can we patch 
only the em driver)?

If it does not seem that we are triggering that bug, does anyone have any 
thoughts about what the problem could be?

We have done fairly intense stress testing in the past on these machines 
with tons of network/disk/cpu/memory activity all happening at the same 
time, and we've never encountered this bug. The fact that it is not easily 
repeatable makes it hard to test for. Any testing suggestions would also 
be appreciated.

Thanks
-
Rob Watt