From owner-freebsd-current@FreeBSD.ORG  Mon Jan 24 09:54:11 2005
Date: Mon, 24 Jan 2005 09:53:44 +0000 (GMT)
From: Robert Watson <rwatson@freebsd.org>
X-Sender: robert@fledge.watson.org
To: Ganbold <ganbold@micom.mng.net>
In-Reply-To: <6.2.0.14.2.20050124113106.03402770@202.179.0.80>
Message-ID: <Pine.NEB.3.96L.1050124093419.63183A-100000@fledge.watson.org>
MIME-Version: 1.0
Content-Type: TEXT/PLAIN; charset=US-ASCII
cc: freebsd-current@freebsd.org
Subject: Re: fxp0: device timed out problem

On Mon, 24 Jan 2005, Ganbold wrote:

> > > I turned off debug.mpsafenet to 0 and it seems like problem goes away.
> ><snip>
> > > Is this problem related to network stack? Or is it related to fxp driver?
> >
> >It's most likely a problem with the device driver or interrupt
> >configuration on your system.  There are a couple of other variables you
> >might try frobbing:
> >
> >- Use of ACPI to configure the hardware
> >- Use of "device apic" if the system is non-SMP
> 
> I see. One of the server is SMP system and device apic option is used in
> kernel config file. I didn't try device apic on non-SMP machine. 

Any luck with disabling ACPI?  In particular, are the interrupt
assignments substantially different between booting with ACPI and without?
You can probably just diff -u the old dmesg.boot and the new one...
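
If the full diff is too noisy, a rough sketch like the following narrows it
to interrupt assignments.  It assumes you kept a copy of /var/run/dmesg.boot
from the ACPI boot before rebooting without ACPI; the function name and the
/tmp paths are just placeholders, not anything that ships with the system:

```shell
#!/bin/sh
# irq_diff OLD NEW: show only the differences between two saved dmesg
# files that involve interrupt assignments (lines mentioning "irq").
# Hypothetical helper; returns diff's exit status (1 if they differ).
irq_diff() {
    old_irqs=$(mktemp /tmp/irqdiff.XXXXXX) || return 2
    new_irqs=$(mktemp /tmp/irqdiff.XXXXXX) || return 2
    # Keep only the lines that mention an IRQ in each boot log.
    grep -i 'irq' "$1" > "$old_irqs"
    grep -i 'irq' "$2" > "$new_irqs"
    diff -u "$old_irqs" "$new_irqs"
    status=$?
    rm -f "$old_irqs" "$new_irqs"
    return $status
}

# Example (paths are assumptions):
#   irq_diff /tmp/dmesg.acpi /var/run/dmesg.boot
```

Lines starting with "-" come from the old boot and lines starting with "+"
from the new one, so a changed irq number for fxp0 should stand out.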

> >Usually a device timed out error is related to interrupts from the device
> >not being delivered, being delivered improperly, etc.  Does your dmesg
> >contain any references to interrupt storms?  Once the above message has
> >printed, do you see any further interrupts on the fxp interrupt source
> >when checking intermittently with "systat -vmstat 1" or "vmstat -i"?
> 
> I couldn't check the system by issuing those commands.  Following is the
> dmesg output with debug.mpsafenet disabled: 

Couldn't as in: not possible for administrative reasons, not possible
because you couldn't log in once the failure occurred, or because the
commands themselves don't work?  I just want to make sure I understand
whether this is an administrative issue or a symptom of the problem.

> I didn't do much investigation on those servers that time. However
> without debug.mpsafenet, servers are working fine for more than 3 weeks. 

That is certainly suggestive -- I wonder if we're looking at a locking bug
in the fxp driver involving serialization of access to the hardware.
However, I don't think it's conclusive: with debug.mpsafenet enabled, the
timing is quite different on UP as well as SMP hardware, which could
expose other existing bugs.  The big open question, I think, is whether an
interrupt delivery problem is involved.
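
One way to check that the next time the timeout message appears is to take
two "vmstat -i" snapshots a few seconds apart and see whether the fxp0 count
moved.  A minimal sketch, assuming the running total is the second-to-last
column of the "vmstat -i" output (as on the systems I'm familiar with); the
function names are made up for illustration:

```shell
#!/bin/sh
# irq_total DEVICE FILE: print the running interrupt count from the first
# line of saved "vmstat -i" output that names DEVICE.  Assumes the total
# is the second-to-last column (the last one being the rate).
irq_total() {
    awk -v dev="$1" 'index($0, dev) { print $(NF-1); exit }' "$2"
}

# irq_advanced DEVICE BEFORE AFTER: succeed only if the count for DEVICE
# is present in both snapshots and went up between them.
irq_advanced() {
    before=$(irq_total "$1" "$2")
    after=$(irq_total "$1" "$3")
    [ -n "$before" ] && [ -n "$after" ] && [ "$after" -gt "$before" ]
}

# Example (snapshot paths are assumptions):
#   vmstat -i > /tmp/before; sleep 10; vmstat -i > /tmp/after
#   irq_advanced fxp0 /tmp/before /tmp/after || echo "fxp0 interrupts stalled"
```

If the count stays flat while traffic should be flowing, that points at
interrupt delivery rather than at the network stack.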

Robert N M Watson