From: Robert Watson
Date: Mon, 24 Jan 2005 09:53:44 +0000 (GMT)
To: Ganbold
cc: freebsd-current@freebsd.org
Subject: Re: fxp0: device timed out problem
In-Reply-To: <6.2.0.14.2.20050124113106.03402770@202.179.0.80>

On Mon, 24 Jan 2005, Ganbold wrote:

> > > I turned off debug.mpsafenet to 0 and it seems like problem goes away.
> >
> > > Is this problem related to network stack? Or is it related to fxp driver?
> >
> >It's most likely a problem with the device driver or interrupt
> >configuration on your system. There are a couple of other variables you
> >might try frobbing:
> >
> >- Use of ACPI to configure the hardware
> >- Use of "device apic" if the system is non-SMP
>
> I see. One of the server is SMP system and device apic option is used in
> kernel config file. I didn't try device apic on non-SMP machine.

Any luck with disabling ACPI? In particular, are the interrupt
assignments substantially different between booting with ACPI and without?
You can probably just diff -u the old dmesg.boot and the new one...

> >Usually a device timed out error is related to interrupts from the device
> >not being delivered, being delivered improperly, etc. Does your dmesg
> >contain any references to interrupt storms? Once the above message has
> >printed, do you see any further interrupts on the fxp interrupt source
> >when checking intermittently with "systat -vmstat 1" or "vmstat -i"?
>
> I couldn't check the system by issuing those commands. Following is the
> dmesg output with debug.mpsafenet disabled:

Couldn't as in, not possible for administrative reasons, because you
couldn't log in once the failure occurred so couldn't get the output, or
because they don't work, or...? Just want to make sure I understand if
this is an administrative issue or symptomatic.

> I didn't do much investigation on those servers that time. However
> without debug.mpsafenet, servers are working fine for more than 3 weeks.

That is certainly suggestive -- I wonder if we're looking at a locking bug
in fxp0 involving serialization with the hardware. However, it's not
conclusive, I think -- when running MPSAFE, the timing is quite different
on UP as well as SMP hardware, which could trigger other existing bugs.
The big open question, I think, is whether an interrupt delivery problem
is involved.

Robert N M Watson
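
The "vmstat -i" check suggested above (take two samples some seconds apart and see whether the fxp interrupt count still moves) could be scripted roughly as follows. This is only a sketch under stated assumptions: the function names are hypothetical, and it assumes each data line of "vmstat -i" output has the shape "irqN: device  total  rate", with a header line and a final "Total" line. Adjust the pattern if your output differs.

```python
import re

def parse_vmstat_i(text):
    """Return {interrupt_source: total_count} from `vmstat -i`-style text.

    Assumed line shape (not verified against every release):
        irq11: fxp0        123456         12
    The header line and the final "Total" line are skipped.
    """
    counts = {}
    for line in text.splitlines():
        m = re.match(r"^(\S.*?)\s+(\d+)\s+(\d+)\s*$", line)
        if m and m.group(1).lower() != "total":
            counts[m.group(1)] = int(m.group(2))
    return counts

def interrupts_advancing(before, after, device="fxp0"):
    """True if any source naming `device` counted up between two samples."""
    old = parse_vmstat_i(before)
    new = parse_vmstat_i(after)
    return any(new[name] > old.get(name, 0)
               for name in new if device in name)
```

To use it, save the output of "vmstat -i" to two files a few seconds apart after the "device timed out" message has appeared, read both files, and pass their contents to interrupts_advancing(). A False result would be consistent with interrupts from the device no longer being delivered, though it does not by itself say why.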