Date:      Tue, 31 Jan 2006 07:38:48 -0700 (MST)
From:      "M. Warner Losh" <imp@bsdimp.com>
To:        roam@ringlet.net
Cc:        mistry.7@osu.edu, glebius@FreeBSD.org, freebsd-stable@FreeBSD.org
Subject:   Re: dc0: watchdog timeout and nve0: device timeout
Message-ID:  <20060131.073848.39874110.imp@bsdimp.com>
In-Reply-To: <20060131112447.GA1173@straylight.m.ringlet.net>
References:  <20060131091027.CC43516A424@hub.freebsd.org> <20060131083002.GC93773@FreeBSD.org> <20060131112447.GA1173@straylight.m.ringlet.net>

In message: <20060131112447.GA1173@straylight.m.ringlet.net>
            Peter Pentchev <roam@ringlet.net> writes:
: On Tue, 31 Jan 2006 11:30:02 +0300, Gleb Smirnoff wrote:
: > On Tue, Jan 31, 2006 at 03:08:03AM -0500, Anish Mistry wrote:
: > A> After updating to STABLE today I'm getting the following message with 
: > A> my dc and nve NICs every few seconds.  UP, AMD64.  A kernel from last 
: > A> Thursday was fine.
: > A> 
: > A> dc0: watchdog timeout
: > A> nve0: device timeout (4)
: > 
: > Can you try to back out the code in sys/dev/pci to Thursday? If this
: > doesn't help, you probably need to do a binary search in this small
: > timeframe.
: 
: I think I found the problem - the merge was not quite correct, and
: the PCI interrupt rerouting was disabled for some reason.
: 
: Warner, is there a reason for hiding the "Try to re-route interrupts"
: code behind an apparently "ifdef 0" case?  Well, okay, most probably
: there is a reason, since you've done it, but... it breaks my re0 card
: and it also seems to break Anish's hardware :)

I'm pretty sure that's the problem.  I thought I'd specifically
checked to make sure that I didn't merge this :-(

: BTW, the commit message was not quite correct - rev. 1.302 was not
: really merged, it's included in my patch here.  Also, rev. 1.305 of
: pci.c seems to have more than just adding the PCI_FIND_EXTCAP method -
: there are a couple of offset fixes that I also included in the patch
: while trying to come as close to the -CURRENT code as possible; could
: you check if they actually apply to -STABLE?

They do.
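
For anyone following along, the offset fixes only replace magic numbers
with named constants when walking the capability list. Below is a minimal
sketch of that walk over a 256-byte config-space snapshot. It is not the
pci.c code: find_cap() and the snapshot array are hypothetical, and the
sketch skips the status-register check a real driver would do first. The
macro names mirror the patch and the values are the ones I believe
pcireg.h uses.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Offsets within one capability entry: ID byte first, next pointer second. */
#define PCICAP_ID	0x0
#define PCICAP_NEXTPTR	0x1

/* Capability pointer register for header types 0 and 1. */
#define PCIR_CAP_PTR	0x34
/* Capability ID for PCI power management. */
#define PCIY_PMG	0x01

/*
 * Hypothetical helper: look for capability `id` in a config-space
 * snapshot.  Returns the offset of the matching entry, or 0 if the
 * list does not contain it.
 */
static uint8_t
find_cap(const uint8_t cfg[256], uint8_t id)
{
	uint8_t ptr;
	int guard = 48;	/* bound the walk in case the list loops */

	assert(cfg != NULL);
	/* The low two bits of a capability pointer are reserved. */
	ptr = cfg[PCIR_CAP_PTR] & 0xfc;
	while (ptr != 0 && guard-- > 0) {
		if (cfg[ptr + PCICAP_ID] == id)
			return (ptr);
		ptr = cfg[ptr + PCICAP_NEXTPTR] & 0xfc;
	}
	return (0);
}
```

Calling find_cap(cfg, PCIY_PMG) on a snapshot returns the offset of the
power management entry, which is where the PCIR_POWER_CAP read in the
patch context would start from.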

: Anyway, here's a patch that fixes it for me, although most probably
: the __PCI_REROUTE_INTERRUPT chunk should be sufficient.  Warner, if
: you want more details, I could help with debugging this - on my
: system, the re0 card definitely needs this rerouting.  I've posted
: some verbose boot output with explanations at
: http://people.FreeBSD.org/~roam/pcirouting/
: The patch itself is also there in case it gets munged by the mail
: servers along the way.
: 
: Index: src/sys/dev/pci/pci.c
: ===================================================================
: RCS file: /home/ncvs/src/sys/dev/pci/pci.c,v
: retrieving revision 1.292.2.6
: diff -u -r1.292.2.6 pci.c
: --- src/sys/dev/pci/pci.c	30 Jan 2006 18:42:10 -0000	1.292.2.6
: +++ src/sys/dev/pci/pci.c	31 Jan 2006 10:57:32 -0000
: @@ -428,7 +428,7 @@
:  		ptrptr = PCIR_CAP_PTR;
:  		break;
:  	case 2:
: -		ptrptr = 0x14;
: +		ptrptr = PCIR_CAP_PTR_2;
:  		break;
:  	default:
:  		return;		/* no extended capabilities support */
: @@ -447,10 +447,10 @@
:  		}
:  		/* Find the next entry */
:  		ptr = nextptr;
: -		nextptr = REG(ptr + 1, 1);
: +		nextptr = REG(ptr + PCICAP_NEXTPTR, 1);
:  
:  		/* Process this entry */
: -		switch (REG(ptr, 1)) {
: +		switch (REG(ptr + PCICAP_ID, 1)) {
:  		case PCIY_PMG:		/* PCI power management */
:  			if (cfg->pp.pp_cap == 0) {
:  				cfg->pp.pp_cap = REG(ptr + PCIR_POWER_CAP, 2);
: @@ -1040,7 +1040,8 @@
:  	}
:  
:  	if (cfg->intpin > 0 && PCI_INTERRUPT_VALID(cfg->intline)) {
: -#ifdef __PCI_REROUTE_INTERRUPT
: +#if defined(__ia64__) || defined(__i386__) || defined(__amd64__) || \
: +		defined(__arm__) || defined(__alpha__)
:  		/*
:  		 * Try to re-route interrupts. Sometimes the BIOS or
:  		 * firmware may leave bogus values in these registers.
: 
: Hope this helps!

I'm pretty sure that the REROUTE change is the only problem.  It
shouldn't have been committed, and I thought I'd checked for it
specifically before the commit, but I just looked at what I committed
and it slipped by.  This fits the symptoms that I saw on my server
last night (the only difference between a stable boot and an older
stable boot was the IRQs).

The last part of this patch seems to fix things for me.
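
To illustrate why the guard matters, here is a small sketch that can be
compiled on its own. It assumes, as the thread suggests, that nothing in
the -STABLE tree defines __PCI_REROUTE_INTERRUPT, so code behind the old
guard is simply compiled out. The two function names are hypothetical
and exist only for this illustration.

```c
/*
 * Hypothetical probe: returns 1 if code behind the old guard would be
 * compiled in.  Nothing here defines __PCI_REROUTE_INTERRUPT, so the
 * #else branch is what gets built.
 */
static int
old_guard_enabled(void)
{
#ifdef __PCI_REROUTE_INTERRUPT
	return (1);
#else
	return (0);
#endif
}

/*
 * Hypothetical probe: returns 1 if code behind the guard from the
 * patch would be compiled in on the architecture being built for.
 */
static int
new_guard_enabled(void)
{
#if defined(__ia64__) || defined(__i386__) || defined(__amd64__) || \
    defined(__arm__) || defined(__alpha__)
	return (1);
#else
	return (0);
#endif
}
```

On an amd64 build the compiler predefines __amd64__, so the second probe
reports the rerouting code as present while the first does not.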

Warner


