Date:      Tue, 22 Aug 2006 10:35:21 -0400
From:      John Baldwin <jhb@freebsd.org>
To:        freebsd-alpha@freebsd.org
Cc:        bryanh@meridian-enviro.com, pedersen@meridian-enviro.com
Subject:   Re: Problems with UP2000+
Message-ID:  <200608221035.22244.jhb@freebsd.org>
In-Reply-To: <877j19oe9i.wl%rand@meridian-enviro.com>
References:  <877j19oe9i.wl%rand@meridian-enviro.com>

On Tuesday 15 August 2006 17:55, Douglas K. Rand wrote:
> We've got a Microway UP2000+ system that's been working just fine for
> the last year. That is, until it seems to have developed some
> hardware-related problems. It started with:
> 
> dc0: watchdog timeout
> dc0: watchdog timeout
> dc0: watchdog timeout
> dc0: watchdog timeout
> dc0: watchdog timeout
> dc0: watchdog timeout
> dc0: watchdog timeout
> ahc0: Timedout SCBs already complete. Interrupts may not be functioning.
> ahc0: Timedout SCBs already complete. Interrupts may not be functioning.
> dc0: watchdog timeout
> dc0: watchdog timeout
> 
> Interestingly, the system doesn't crash or completely hang. It stops
> for a bit, considers the answer to the ultimate question (it isn't
> fast enough to think about the actual question) and then works for a
> few minutes. Rinse and repeat.
> 
> And then a few hours later it started having SCSI problems:
> 
> ahc0: Recovery Initiated
> >>>>>>>>>>>>>>>>>> Dump Card State Begins <<<<<<<<<<<<<<<<<
> ahc0: Dumping Card State while idle, at SEQADDR 0x18
> Card was paused
> ACCUM = 0x68, SINDEX = 0x48, DINDEX = 0xe4, ARG_2 = 0x1a
> HCNT = 0x0 SCBPTR = 0x68
> SCSISIGI[0xa6]:(REQI|BSYI|MSGI|CDI) ERROR[0x0] SCSIBUSL[0x0]
> LASTPHASE[0x1]:(P_BUSFREE) SCSISEQ[0x1a]:(ENAUTOATNP|ENAUTOATNO|ENRSELI)
> SBLKCTL[0xa]:(SELWIDE|SELBUSB) SCSIRATE[0x0] SEQCTL[0x10]:(FASTMODE)
> SEQ_FLAGS[0xc0]:(NO_CDB_SENT|NOT_IDENTIFIED) SSTAT0[0x0]
> SSTAT1[0x13]:(REQINIT|PHASECHG|PHASEMIS) SSTAT2[0x0]
> SSTAT3[0x0] SIMODE0[0x8]:(ENSWRAP) SIMODE1[0xa4]:(ENSCSIPERR|ENSCSIRST|ENSELTIMO)
> SXFRCTL0[0x80]:(DFON) DFCNTRL[0x0] DFSTATUS[0x89]:(FIFOEMP|HDONE|PRELOAD_AVAIL)
> STACK: 0x0 0x154 0x16a 0x17
> SCB count = 192
> Kernel NEXTQSCB = 107
> Card NEXTQSCB = 107
> QINFIFO entries:
> Waiting Queue entries: 104:104
> Disconnected Queue entries:
> QOUTFIFO entries:
> Sequencer Free SCB List:
> Sequencer SCB Info:
> 
> Well, the first thing we tried was to replace the NIC. Got an fxp from the
> shelf and tried that. It took 5 hours for it to have problems:
> 
> ahc0: Timedout SCBs already complete. Interrupts may not be functioning.
> ahc0: Timedout SCBs already complete. Interrupts may not be functioning.
> fxp0: device timeout
> fxp0: device timeout
> 
> I had heard that the onboard SCSI sometimes goes bad on these
> motherboards, so I grabbed an Adaptec 2940UW from the shelf and tried
> that. (Lucky for me the BIOS was "new" enough to be able to boot from
> the 2940UW.) That lasted about 57 hours, but still ended up with the
> same problem:
> 
> fxp0: device timeout
> ahc1: Timedout SCBs already complete. Interrupts may not be functioning.
> ahc1: Timedout SCBs already complete. Interrupts may not be functioning.
> fxp0: device timeout
> ahc1: Timedout SCBs already complete. Interrupts may not be functioning.
> ahc1: Timedout SCBs already complete. Interrupts may not be functioning.
> ahc1:A:1: no active SCB for reconnecting target - issuing BUS DEVICE RESET
> SAVED_SCSIID == 0x17, SAVED_LUN == 0x0, ARG_1 == 0x17 ACCUM = 0x0
> SEQ_FLAGS == 0xc0, SCBPTR == 0x6, BTT == 0xff, SINDEX == 0x31
> SCSIID == 0x17, SCB_SCSIID == 0x17, SCB_LUN == 0x0, SCB_TAG == 0xff, SCB_CONTROL == 0x0
> SCSIBUSL == 0x17, SCSISIGI == 0xe6
> SXFRCTL0 == 0x88
> SEQCTL == 0x10
> 
> We are now in the process of trying different PCI slots for things, so
> far without any luck. We are also trying the system with one of the three
> power supplies turned off.

It sounds like interrupts have stopped working.  A couple of questions for 
you:
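Why would the dc0/fxp0 watchdog messages and the ahc "Timedout SCBs already complete" message all point at interrupts? A toy Python model may make the reasoning concrete. Everything in it (the ToyController class, the 5-second watchdog, tag 104) is hypothetical and far simpler than the real dc(4), fxp(4) and ahc(4) drivers. It only assumes that the hardware posts finished requests to a completion queue and then raises an interrupt whose handler reaps them and disarms a NIC-style watchdog. If the handler never runs, the watchdog expires even though the work is done, and a timeout routine that inspects the completion queue finds the "timed-out" command already finished.

```python
from collections import deque
from typing import List, Optional


class ToyController:
    """Hypothetical model of a controller plus its driver; not real FreeBSD code."""

    WATCHDOG_SECS = 5  # illustrative value only

    def __init__(self, interrupts_work: bool) -> None:
        self.interrupts_work = interrupts_work
        self.outstanding = set()         # requests the driver is waiting on
        self.completion_queue = deque()  # requests the hardware has finished
        self.watchdog = 0                # seconds left; 0 means disarmed

    def submit(self, tag: int) -> None:
        self.outstanding.add(tag)
        self.watchdog = self.WATCHDOG_SECS
        # The hardware finishes the request immediately and posts it...
        self.completion_queue.append(tag)
        # ...then raises an interrupt, which may never reach the handler.
        if self.interrupts_work:
            self.intr()

    def intr(self) -> None:
        """Interrupt handler: reap completions and disarm the watchdog."""
        while self.completion_queue:
            self.outstanding.discard(self.completion_queue.popleft())
        if not self.outstanding:
            self.watchdog = 0

    def tick(self) -> Optional[str]:
        """Once-per-second NIC-style watchdog check."""
        if self.watchdog:
            self.watchdog -= 1
            if self.watchdog == 0:
                return "watchdog timeout"
        return None

    def command_timeout(self) -> str:
        """SCSI-style timeout: were the timed-out requests already finished?"""
        if any(tag in self.completion_queue for tag in self.outstanding):
            self.intr()  # complete them by polling instead of by interrupt
            return ("Timedout SCBs already complete. "
                    "Interrupts may not be functioning.")
        return "Recovery Initiated"


def simulate(interrupts_work: bool, seconds: int = 10) -> List[str]:
    """Submit one request, let time pass, and collect the driver's messages."""
    dev = ToyController(interrupts_work)
    dev.submit(tag=104)
    msgs = [m for m in (dev.tick() for _ in range(seconds)) if m]
    if dev.outstanding:  # the SCSI-style timeout only fires for unfinished work
        msgs.append(dev.command_timeout())
    return msgs


if __name__ == "__main__":
    print(simulate(interrupts_work=True))  # []
    print(simulate(interrupts_work=False))
```

With interrupts delivered, the request is reaped at once and nothing is logged; with them lost, the same request yields both a watchdog timeout and the "already complete" message, the same combination seen in the logs quoted above.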

1) Does it still happen if you disable SMP (set kern.smp.disabled=1 in the 
loader to test)?

2) Does it still happen if you remove PREEMPTION from your kernel config?  
(Can't recall if that was removed in 6.x on Alpha before or after 6.1)

-- 
John Baldwin


