From: John Baldwin <jhb@freebsd.org>
To: freebsd-alpha@freebsd.org
Cc: bryanh@meridian-enviro.com, pedersen@meridian-enviro.com
Subject: Re: Problems with UP2000+
Date: Tue, 22 Aug 2006 10:35:21 -0400
Message-Id: <200608221035.22244.jhb@freebsd.org>
In-Reply-To: <877j19oe9i.wl%rand@meridian-enviro.com>

On Tuesday 15 August 2006 17:55, Douglas K. Rand wrote:
> We've got a Microway UP2000+ system that's been working just fine for
> the last year. That is, until it seems to have developed some hardware
> related problems. It started with:
>
> dc0: watchdog timeout
> dc0: watchdog timeout
> dc0: watchdog timeout
> dc0: watchdog timeout
> dc0: watchdog timeout
> dc0: watchdog timeout
> dc0: watchdog timeout
> ahc0: Timedout SCBs already complete. Interrupts may not be functioning.
> ahc0: Timedout SCBs already complete. Interrupts may not be functioning.
> dc0: watchdog timeout
> dc0: watchdog timeout
>
> Interestingly the system doesn't crash or completely hang. It stops
> for a bit, considers the answer to the ultimate question (it isn't
> fast enough to think about the actual question) and then works for a
> few minutes. Rinse and repeat.
>
> And then a few hours later it started having SCSI problems:
>
> ahc0: Recovery Initiated
> >>>>>>>>>>>>>>>>>> Dump Card State Begins <<<<<<<<<<<<<<<<<
> ahc0: Dumping Card State while idle, at SEQADDR 0x18
> Card was paused
> ACCUM = 0x68, SINDEX = 0x48, DINDEX = 0xe4, ARG_2 = 0x1a
> HCNT = 0x0 SCBPTR = 0x68
> SCSISIGI[0xa6]:(REQI|BSYI|MSGI|CDI) ERROR[0x0] SCSIBUSL[0x0]
> LASTPHASE[0x1]:(P_BUSFREE) SCSISEQ[0x1a]:(ENAUTOATNP|ENAUTOATNO|ENRSELI)
> SBLKCTL[0xa]:(SELWIDE|SELBUSB) SCSIRATE[0x0] SEQCTL[0x10]:(FASTMODE)
> SEQ_FLAGS[0xc0]:(NO_CDB_SENT|NOT_IDENTIFIED) SSTAT0[0x0]
> SSTAT1[0x13]:(REQINIT|PHASECHG|PHASEMIS) SSTAT2[0x0]
> SSTAT3[0x0] SIMODE0[0x8]:(ENSWRAP) SIMODE1[0xa4]:(ENSCSIPERR|ENSCSIRST|ENSELTIMO)
> SXFRCTL0[0x80]:(DFON) DFCNTRL[0x0] DFSTATUS[0x89]:(FIFOEMP|HDONE|PRELOAD_AVAIL)
> STACK: 0x0 0x154 0x16a 0x17
> SCB count = 192
> Kernel NEXTQSCB = 107
> Card NEXTQSCB = 107
> QINFIFO entries:
> Waiting Queue entries: 104:104
> Disconnected Queue entries:
> QOUTFIFO entries:
> Sequencer Free SCB List:
> Sequencer SCB Info:
>
> Well, first thing we tried was to replace the NIC. Got a fxp from the
> shelf and tried that. It took 5 hours for it to have problems:
>
> ahc0: Timedout SCBs already complete. Interrupts may not be functioning.
> ahc0: Timedout SCBs already complete. Interrupts may not be functioning.
> fxp0: device timeout
> fxp0: device timeout
>
> I had heard that the onboard SCSI sometimes go bad on these
> motherboards, so I grabbed an Adaptec 2940UW from the shelf and tried
> that. (Lucky for me the BIOS was "new" enough to be able to boot from
> the 2940UW.) That lasted about 57 hours, but still ended up with the
> same problem:
>
> fxp0: device timeout
> ahc1: Timedout SCBs already complete. Interrupts may not be functioning.
> ahc1: Timedout SCBs already complete. Interrupts may not be functioning.
> fxp0: device timeout
> ahc1: Timedout SCBs already complete. Interrupts may not be functioning.
> ahc1: Timedout SCBs already complete. Interrupts may not be functioning.
> ahc1:A:1: no active SCB for reconnecting target - issuing BUS DEVICE RESET
> SAVED_SCSIID == 0x17, SAVED_LUN == 0x0, ARG_1 == 0x17 ACCUM = 0x0
> SEQ_FLAGS == 0xc0, SCBPTR == 0x6, BTT == 0xff, SINDEX == 0x31
> SCSIID == 0x17, SCB_SCSIID == 0x17, SCB_LUN == 0x0, SCB_TAG == 0xff, SCB_CONTROL == 0x0
> SCSIBUSL == 0x17, SCSISIGI == 0xe6
> SXFRCTL0 == 0x88
> SEQCTL == 0x10
>
> We are now in the process of trying different PCI slots for things, so
> far with out any luck. And trying the system with one of the three
> power supplies turned off.

It sounds like interrupts have stopped working. A couple of questions for you:

1) Does it still happen if you disable SMP (set kern.smp.disabled=1 in the
   loader to test)?

2) Does it still happen if you remove PREEMPTION from your kernel config?
   (Can't recall if that was removed in 6.x on Alpha before or after 6.1)

-- 
John Baldwin
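
[Editor's note] The two tests suggested in the reply could be set up roughly as follows. This is only a sketch, not something taken from the original message. It assumes the stock FreeBSD loader and a custom kernel config file that may or may not contain the PREEMPTION option (the reply itself is unsure whether it was still present on Alpha in 6.1). The lines starting with "#" are annotations or config-file comments, not commands to type at the loader prompt.

```
# Test 1, one-off: at the interactive loader prompt before booting
set kern.smp.disabled=1
boot

# Test 1, persistent: the equivalent line in /boot/loader.conf
kern.smp.disabled="1"

# Test 2: in the kernel config file, comment out the option if it is
# present, then rebuild and install the kernel
#options 	PREEMPTION
```

Trying the loader tunable first is the cheaper experiment, since it needs only a reboot rather than a kernel rebuild.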