Date:      Wed, 8 Jul 1998 09:17:13 -0700 (PDT)
From:      David Wolfskill <dhw@whistle.com>
To:        freebsd-questions@FreeBSD.ORG
Subject:   Help diagnosing hardware (SCSI) problems?
Message-ID:  <199807081617.JAA06920@pau-amma.whistle.com>

Maybe this would be more appropriate in -hardware... but I'm not sure.

Anyway, I've got a situation where a primary fileserver is misbehaving
(sometimes, to the point of hanging -- going completely catatonic) after
spitting out error messages such as the following:

Jul  7 11:38:19 shrimp /kernel: sd2(ahc0:2:0): SCB 0x3 - timed out while idle, LASTPHASE == 0x1, SCSISIGI == 0x8
Jul  7 11:38:27 shrimp /kernel: SEQADDR = 0x4 SCSISEQ = 0x5a SSTAT0 = 0x5 SSTAT1 = 0xa
Jul  7 11:38:27 shrimp /kernel: sd2(ahc0:2:0): SCB 3: Immediate reset.  Flags = 0x1
Jul  7 11:38:27 shrimp /kernel: sd2(ahc0:2:0): no longer in timeout
Jul  7 11:38:27 shrimp /kernel: ahc0: Issued Channel A Bus Reset. 5 SCBs aborted
Jul  7 11:38:27 shrimp /kernel: sd1(ahc0:1:0): UNIT ATTENTION asc:29,2 
Jul  7 11:38:27 shrimp /kernel: , retries:3
Jul  7 11:38:27 shrimp /kernel: ahc0:A:0: refuses WIDE negotiation.  Using 8bit transfers
Jul  7 11:38:27 shrimp /kernel: sd0(ahc0:0:0): UNIT ATTENTION asc:29,0
Jul  7 11:38:27 shrimp /kernel: sd0(ahc0:0:0):  Power on, reset, or bus device reset occurred
Jul  7 11:38:27 shrimp /kernel: , retries:3
Jul  7 11:38:27 shrimp /kernel: sd2(ahc0:2:0): UNIT ATTENTION asc:29,0
Jul  7 11:38:27 shrimp /kernel: sd2(ahc0:2:0):  Power on, reset, or bus device reset occurred
Jul  7 11:38:27 shrimp /kernel: , retries:3


I'll try describing the situation & what I've done.  This will be rather
long (~207 lines); sorry, but I don't see much alternative, and you've
been warned....  I should point out, too, that I'm relatively unfamiliar
with PC hardware (or PC anything else, for that matter); I'm rather more
familiar with Sun workstations & IBM mainframes....

The machine ("shrimp") is running 2.2.6-RELEASE.  It has a couple of
Adaptec SCSI host adapters (a 2940UW as ahc0, & a 2940 as ahc1).

ahc0 is used strictly for the internal devices, and has connections
(only) to its internal connectors (yes, plural):  the boot drive (sd0) &
the CD (cd0) are both narrow devices, and are connected to the
(internal) narrow connector; sd0 is furthest from ahc0 on this leg, and
is terminated.  There are 2 wide disk drives (sd1 & sd2) connected to
the wide (internal) connector; sd1 is furthest from ahc0, and is
terminated.  ahc0 itself is set to "high on/low off" termination.

ahc1 is used strictly for the external devices (a couple of HP disk drives
(sd3 & sd4) and a couple of HP DAT drives (st0 & st1)).  st1 is furthest
from ahc1, and is (externally) terminated.  ahc1 itself relies on Adaptec's
default "automatic" termination, which has not seemed to be a problem in
such a configuration previously.

I'll append an excerpt from /var/log/messages for the most recent boot
after my signature; it should validate the above deathless prose.

Based on the error messages I get on the machine, I came in this morning
to do some literal hardware-hacking:  I had become concerned that the
total length of the SCSI cables on ahc0 might well be too much, so I
halted the machine, powered it off, and used an Xacto knife to chop
excess ribbon cable off.  I thus trimmed about 18" off of the narrow
cable (leaving about 17" in the machine), and about 8" off of the wide
cable (leaving about 17" in the machine).
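
As a rough sanity check on the cable lengths just described, here is a small sketch of the arithmetic.  It assumes that the two internal legs form a single bus with ahc0 in the middle, and it uses commonly quoted single-ended length limits (3 m for Fast SCSI, 1.5 m for Ultra with more than four devices); those limits are assumptions brought in for illustration, not figures from this message, and stub lengths are ignored.

```python
INCH_TO_M = 0.0254

# Lengths from the message, in inches: (trimmed off, left in the machine).
legs = {"narrow": (18, 17), "wide": (8, 17)}

# Assumption: both legs share one bus, so their lengths add up.
before_m = sum(cut + kept for cut, kept in legs.values()) * INCH_TO_M
after_m = sum(kept for _cut, kept in legs.values()) * INCH_TO_M

# Assumed single-ended limits (commonly quoted, not from the message).
LIMIT_FAST_M = 3.0
LIMIT_ULTRA_M = 1.5

for label, length in (("before trimming", before_m), ("after trimming", after_m)):
    print(f"{label}: {length:.2f} m "
          f"(Fast limit ok: {length <= LIMIT_FAST_M}, "
          f"Ultra limit ok: {length <= LIMIT_ULTRA_M})")
```

Under those assumptions the untrimmed cabling sat right at the stricter limit, and the trimmed cabling is comfortably inside it, so length alone would not obviously explain errors that persist after the trim.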

In the process, I found that some of the pins in the SCSI cable
connectors had been bent.  This obviously did no one any good....  :-(
Anyway, I managed to un-bend them, carefully(!) re-connected them, pulled
them, and inspected them (to see if any pins were bent); when everything
looked good, I carefully re-connected them again.  (Suppose there's a
market for a mechanism that would allow one to test for bent pins while
leaving the connector in place...?)  I first found such pins on the wide
cable; after fixing that, I then found some bent pins on the external
(narrow) cable, and used the same procedure on them (successfully, as far
as I know).

Although the machine hasn't gone belly-up (yet!), it has been
whimpering:


Jul  8 08:34:26 shrimp /kernel: sd2(ahc0:2:0): SCB 0x0 - timed out while idle, LASTPHASE == 0x1, SCSISIGI == 0x8
Jul  8 08:34:28 shrimp /kernel: SEQADDR = 0x6 SCSISEQ = 0x12 SSTAT0 = 0x5 SSTAT1 = 0xa
Jul  8 08:34:28 shrimp /kernel: sd2(ahc0:2:0): Queueing an Abort SCB
Jul  8 08:34:28 shrimp /kernel: sd2(ahc0:2:0): SCB 0x2 timedout while recovery in progress
Jul  8 08:34:28 shrimp /kernel: sd2(ahc0:2:0): SCB 0x0 - timed out while idle, LASTPHASE == 0x1, SCSISIGI == 0x8
Jul  8 08:34:28 shrimp /kernel: SEQADDR = 0x5 SCSISEQ = 0x5a SSTAT0 = 0x5 SSTAT1 = 0xa
Jul  8 08:34:28 shrimp /kernel: sd2(ahc0:2:0): no longer in timeout
Jul  8 08:34:28 shrimp /kernel: ahc0: Issued Channel A Bus Reset. 2 SCBs aborted
Jul  8 08:34:28 shrimp /kernel: sd2(ahc0:2:0): UNIT ATTENTION asc:29,0
Jul  8 08:34:28 shrimp /kernel: sd2(ahc0:2:0):  Power on, reset, or bus device reset occurred
Jul  8 08:34:28 shrimp /kernel: , retries:3
Jul  8 08:34:28 shrimp /kernel: ahc0:A:0: refuses WIDE negotiation.  Using 8bit transfers
Jul  8 08:34:28 shrimp /kernel: sd0(ahc0:0:0): UNIT ATTENTION asc:29,0
Jul  8 08:34:28 shrimp /kernel: sd0(ahc0:0:0):  Power on, reset, or bus device reset occurred
Jul  8 08:34:28 shrimp /kernel: , retries:4
Jul  8 08:34:33 shrimp /kernel: sd1(ahc0:1:0): UNIT ATTENTION asc:29,2 
Jul  8 08:34:33 shrimp /kernel: , retries:4


At this point, I'm getting fairly frustrated.  :-(

I should mention, also, that when it hung & died yesterday, the SCSI probe
at boot failed to detect the existence of the CD drive.  A power-on-reset
seems to "cure" that -- though the controller's failure to see the device
(absent the POR) is troubling, to say the least.

And the whimpers always occur in conjunction with activity on sd2 (one of
the wide devices on ahc0).
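
One way to check the "always sd2" observation against a whole log file is to tally the timeout messages per device.  The following is only a sketch: the regular expression is written against the message formats quoted above (including the "timedout" spelling), the function name is made up for illustration, and other driver versions may word these messages differently.

```python
import re
from collections import Counter

# Matches lines such as
#   "... /kernel: sd2(ahc0:2:0): SCB 0x3 - timed out while idle, ..."
#   "... /kernel: sd2(ahc0:2:0): SCB 0x2 timedout while recovery in progress"
TIMEOUT_RE = re.compile(r"/kernel: (\w+)\(ahc\d+:\d+:\d+\): SCB \S+ .*timed ?out")

def count_timeouts(lines):
    """Return a Counter mapping device name (e.g. 'sd2') to timeout count."""
    counts = Counter()
    for line in lines:
        match = TIMEOUT_RE.search(line)
        if match:
            counts[match.group(1)] += 1
    return counts

# A few lines copied from the excerpt above.
sample = [
    "Jul  8 08:34:26 shrimp /kernel: sd2(ahc0:2:0): SCB 0x0 - timed out while idle, LASTPHASE == 0x1, SCSISIGI == 0x8",
    "Jul  8 08:34:28 shrimp /kernel: sd2(ahc0:2:0): Queueing an Abort SCB",
    "Jul  8 08:34:28 shrimp /kernel: sd2(ahc0:2:0): SCB 0x2 timedout while recovery in progress",
    "Jul  8 08:34:28 shrimp /kernel: sd2(ahc0:2:0): SCB 0x0 - timed out while idle, LASTPHASE == 0x1, SCSISIGI == 0x8",
    "Jul  8 08:34:28 shrimp /kernel: sd2(ahc0:2:0): no longer in timeout",
]

print(count_timeouts(sample))
```

To run it over the real log, pass an open file instead of the sample list, e.g. `count_timeouts(open("/var/log/messages"))`.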

I'm wondering if it *might* be reasonable to change the topology somewhat
-- to connect the internal narrow SCSI devices to the internal connector
of ahc1 (vs. ahc0).  I've already changed the SCSI target IDs on the
external HP disk drives to 2 & 3, so there is no target ID duplication
among the narrow devices.  The intent here is to:

* try to isolate the failure somewhat;

* allow ahc0 to use wide transfers, thus improving performance (assuming
  it doesn't crash & burn).  :-(
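
The target ID claim for the proposed layout can be double-checked mechanically.  This sketch takes the device/target assignments from the boot messages at the end of this message, moves sd0 and cd0 to ahc1, and looks for clashes; the helper name and data layout are made up for illustration.

```python
from collections import defaultdict

HOST_ID = 7  # both adapters report "SCSI Id=7" in the boot messages

# (device, adapter, target) as probed at the most recent boot.
current = [
    ("sd0", "ahc0", 0), ("sd1", "ahc0", 1), ("sd2", "ahc0", 2), ("cd0", "ahc0", 6),
    ("sd3", "ahc1", 2), ("sd4", "ahc1", 3), ("st0", "ahc1", 4), ("st1", "ahc1", 5),
]

# Proposed change: the narrow internal devices move from ahc0 to ahc1.
MOVED = {"sd0", "cd0"}
proposed = [(dev, "ahc1" if dev in MOVED else bus, target)
            for dev, bus, target in current]

def conflicts(layout):
    """Return {(adapter, target): [devices]} for duplicated or host-owned IDs."""
    seen = defaultdict(list)
    for dev, bus, target in layout:
        seen[(bus, target)].append(dev)
    return {key: devs for key, devs in seen.items()
            if len(devs) > 1 or key[1] == HOST_ID}

print(conflicts(proposed))
```

An empty result means no two devices share a target on the same adapter and none collides with the adapter's own ID.  The check says nothing about termination, which would also change if the internal connector of ahc1 came into use.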

I would think that if I do this -- assuming it even makes sense to try
it -- I'll need to "wire down" some of the devices in the kernel, since
the boot drive is one of the narrow devices that would be moved.  And is
there a way to designate which controller would be ahc0 vs. ahc1?  Is it
related to the slot in which the cards are inserted on the system board?

Of course, doing this puts a lot of devices on the narrow controller --
but the CD drive is almost never used, and no more than one tape drive
is in use at any time (well, nearly always), and the tape drive that
*is* in use isn't in use all that much -- it will be used for "amanda"
backups, during (otherwise) comparatively quiescent periods.  So that
leaves 3 disk drives that would be fairly active on the bus.

And is there any software around that may be used to do any sort of
hardware reality checking or validation or diagnostic procedures?  (I'd
vastly prefer something that doesn't require a Microsoft environment,
since my previous (admittedly limited) experience with any and all such
things has been very negative.)

Of course, the basic objective is to get this machine so it just quietly
does its job; any hints or suggestions toward that end will be most
appreciated.

Thanks,
david
-- 
David Wolfskill		UNIX System Administrator
dhw@whistle.com		voice: (650) 577-7158	pager: (650) 371-4621


Excerpt from /var/log/messages (most recent boot):

Jul  8 04:03:42 shrimp login: ROOT LOGIN (root) ON ttyv0
Jul  8 04:05:42 shrimp halt: halted by root
Jul  8 04:05:43 shrimp syslogd: exiting on signal 15
Jul  8 04:49:36 shrimp /kernel: Copyright (c) 1992-1998 FreeBSD Inc.
Jul  8 04:49:36 shrimp /kernel: Copyright (c) 1982, 1986, 1989, 1991, 1993
Jul  8 04:49:36 shrimp /kernel:         The Regents of the University of California.  All rights reserved.
Jul  8 04:49:36 shrimp /kernel: 
Jul  8 04:49:36 shrimp /kernel: FreeBSD 2.2.6-RELEASE #0: Mon Apr 13 06:54:08 PDT 1998
Jul  8 04:49:36 shrimp /kernel:     dhw@dhw-test1.whistle.com:/usr/src/sys/compile/SHRIMP
Jul  8 04:49:36 shrimp /kernel: CPU: Pentium Pro (199.31-MHz 686-class CPU)
Jul  8 04:49:36 shrimp /kernel:   Origin = "GenuineIntel"  Id = 0x617  Stepping=7
Jul  8 04:49:36 shrimp /kernel:   Features=0xf9ff<FPU,VME,DE,PSE,TSC,MSR,PAE,MCE,CX8,SEP,MTRR,PGE,MCA,CMOV>
Jul  8 04:49:36 shrimp /kernel: real memory  = 33554432 (32768K bytes)
Jul  8 04:49:36 shrimp /kernel: avail memory = 30621696 (29904K bytes)
Jul  8 04:49:36 shrimp /kernel: Probing for devices on PCI bus 0:
Jul  8 04:49:36 shrimp /kernel: chip0 <Intel 82440FX (Natoma) PCI and memory controller> rev 2 on pci0:0:0
Jul  8 04:49:36 shrimp /kernel: chip1 <Intel 82371SB PCI-ISA bridge> rev 1 on pci0:1:0
Jul  8 04:49:36 shrimp /kernel: chip2 <Intel 82371SB IDE interface> rev 0 on pci0:1:1
Jul  8 04:49:36 shrimp /kernel: pci0:1:2: Intel Corporation, device=0x7020, class=serial, subclass=0x03 int d irq 12 [no driver assigned]
Jul  8 04:49:36 shrimp /kernel: ahc0 <Adaptec 2940 Ultra SCSI host adapter> rev 0 int a irq 12 on pci0:9:0
Jul  8 04:49:36 shrimp /kernel: ahc0: aic7880 Wide Channel, SCSI Id=7, 16 SCBs
Jul  8 04:49:36 shrimp /kernel: ahc0 waiting for scsi devices to settle
Jul  8 04:49:36 shrimp /kernel: ahc0:A:0: refuses WIDE negotiation.  Using 8bit transfers
Jul  8 04:49:36 shrimp /kernel: (ahc0:0:0): "QUANTUM FIREBALL1080S 1Q09" type 0 fixed SCSI 2
Jul  8 04:49:36 shrimp /kernel: sd0(ahc0:0:0): Direct-Access 1042MB (2134305 512 byte sectors)
Jul  8 04:49:36 shrimp /kernel: (ahc0:1:0): "Quantum XP31070W L912" type 0 fixed SCSI 2
Jul  8 04:49:36 shrimp /kernel: sd1(ahc0:1:0): Direct-Access 1075MB (2203480 512 byte sectors)
Jul  8 04:49:36 shrimp /kernel: (ahc0:2:0): "MICROP 4691WS T171" type 0 fixed SCSI 2
Jul  8 04:49:36 shrimp /kernel: sd2(ahc0:2:0): Direct-Access 8681MB (17780058 512 byte sectors)
Jul  8 04:49:36 shrimp /kernel: (ahc0:6:0): "PLEXTOR CD-ROM PX-6XCS 2.06" type 5 removable SCSI 2
Jul  8 04:49:36 shrimp /kernel: cd0(ahc0:6:0): CD-ROM can't get the size
Jul  8 04:49:36 shrimp /kernel: de0 <Digital 21140 Fast Ethernet> rev 18 int a irq 10 on pci0:10:0
Jul  8 04:49:36 shrimp /kernel: de0: SMC 9332DST 21140 [10-100Mb/s] pass 1.2
Jul  8 04:49:36 shrimp /kernel: de0: address 00:00:c0:7f:11:ed
Jul  8 04:49:36 shrimp /kernel: de0: enabling 100baseTX port
Jul  8 04:49:36 shrimp /kernel: ahc1 <Adaptec 2940 SCSI host adapter> rev 3 int a irq 11 on pci0:11:0
Jul  8 04:49:36 shrimp /kernel: ahc1: aic7870 Single Channel, SCSI Id=7, 16 SCBs
Jul  8 04:49:36 shrimp /kernel: ahc1 waiting for scsi devices to settle
Jul  8 04:49:36 shrimp /kernel: (ahc1:2:0): "HP C3725S 6039" type 0 fixed SCSI 2
Jul  8 04:49:36 shrimp /kernel: sd3(ahc1:2:0): Direct-Access 2047MB (4194058 512 byte sectors)
Jul  8 04:49:36 shrimp /kernel: (ahc1:3:0): "HP C3725S 6039" type 0 fixed SCSI 2
Jul  8 04:49:36 shrimp /kernel: sd4(ahc1:3:0): Direct-Access 2047MB (4194058 512 byte sectors)
Jul  8 04:49:36 shrimp /kernel: (ahc1:4:0): "HP C1533A 9406" type 1 removable SCSI 2
Jul  8 04:49:36 shrimp /kernel: st0(ahc1:4:0): Sequential-Access density code 0x24, variable blocks, write-enabled
Jul  8 04:49:36 shrimp /kernel: (ahc1:5:0): "HP C1533A 9503" type 1 removable SCSI 2
Jul  8 04:49:36 shrimp /kernel: st1(ahc1:5:0): Sequential-Access density code 0x24, variable blocks, write-enabled
Jul  8 04:49:36 shrimp /kernel: vga0 <VGA-compatible display device> rev 1 int a irq 12 on pci0:13:0
Jul  8 04:49:36 shrimp /kernel: Probing for devices on the ISA bus:
Jul  8 04:49:36 shrimp /kernel: sc0 at 0x60-0x6f irq 1 on motherboard
Jul  8 04:49:36 shrimp /kernel: sc0: VGA color <16 virtual consoles, flags=0x0>
Jul  8 04:49:36 shrimp /kernel: sio0 at 0x3f8-0x3ff irq 4 on isa
Jul  8 04:49:36 shrimp /kernel: sio0: type 16550A
Jul  8 04:49:36 shrimp /kernel: sio1 at 0x2f8-0x2ff irq 3 on isa
Jul  8 04:49:36 shrimp /kernel: sio1: type 16550A
Jul  8 04:49:36 shrimp /kernel: lpt0 at 0x378-0x37f irq 7 on isa
Jul  8 04:49:36 shrimp /kernel: lpt0: Interrupt-driven port
Jul  8 04:49:36 shrimp /kernel: lp0: TCP/IP capable interface
Jul  8 04:49:36 shrimp /kernel: fdc0 at 0x3f0-0x3f7 irq 6 drq 2 on isa
Jul  8 04:49:36 shrimp /kernel: fdc0: FIFO enabled, 8 bytes threshold
Jul  8 04:49:36 shrimp /kernel: fd0: 1.44MB 3.5in
Jul  8 04:49:36 shrimp /kernel: bt0 not found at 0x330
Jul  8 04:49:36 shrimp /kernel: npx0 flags 0x1 on motherboard
Jul  8 04:49:36 shrimp /kernel: npx0: INT 16 interface
Jul  8 04:49:36 shrimp /kernel: changing root device to st0s1a



