From: David Wolfskill
Date: Wed, 8 Jul 1998 09:17:13 -0700 (PDT)
Message-Id: <199807081617.JAA06920@pau-amma.whistle.com>
To: freebsd-questions@FreeBSD.ORG
Subject: Help diagnosing hardware (SCSI) problems?

Maybe this would be more appropriate in -hardware... but I'm not sure.

Anyway, I've got a situation where a primary fileserver is misbehaving (sometimes, to the point of hanging -- going completely catatonic) after spitting out error messages such as the following:

Jul 7 11:38:19 shrimp /kernel: sd2(ahc0:2:0): SCB 0x3 - timed out while idle, LASTPHASE == 0x1, SCSISIGI == 0x8
Jul 7 11:38:27 shrimp /kernel: SEQADDR = 0x4 SCSISEQ = 0x5a SSTAT0 = 0x5 SSTAT1 = 0xa
Jul 7 11:38:27 shrimp /kernel: sd2(ahc0:2:0): SCB 3: Immediate reset. Flags = 0x1
Jul 7 11:38:27 shrimp /kernel: sd2(ahc0:2:0): no longer in timeout
Jul 7 11:38:27 shrimp /kernel: ahc0: Issued Channel A Bus Reset. 5 SCBs aborted
Jul 7 11:38:27 shrimp /kernel: sd1(ahc0:1:0): UNIT ATTENTION asc:29,2
Jul 7 11:38:27 shrimp /kernel: , retries:3
Jul 7 11:38:27 shrimp /kernel: ahc0:A:0: refuses WIDE negotiation. Using 8bit transfers
Jul 7 11:38:27 shrimp /kernel: sd0(ahc0:0:0): UNIT ATTENTION asc:29,0
Jul 7 11:38:27 shrimp /kernel: sd0(ahc0:0:0): Power on, reset, or bus device reset occurred
Jul 7 11:38:27 shrimp /kernel: , retries:3
Jul 7 11:38:27 shrimp /kernel: sd2(ahc0:2:0): UNIT ATTENTION asc:29,0
Jul 7 11:38:27 shrimp /kernel: sd2(ahc0:2:0): Power on, reset, or bus device reset occurred
Jul 7 11:38:27 shrimp /kernel: , retries:3

I'll try describing the situation & what I've done. This will be rather long (~207 lines); sorry, but I don't see much alternative, and you've been warned....

I should point out, too, that I'm relatively unfamiliar with PC hardware (or PC anything else, for that matter); I'm rather more familiar with Sun workstations & IBM mainframes....

The machine ("shrimp") is running 2.2.6-RELEASE. It has a couple of (adaptec) SCSI host adapters (a 2940UW as ahc0, & a 2940 as ahc1).

ahc0 is used strictly for the internal devices, and has connections (only) to its internal connectors (yes, plural): the boot drive (sd0) & the CD (cd0) are both narrow devices, and are connected to the (internal) narrow connector; sd0 is furthest from ahc0 on this leg, and is terminated. There are 2 wide disk drives (sd1 & sd2) connected to the wide (internal) connector; sd1 is furthest from ahc0, and is terminated. ahc0 itself is set to "high on/low off" termination.

ahc1 is used strictly for the external devices (a couple of HP disk drives (sd3 & sd4) and a couple of HP DAT drives (st0 & st1)). st1 is furthest from ahc1, and is (externally) terminated. ahc1 itself relies on Adaptec's default "automatic" termination, which has not seemed to be a problem in such a configuration previously.

I'll append an excerpt from /var/log/messages for the most recent boot after my signature; it should validate the above deathless prose.

Based on the error messages I get on the machine, I came in this morning to do some literal hardware-hacking: I had become concerned that the total length of the SCSI cables on ahc0 might well be too much, so I halted the machine, powered it off, and used an Xacto knife to chop excess ribbon cable off. I thus trimmed about 18" off of the narrow cable (leaving about 17" in the machine), and about 8" off of the wide cable (leaving about 17" in the machine).

In the process, I found that some of the pins in the SCSI cable connectors had been bent. This obviously did no one any good.... :-(

Anyway, I managed to un-bend them, carefully(!) re-connect them, pull them, inspect them (to see if any pins were bent); and when everything looked good, carefully re-connected them again. (Suppose there's a market for a mechanism that would allow one to test for bent pins while leaving the connector in place...?)

I first found such pins on the wide cable; after fixing that, I then found some bent pins on the external (narrow) cable, and used the same procedure on them (successfully, as far as I know).

Although the machine hasn't gone belly-up (yet!), it has been whimpering:

Jul 8 08:34:26 shrimp /kernel: sd2(ahc0:2:0): SCB 0x0 - timed out while idle, LASTPHASE == 0x1, SCSISIGI == 0x8
Jul 8 08:34:28 shrimp /kernel: SEQADDR = 0x6 SCSISEQ = 0x12 SSTAT0 = 0x5 SSTAT1 = 0xa
Jul 8 08:34:28 shrimp /kernel: sd2(ahc0:2:0): Queueing an Abort SCB
Jul 8 08:34:28 shrimp /kernel: sd2(ahc0:2:0): SCB 0x2 timedout while recovery in progress
Jul 8 08:34:28 shrimp /kernel: sd2(ahc0:2:0): SCB 0x0 - timed out while idle, LASTPHASE == 0x1, SCSISIGI == 0x8
Jul 8 08:34:28 shrimp /kernel: SEQADDR = 0x5 SCSISEQ = 0x5a SSTAT0 = 0x5 SSTAT1 = 0xa
Jul 8 08:34:28 shrimp /kernel: sd2(ahc0:2:0): no longer in timeout
Jul 8 08:34:28 shrimp /kernel: ahc0: Issued Channel A Bus Reset. 2 SCBs aborted
Jul 8 08:34:28 shrimp /kernel: sd2(ahc0:2:0): UNIT ATTENTION asc:29,0
Jul 8 08:34:28 shrimp /kernel: sd2(ahc0:2:0): Power on, reset, or bus device reset occurred
Jul 8 08:34:28 shrimp /kernel: , retries:3
Jul 8 08:34:28 shrimp /kernel: ahc0:A:0: refuses WIDE negotiation. Using 8bit transfers
Jul 8 08:34:28 shrimp /kernel: sd0(ahc0:0:0): UNIT ATTENTION asc:29,0
Jul 8 08:34:28 shrimp /kernel: sd0(ahc0:0:0): Power on, reset, or bus device reset occurred
Jul 8 08:34:28 shrimp /kernel: , retries:4
Jul 8 08:34:33 shrimp /kernel: sd1(ahc0:1:0): UNIT ATTENTION asc:29,2
Jul 8 08:34:33 shrimp /kernel: , retries:4

At this point, I'm getting fairly frustrated. :-(

I should mention, also, that when it hung & died yesterday, the SCSI probe at boot failed to detect the existence of the CD drive. A power-on-reset seems to "cure" that -- though the controller's failure to see the device (absent the POR) is troubling, to say the least.

And the whimpers always occur in conjunction with activity on sd2 (one of the wide devices on ahc0).

I'm wondering if it *might* be reasonable to change the topology somewhat -- to connect the internal narrow SCSI devices to the internal connector of ahc1 (vs. ahc0). I've already changed the SCSI target IDs on the external HP disk drives to 2 & 3, so there is no target ID duplication among the narrow devices. The intent here is to:

* try to isolate the failure somewhat;

* allow ahc0 to use wide transfers, thus improving performance (assuming it doesn't crash & burn). :-(

I would think that if I do this -- assuming it even makes sense to try it -- I'll need to "wire down" some of the devices in the kernel, since the boot drive is one of the narrow devices that would be moved.

And is there a way to designate which controller would be ahc0 vs. ahc1? Is it related to the slot in which the cards are inserted on the system board?

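The observation above, that the timeouts always involve sd2, can be checked mechanically against /var/log/messages. What follows is a minimal Python sketch, not a tested diagnostic tool: the regular expression is an assumption derived only from the kernel messages quoted in this post, and the names `TIMEOUT_RE`, `tally_timeouts`, and `SAMPLE` are hypothetical.

```python
import re
from collections import Counter

# Assumed pattern, based only on the messages quoted in this post, e.g.
#   "... /kernel: sd2(ahc0:2:0): SCB 0x3 - timed out while idle, ..."
#   "... /kernel: sd2(ahc0:2:0): SCB 0x2 timedout while recovery in progress"
TIMEOUT_RE = re.compile(
    r"/kernel: (\w+)\((ahc\d+):(\d+):(\d+)\): SCB \S+ .*timed ?out"
)


def tally_timeouts(lines):
    """Count SCB timeout messages per (device, controller, target)."""
    counts = Counter()
    for line in lines:
        match = TIMEOUT_RE.search(line)
        if match:
            device, controller, target, _lun = match.groups()
            counts[(device, controller, int(target))] += 1
    return counts


# A few lines copied from the Jul 8 excerpt above.
SAMPLE = [
    "Jul 8 08:34:26 shrimp /kernel: sd2(ahc0:2:0): SCB 0x0 - timed out while idle, LASTPHASE == 0x1, SCSISIGI == 0x8",
    "Jul 8 08:34:28 shrimp /kernel: sd2(ahc0:2:0): Queueing an Abort SCB",
    "Jul 8 08:34:28 shrimp /kernel: sd2(ahc0:2:0): SCB 0x2 timedout while recovery in progress",
    "Jul 8 08:34:28 shrimp /kernel: sd0(ahc0:0:0): UNIT ATTENTION asc:29,0",
]

if __name__ == "__main__":
    for (device, controller, target), n in tally_timeouts(SAMPLE).items():
        print(f"{device} on {controller} target {target}: {n} timeout(s)")
```

To run it over the real log, pass the open file instead of the sample, for example `tally_timeouts(open("/var/log/messages"))`. A tally in which every entry names sd2 would support the suspicion about that drive or its leg of the bus, although it would not by itself distinguish a failing drive from a cabling or termination fault.
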
Of course, doing this puts a lot of devices on the narrow controller -- but the CD drive is almost never used, and no more than one tape drive is in use at any time (well, nearly always), and the tape drive that *is* in use isn't in use all that much -- it will be used for "amanda" backups, during (otherwise) comparatively quiescent periods. So that leaves 3 disk drives that would be fairly active on the bus.

And is there any software around that may be used to do any sort of hardware reality checking or validation or diagnostic procedures? (I'd vastly prefer something that doesn't require a Microsoft environment, since my previous (admittedly limited) experience with any and all such things has been very negative.)

Of course, the basic objective is to get this machine so it just quietly does its job; any hints or suggestions toward that end will be most appreciated.

Thanks,
david
--
David Wolfskill UNIX System Administrator
dhw@whistle.com voice: (650) 577-7158 pager: (650) 371-4621

Excerpt from /var/log/messages (most recent boot):

Jul 8 04:03:42 shrimp login: ROOT LOGIN (root) ON ttyv0
Jul 8 04:05:42 shrimp halt: halted by root
Jul 8 04:05:43 shrimp syslogd: exiting on signal 15
Jul 8 04:49:36 shrimp /kernel: Copyright (c) 1992-1998 FreeBSD Inc.
Jul 8 04:49:36 shrimp /kernel: Copyright (c) 1982, 1986, 1989, 1991, 1993
Jul 8 04:49:36 shrimp /kernel: The Regents of the University of California. All rights reserved.
Jul 8 04:49:36 shrimp /kernel:
Jul 8 04:49:36 shrimp /kernel: FreeBSD 2.2.6-RELEASE #0: Mon Apr 13 06:54:08 PDT 1998
Jul 8 04:49:36 shrimp /kernel: dhw@dhw-test1.whistle.com:/usr/src/sys/compile/SHRIMP
Jul 8 04:49:36 shrimp /kernel: CPU: Pentium Pro (199.31-MHz 686-class CPU)
Jul 8 04:49:36 shrimp /kernel: Origin = "GenuineIntel" Id = 0x617 Stepping=7
Jul 8 04:49:36 shrimp /kernel: Features=0xf9ff
Jul 8 04:49:36 shrimp /kernel: real memory = 33554432 (32768K bytes)
Jul 8 04:49:36 shrimp /kernel: avail memory = 30621696 (29904K bytes)
Jul 8 04:49:36 shrimp /kernel: Probing for devices on PCI bus 0:
Jul 8 04:49:36 shrimp /kernel: chip0 rev 2 on pci0:0:0
Jul 8 04:49:36 shrimp /kernel: chip1 rev 1 on pci0:1:0
Jul 8 04:49:36 shrimp /kernel: chip2 rev 0 on pci0:1:1
Jul 8 04:49:36 shrimp /kernel: pci0:1:2: Intel Corporation, device=0x7020, class=serial, subclass=0x03 int d irq 12 [no driver assigned]
Jul 8 04:49:36 shrimp /kernel: ahc0 rev 0 int a irq 12 on pci0:9:0
Jul 8 04:49:36 shrimp /kernel: ahc0: aic7880 Wide Channel, SCSI Id=7, 16 SCBs
Jul 8 04:49:36 shrimp /kernel: ahc0 waiting for scsi devices to settle
Jul 8 04:49:36 shrimp /kernel: ahc0:A:0: refuses WIDE negotiation. Using 8bit transfers
Jul 8 04:49:36 shrimp /kernel: (ahc0:0:0): "QUANTUM FIREBALL1080S 1Q09" type 0 fixed SCSI 2
Jul 8 04:49:36 shrimp /kernel: sd0(ahc0:0:0): Direct-Access 1042MB (2134305 512 byte sectors)
Jul 8 04:49:36 shrimp /kernel: (ahc0:1:0): "Quantum XP31070W L912" type 0 fixed SCSI 2
Jul 8 04:49:36 shrimp /kernel: sd1(ahc0:1:0): Direct-Access 1075MB (2203480 512 byte sectors)
Jul 8 04:49:36 shrimp /kernel: (ahc0:2:0): "MICROP 4691WS T171" type 0 fixed SCSI 2
Jul 8 04:49:36 shrimp /kernel: sd2(ahc0:2:0): Direct-Access 8681MB (17780058 512 byte sectors)
Jul 8 04:49:36 shrimp /kernel: (ahc0:6:0): "PLEXTOR CD-ROM PX-6XCS 2.06" type 5 removable SCSI 2
Jul 8 04:49:36 shrimp /kernel: cd0(ahc0:6:0): CD-ROM can't get the size
Jul 8 04:49:36 shrimp /kernel: de0 rev 18 int a irq 10 on pci0:10:0
Jul 8 04:49:36 shrimp /kernel: de0: SMC 9332DST 21140 [10-100Mb/s] pass 1.2
Jul 8 04:49:36 shrimp /kernel: de0: address 00:00:c0:7f:11:ed
Jul 8 04:49:36 shrimp /kernel: de0: enabling 100baseTX port
Jul 8 04:49:36 shrimp /kernel: ahc1 rev 3 int a irq 11 on pci0:11:0
Jul 8 04:49:36 shrimp /kernel: ahc1: aic7870 Single Channel, SCSI Id=7, 16 SCBs
Jul 8 04:49:36 shrimp /kernel: ahc1 waiting for scsi devices to settle
Jul 8 04:49:36 shrimp /kernel: (ahc1:2:0): "HP C3725S 6039" type 0 fixed SCSI 2
Jul 8 04:49:36 shrimp /kernel: sd3(ahc1:2:0): Direct-Access 2047MB (4194058 512 byte sectors)
Jul 8 04:49:36 shrimp /kernel: (ahc1:3:0): "HP C3725S 6039" type 0 fixed SCSI 2
Jul 8 04:49:36 shrimp /kernel: sd4(ahc1:3:0): Direct-Access 2047MB (4194058 512 byte sectors)
Jul 8 04:49:36 shrimp /kernel: (ahc1:4:0): "HP C1533A 9406" type 1 removable SCSI 2
Jul 8 04:49:36 shrimp /kernel: st0(ahc1:4:0): Sequential-Access density code 0x24, variable blocks, write-enabled
Jul 8 04:49:36 shrimp /kernel: (ahc1:5:0): "HP C1533A 9503" type 1 removable SCSI 2
Jul 8 04:49:36 shrimp /kernel: st1(ahc1:5:0): Sequential-Access density code 0x24, variable blocks, write-enabled
Jul 8 04:49:36 shrimp /kernel: vga0 rev 1 int a irq 12 on pci0:13:0
Jul 8 04:49:36 shrimp /kernel: Probing for devices on the ISA bus:
Jul 8 04:49:36 shrimp /kernel: sc0 at 0x60-0x6f irq 1 on motherboard
Jul 8 04:49:36 shrimp /kernel: sc0: VGA color <16 virtual consoles, flags=0x0>
Jul 8 04:49:36 shrimp /kernel: sio0 at 0x3f8-0x3ff irq 4 on isa
Jul 8 04:49:36 shrimp /kernel: sio0: type 16550A
Jul 8 04:49:36 shrimp /kernel: sio1 at 0x2f8-0x2ff irq 3 on isa
Jul 8 04:49:36 shrimp /kernel: sio1: type 16550A
Jul 8 04:49:36 shrimp /kernel: lpt0 at 0x378-0x37f irq 7 on isa
Jul 8 04:49:36 shrimp /kernel: lpt0: Interrupt-driven port
Jul 8 04:49:36 shrimp /kernel: lp0: TCP/IP capable interface
Jul 8 04:49:36 shrimp /kernel: fdc0 at 0x3f0-0x3f7 irq 6 drq 2 on isa
Jul 8 04:49:36 shrimp /kernel: fdc0: FIFO enabled, 8 bytes threshold
Jul 8 04:49:36 shrimp /kernel: fd0: 1.44MB 3.5in
Jul 8 04:49:36 shrimp /kernel: bt0 not found at 0x330
Jul 8 04:49:36 shrimp /kernel: npx0 flags 0x1 on motherboard
Jul 8 04:49:36 shrimp /kernel: npx0: INT 16 interface
Jul 8 04:49:36 shrimp /kernel: changing root device to st0s1a