From owner-freebsd-scsi Mon Aug 23 7:16:52 1999
Delivered-To: freebsd-scsi@freebsd.org
Received: from ivory.lm.com (ivory.telerama.com [205.201.1.20]) by hub.freebsd.org (Postfix) with ESMTP id 49B7914C33 for ; Mon, 23 Aug 1999 07:16:37 -0700 (PDT) (envelope-from ncrawler@telerama.com)
Received: from gauntlet.telerama.com (ncrawler@gauntlet.telerama.com [205.201.1.214]) by ivory.lm.com (8.8.5/8.6.12) with SMTP id KAA29008; Mon, 23 Aug 1999 10:15:47 -0400 (EDT)
Date: Mon, 23 Aug 1999 10:15:46 -0400 (EDT)
From: Chris Tracy
To: freebsd-SCSI@freebsd.org
Cc: Chris Tracy
Subject: weird SCSI problems...
Message-ID:
MIME-Version: 1.0
Content-Type: TEXT/PLAIN; charset=US-ASCII
Sender: owner-freebsd-scsi@FreeBSD.ORG
Precedence: bulk
X-Loop: FreeBSD.org

Hiyah..

I'm not subscribed to the freebsd-SCSI mailing list, so if anyone has any info about this, it'd be great if you could CC me in...

Anyways... Here's the problem: I'm using an Intel Pentium II 350 MHz based machine with a BusLogic SCSI card in it. The machine used to run the 2.2.x branch; I recently wiped the drives and installed the 3.2-19990803-STABLE release. For the most part, the machine runs great. However, this is the machine that runs Amanda for us, and one day, while I was running Amanda, I got the following error messages right before the machine crashed... Here's a transcript of exactly what happened:

-----------------------
% amcheck lm
Amanda Tape Server Host Check
-----------------------------
/usr/home/holding-disk: 2178706 KB disk space available, using 2076306 KB.
NOTE: skipping tape-writable test.
Tape Telerama01 label ok.
Server check took 9.904 seconds.

Amanda Backup Client Hosts Check
--------------------------------
WARNING: kappa.webnz.net: selfcheck request timed out. Host down?
Client check: 9 hosts checked in 29.163 seconds, 1 problem found.

(brought to you by Amanda 2.4.1p1)
% amflush -f lm
Scanning /usr/home/holding-disk...
19990818: found non-empty Amanda directory.
Flushing dumps in 19990818, today: 19990818 to tape drive /dev/nrsa0.
Expecting tape Telerama01 or a new tape. (The last dumps were to tape Telerama10)
Are you sure you want to do this? y
driver: send-cmd time 0.011 to taper: START-TAPER 19990818
taper: pid 1390 executable taper version 2.4.1p1
taper: read label `Telerama01' date `19990718'
Aug 18 22:13:04 qbert /kernel: (da0:bt0:0:0:0): CCB 0xc5502f00 - timed out
Aug 18 22:13:04 qbert /kernel: (da0:bt0:0:0:0): CCB 0xc5502f00 - timed out
Aug 18 22:13:21 qbert /kernel: (da0:bt0:0:0:0): CCB 0xc5502f00 - timed out
Aug 18 22:13:21 qbert /kernel: bt0: No longer in timeout
Aug 18 22:13:21 qbert /kernel: (da0:bt0:0:0:0): CCB 0xc5502f00 - timed out
Aug 18 22:13:21 qbert /kernel: bt0: No longer in timeout
Aug 18 22:13:21 qbert /kernel: (sa0:bt0:0:2:0): WRITE(06). CDB: a 0 0 80 0 0
% Aug 18 22:13:21 qbert /kernel: (sa0:bt0:0:2:0): WRITE(06). CDB: a 0 0 80 0 0
Aug 18 22:13:21 qbert /kernel: (sa0:bt0:0:2:0): UNIT ATTENTION asc:29,0
Aug 18 22:13:21 qbert /kernel: (sa0:bt0:0:2:0): UNIT ATTENTION asc:29,0
Aug 18 22:13:21 qbert /kernel: (sa0:bt0:0:2:0): Power on, reset, or bus device reset occurred
Aug 18 22:13:21 qbert /kernel: (sa0:bt0:0:2:0): Power on, reset, or bus device reset occurred
%
% cd /usr/home/hold ^C
Cannot create dfWAA01394: Device not configured
queueup: cannot create data temp file dfWAA01394, uid=0: Device not configured
zsh: segmentation fault su operator
qbert#
qbert# Aug 18 22:14:21 qbert /kernel: (da0:bt0:0:0:0): CCB 0xc55024c0 - timed out
Aug 18 22:14:21 qbert /kernel: (da0:bt0:0:0:0): CCB 0xc55024c0 - timed out
Aug 18 22:15:01 qbert /kernel: (da0:bt0:0:0:0): CCB 0xc55024c0 - timed out
Aug 18 22:15:01 qbert /kernel: (da0:bt0:0:0:0): CCB 0xc55024c0 - timed out
Aug 18 22:15:01 qbert /kernel: bt0: No longer in timeout
Aug 18 22:15:01 qbert /kernel: bt0: No longer in timeout
Aug 18 22:15:01 qbert /kernel: (da0:bt0:0:0:0): Invalidating pack
Aug 18 22:15:01 qbert /kernel: (da0:bt0:0:0:0): Invalidating pack
Aug 18 22:15:01 qbert last message repeated 15 times
Aug 18 22:15:01 qbert /kernel: spec_getpages: I/O read failure: (error code=6)
Aug 18 22:15:01 qbert last message repeated 15 times
Aug 18 22:15:01 qbert /kernel: spec_getpages: I/O read failure: (error code=6)
Aug 18 22:15:01 qbert /kernel: size: 4096, resid: 4096, a_count: 4096, valid: 0x0
Aug 18 22:15:01 qbert /kernel: size: 4096, resid: 4096, a_count: 4096, valid: 0x
Aug 18 22:15:01 qbert /kernel: nread: 0, reqpage: 0, pindex: 57, pcount: 1
Aug 18 22:15:01 qbert /kernel: nread: 0, reqpage: 0, pindex: 57, pcount: 1
Aug 18 22:15:01 qbert /kernel: vm_fault: pager read error, pid 1383 (csh)
Aug 18 22:15:01 qbert /kernel: vm_fault: pager read error, pid 1383 (csh)
Aug 18 22:15:01 qbert /kernel: (da0:bt0:0:0:0): Invalidating pack
Aug 18 22:15:01 qbert /kernel: (da0:bt0:0:0:0): Invalidating pack
Aug 18 22:15:01 qbert sendmail[1401]: NOQUEUE: SYSERR(root): queuename: Cannot create "qfWAA01401" in "/var/spool/mqueue" (euid=0): Device not configured
Aug 18 22:15:01 qbert sendmail[1401]: NOQUEUE: SYSERR(root): queuename: Cannot create "qfWAA01401" in "/var/spool/mqueue" (euid=0): Device not configured
Aug 18 22:15:01 qbert sendmail[1394]: WAA01394: SYSERR(operator): Cannot create dfWAA01394: Device not configured
Aug 18 22:15:01 qbert sendmail[1394]: WAA01394: SYSERR(operator): Cannot create dfWAA01394: Device not configured
Aug 18 22:15:01 qbert sendmail[1394]: WAA01394: SYSERR(operator): queueup: cannot create data temp file dfWAA01394, uid=0: Device not configured
Aug 18 22:15:01 qbert sendmail[1394]: WAA01394: SYSERR(operator): queueup: cannot create data temp file dfWAA01394, uid=0: Device not configured
Aug 18 22:15:01 qbert sendmail[1394]: WAA01394: SYSERR(operator): queueup: cannot create data temp file dfWAA01394, uid=0: Device not configured
Aug 18 22:15:11 qbert sshd[1395]: log: ROOT LOGIN as 'root' from gauntlet.telerama.com
Aug 18 22:15:11 qbert /kernel: vm_fault: pager read error, pid 1403 (zsh)
Aug 18 22:15:11 qbert /kernel: vm_fault: pager read error, pid 1403 (zsh)
Aug 18 22:15:11 qbert sshd[1395]: fatal: Local: Command terminated on signal 11.
Aug 18 22:15:11 qbert sshd[1395]: fatal: Local: Command terminated on signal 11.
Aug 18 22:15:19 qbert sshd[1404]: log: ROOT LOGIN as 'root' from gauntlet.telerama.com
Aug 18 22:15:19 qbert /kernel: vm_fault: pager read error, pid 1406 (zsh)
Aug 18 22:15:19 qbert /kernel: vm_fault: pager read error, pid 1406 (zsh)
Aug 18 22:15:19 qbert sshd[1404]: fatal: Local: Command terminated on signal 11.
Aug 18 22:15:19 qbert sshd[1404]: fatal: Local: Command terminated on signal 11.
qbert#
qbert#
qbert#
qbert# shutdown -r now
zsh: Input/output error: shutdown
Aug 18 22:16:36 qbert /kernel: spec_getpages: I/O read failure: (error code=6)
qbert# Aug 18 22:16:36 qbert /kernel: spec_getpages: I/O read failure: (error code=6)
Aug 18 22:16:36 qbert /kernel: size: 65536, resid: 65536, a_count: 65536, valid: 0x0
Aug 18 22:16:36 qbert /kernel: size: 65536, resid: 65536, a_count: 65536, valid: 0x0
Aug 18 22:16:36 qbert /kernel: nread: 0, reqpage: 0, pindex: 0, pcount: 16
Aug 18 22:16:36 qbert /kernel: nread: 0, reqpage: 0, pindex: 0, pcount: 16
Aug 18 22:16:36 qbert /kernel: spec_getpages: I/O read failure: (error code=6)
Aug 18 22:16:36 qbert /kernel: spec_getpages: I/O read failure: (error code=6)
Aug 18 22:16:36 qbert /kernel: size: 65536, resid: 65536, a_count: 65536, valid: 0x0
Aug 18 22:16:36 qbert /kernel: size: 65536, resid: 65536, a_count: 65536, valid: 0x0
Aug 18 22:16:36 qbert /kernel: nread: 0, reqpage: 0, pindex: 0, pcount: 16
Aug 18 22:16:36 qbert /kernel: nread: 0, reqpage: 0, pindex: 0, pcount: 16
--------------------------

So basically it looks like some part of our SCSI bus is failing hardcore... Here are the results of the 'dmesg' command so everyone can see exactly how this machine's hardware is configured.....

--------------------------
Copyright (c) 1992-1999 FreeBSD Inc.
Copyright (c) 1982, 1986, 1989, 1991, 1993
The Regents of the University of California. All rights reserved.
FreeBSD 3.2-19990803-STABLE #0: Wed Aug 18 13:55:26 EDT 1999
root@qbert.telerama.com:/usr/src/sys/compile/QBERT
Timecounter "i8254" frequency 1193182 Hz
CPU: Pentium II (299.75-MHz 686-class CPU)
Origin = "GenuineIntel" Id = 0x634 Stepping = 4
Features=0x80fbff
real memory = 134217728 (131072K bytes)
config> di zp0
config> di ze0
config> di lnc0
config> di le0
config> di ie0
config> di fe0
config> di ep0
config> di ed0
config> di cs0
config> di wt0
config> di scd0
config> di mcd0
config> di matcdc0
config> di aha0
config> di adv0
config> q
avail memory = 126984192 (124008K bytes)
Preloaded elf kernel "kernel" at 0xc036b000.
Preloaded userconfig_script "/boot/kernel.conf" at 0xc036b09c.
Probing for devices on PCI bus 0:
chip0: rev 0x03 on pci0.0.0
chip1: rev 0x03 on pci0.1.0
chip2: rev 0x01 on pci0.7.0
ide_pci0: rev 0x01 on pci0.7.1
chip3: rev 0x01 on pci0.7.3
bt0: rev 0x08 int a irq 11 on pci0.11.0
bt0: BT-958 FW Rev. 5.07B Ultra Wide SCSI Host Adapter, SCSI ID 7, 192 CCBs
fxp0: rev 0x05 int a irq 9 on pci0.12.0
fxp0: Ethernet address 00:a0:c9:db:03:18
Probing for devices on PCI bus 1:
Probing for PnP devices:
Probing for devices on the ISA bus:
sc0 on isa
sc0: VGA color <16 virtual consoles, flags=0x0>
atkbdc0 at 0x60-0x6f on motherboard
atkbd0 irq 1 on isa
psm0 not found
sio0 at 0x3f8-0x3ff irq 4 flags 0x10 on isa
sio0: type 16550A
sio1 at 0x2f8-0x2ff irq 3 on isa
sio1: type 16550A
fdc0 at 0x3f0-0x3f7 irq 6 drq 2 on isa
fdc0: FIFO enabled, 8 bytes threshold
fd0: 1.44MB 3.5in
wdc0 not found at 0x1f0
wdc1 not found at 0x170
ppc0 at 0x378 irq 7 flags 0x40 on isa
ppc0: Generic chipset (ECP/PS2/NIBBLE) in COMPATIBLE mode
ppc0: FIFO with 16/16/8 bytes threshold
lpt0: on ppbus 0
lpt0: Interrupt-driven port
ppi0: on ppbus 0
plip0: on ppbus 0
ex0 not found
bt: unit number (1) too high
bt1 not found at 0x330
vga0 at 0x3b0-0x3df maddr 0xa0000 msize 131072 on isa
npx0 on motherboard
npx0: INT 16 interface
Waiting 15 seconds for SCSI devices to settle
sa0 at bt0 bus 0 target 2 lun 0
sa0: Removable Sequential Access SCSI-2 device
sa0: 10.000MB/s transfers (10.000MHz, offset 15)
da1 at bt0 bus 0 target 1 lun 0
da1: Fixed Direct Access SCSI-2 device
da1: 20.000MB/s transfers (10.000MHz, offset 15, 16bit), Tagged Queueing Enabled
da1: 4340MB (8888924 512 byte sectors: 255H 63S/T 553C)
da0 at bt0 bus 0 target 0 lun 0
da0: Fixed Direct Access SCSI-2 device
da0: 20.000MB/s transfers (10.000MHz, offset 15, 16bit), Tagged Queueing Enabled
da0: 4340MB (8888924 512 byte sectors: 255H 63S/T 553C)
changing root device to da0s1a
WARNING: / was not properly dismounted
-----

As you can see, this machine has 3 SCSI devices -- 0 and 1 are internal 4GB seagate barracudas, 2 is our external HP tape drive.

So anyways, I am pretty convinced it is either the card, or one of the SCSI devices, or maybe even a bug in the BusLogic SCSI driver (I doubt it, but who knows..heheh)?
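
The kernel messages in the transcript tag each device with a CAM path of the form (peripheral:controller:bus:target:lun), so (da0:bt0:0:0:0) is the disk at target 0 and (sa0:bt0:0:2:0) is the tape drive at target 2, matching the dmesg probe lines. For anyone sifting through a log like this one, here is a minimal sketch of tallying "timed out" messages per device. It assumes the syslog excerpt is available as a list of text lines; the function name `count_timeouts` and the regular expression are illustrative only and are not part of any FreeBSD tool.

```python
import re
from collections import Counter

# Assumed log shape: "... (periph:controller:bus:target:lun): message"
CAM_PATH = re.compile(r"\((\w+):(\w+):(\d+):(\d+):(\d+)\)")


def count_timeouts(lines):
    """Count 'timed out' messages per CAM path (hypothetical helper)."""
    counts = Counter()
    for line in lines:
        match = CAM_PATH.search(line)
        if match and "timed out" in line:
            periph, ctrl, bus, target, lun = match.groups()
            counts[(periph, ctrl, int(bus), int(target), int(lun))] += 1
    return counts


# A few lines copied from the transcript above.
sample = [
    "Aug 18 22:13:04 qbert /kernel: (da0:bt0:0:0:0): CCB 0xc5502f00 - timed out",
    "Aug 18 22:13:21 qbert /kernel: bt0: No longer in timeout",
    "Aug 18 22:13:21 qbert /kernel: (sa0:bt0:0:2:0): UNIT ATTENTION asc:29,0",
    "Aug 18 22:14:21 qbert /kernel: (da0:bt0:0:0:0): CCB 0xc55024c0 - timed out",
]

for path, n in count_timeouts(sample).items():
    print(path, n)
```

In the pasted transcript every timeout is reported against da0, while the tape drive only reports the UNIT ATTENTION that follows a reset. Note that the transcript shows most lines twice, so raw counts from it would be doubled.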
It seems as though something on the SCSI bus reset itself, from what I've seen in the error messages.. Could this be a termination problem? I've double-checked all of our termination, and it seems to be OK !?!? ... If anyone has any suggestions, even on things to try, I'd appreciate it!

FYI -- this particular problem has only happened once. The machine HAS been working OK since this happened, but I'm convinced it could happen again...

Like I said, I'm not subscribed to this list, so please CC me in any response.. I will check back on the mailing list anyways, even if I don't hear anything in e-mail.

Thanks in advance!!!

-Chris

To Unsubscribe: send mail to majordomo@FreeBSD.org with "unsubscribe freebsd-scsi" in the body of the message