Date: Sun, 13 Oct 1996 23:20:14 -0400 (EDT)
From: Brian Tao <taob@io.org>
To: FREEBSD-SCSI-L <freebsd-scsi@freebsd.org>
Subject: Wonky controller or drive?
Message-ID: <Pine.NEB.3.92.961013224954.12078B-100000@zap.io.org>

I added a new 4GB drive into our Web/FTP server three days ago (Thursday morning, Oct 10), and I've been seeing regular panics and crashes since then. The kernel messages seem to suggest mostly otherwise.

The server has an NCR 53c810 SCSI controller and an SMC 10/100 Mbps Ethernet controller. The new drive in question is a Quantum Atlas, at ncr0:1:0. There is also a 1GB Seagate Medallist (sd0), two 4GB Quantum Grand Prix drives and three other 4GB Quantum Atlas drives. All the drives have ARRE and AWRE turned on.

FreeBSD 2.2-960501-SNAP #0: Thu Jun 6 22:56:14 EDT 1996
taob@cabal.io.org:/usr/local/src/2.2-960501-SNAP/sys/compile/WWW
de0 <Digital DC21140 Fast Ethernet> rev 18 int a irq 12 on pci0:9
de0: DC21140 [10-100Mb/s] pass 1.2 Ethernet address 00:00:c0:39:41:c8
de0: enabling 10baseT UTP port
ncr0 <ncr 53c810 scsi> rev 2 int a irq 10 on pci0:11
ncr0 waiting for scsi devices to settle
(ncr0:0:0): "SEAGATE ST51080N 0913" type 0 fixed SCSI 2
sd0(ncr0:0:0): Direct-Access
sd0(ncr0:0:0): FAST SCSI-2 100ns (10 Mb/sec) offset 8.
1030MB (2109840 512 byte sectors)
sd0(ncr0:0:0): with 4826 cyls, 4 heads, and an average 109 sectors/track
(ncr0:1:0): "Quantum XP34300 81HB" type 0 fixed SCSI 2
sd1(ncr0:1:0): Direct-Access
sd1(ncr0:1:0): FAST SCSI-2 100ns (10 Mb/sec) offset 8.
4101MB (8399520 512 byte sectors)
sd1(ncr0:1:0): with 3907 cyls, 20 heads, and an average 107 sectors/track
[...]

The first crash came Thursday evening. It looks like the controller itself failed, but the kernel panic that followed immediately after seemed to happen in _tcp_fasttimo. What does "CCB already dequeued" mean?

ncr0: SCSI phase error fixup: CCB already dequeued (0xf3452800)
ncr0: restart (ncr dead ?).
sd5(ncr0:5:0): error code 114 , retries:3
sd0(ncr0:0:0): UNIT ATTENTION asc:29,0
sd0(ncr0:0:0): Power on, reset, or bus device reset occurred
0
current process = Idle
interrupt mask =
panic: page fault

syncing disks...
Fatal trap 12: page fault while in kernel mode
fault virtual address = 0x8c875
fault code = supervisor read, page not present
instruction pointer = 0x8:0xf0146613
stack pointer = 0x10:0xf01caccc
frame pointer = 0x10:0xf01cacd4
code segment = base 0x0, limit 0xfffff, type 0x1b
= DPL 0, pres 1, def32 1, gran 1
processor eflags = interrupt enabled, resume, IOPL = 0
current process = Idle
interrupt mask =
panic: page fault

Automatic reboot in 15 seconds - press a key on the console to abort
Rebooting...

I didn't get any panic messages for the second crash, but it looked like the VM system was unable to read pages back into physical memory:

ncr0: SCSI phase error fixup: CCB already dequeued (0xf3452600)
ncr0: restart (ncr dead ?).
sd1(ncr0:1:0): error code 0 , retries:2
sd0(ncr0:0:0): UNIT ATTENTION asc:29,0
sd0(ncr0:0:0): Power on, reset, or bus device reset occurred , retries:3
sd3(ncr0:3:0): UNIT ATTENTION asc:29,0
sd3(ncr0:3:0): Power on, reset, or bus device reset occurred sks:80,0 , retries:3
sd1(ncr0:1:0): FAST SCSI-2 100ns (10 Mb/sec) offset 8.
ncr0: restart (ncr dead ?).
sd3(ncr0:3:0): FAST SCSI-2 100ns (10 Mb/sec) offset 8.
sd0(ncr0:0:0): FAST SCSI-2 100ns (10 Mb/sec) offset 8.
sd0(ncr0:0:0): UNIT ATTENTION asc:29,0
sd0(ncr0:0:0): Power on, reset, or bus device reset occurred , retries:1
sd1(ncr0:1:0): FAST SCSI-2 100ns (10 Mb/sec) offset 8.
sd5(ncr0:5:0): FAST SCSI-2 100ns (10 Mb/sec) offset 8.
sd6(ncr0:6:0): FAST SCSI-2 100ns (10 Mb/sec) offset 8.
spec_getpages: I/O read error
vm_fault: pager input (probably hardware) error, PID 12849 failure
pid 12849 (imagemap), uid 9: exited on signal 11
spec_getpages: I/O read error
vm_fault: pager input (probably hardware) error, PID 12878 failure
pid 12878 (imagemap), uid 9: exited on signal 11
spec_getpages: I/O read error
vm_fault: pager input (probably hardware) error, PID 12923 failure
pid 12923 (imagemap), uid 9: exited on signal 11
spec_getpages: I/O read error
vm_fault: pager input (probably hardware) error, PID 12936 failure
pid 12936 (imagemap), uid 9: exited on signal 11

A few more instances of "Power on, reset, or bus device reset occurred" appeared, as well as a couple of "Unrecovered read errors" on the new Quantum, despite having ARRE enabled.

The third crash in two days was actually in _tulip_rx_intr, if you believe the instruction pointer info in the panic message:

ncr0: SCSI phase error fixup: CCB already dequeued (0xf3452400)
ncr0: restart (ncr dead ?).
panic: free: multiple frees

syncing disks...

Fatal trap 12: page fault while in kernel mode
fault virtual address = 0x39c00000
fault code = supervisor read, page not present
instruction pointer = 0x8:0xf017961b
stack pointer = 0x10:0xefbff9c4
frame pointer = 0x10:0xefbff9f0
code segment = base 0x0, limit 0xfffff, type 0x1b
= DPL 0, pres 1, def32 1, gran 1
processor eflags = interrupt enabled, resume, IOPL = 0
current process = 107 (nfsd)
interrupt mask = net
panic: page fault

Automatic reboot in 15 seconds - press a key on the console to abort
Rebooting...

That happened twice so far, in _tulip_rx_intr. The most recent crash was definitely related to the new Quantum:

assertion "cp" failed: file "../../pci/ncr.c", line 5543
sd1(ncr0:1:0): COMMAND FAILED (4 28) @f3452800.
assertion "cp" failed: file "../../pci/ncr.c", line 5543
sd1(ncr0:1:0): COMMAND FAILED (4 28) @f3452800.
assertion "cp" failed: file "../../pci/ncr.c", line 5543
sd1(ncr0:1:0): COMMAND FAILED (4 28) @f3452800.
assertion "cp" failed: file "../../pci/ncr.c", line 5543
sd1(ncr0:1:0): COMMAND FAILED (4 28) @f3452800.
assertion "cp" failed: file "../../pci/ncr.c", line 5543
sd1(ncr0:1:0): COMMAND FAILED (4 28) @f3452800.
assertion "cp" failed: file "../../pci/ncr.c", line 5543
sd1(ncr0:1:0): COMMAND FAILED (4 28) @f3452800.
10 10 10 10 10 10 10 10 10 10 10 10 10 10 10 10 10 10 10 10 giving up

Automatic reboot in 15 seconds - press a key on the console to abort
Rebooting...

So my question is, bad controller or bad drive? This server, which was very stable before I put in the new drive, seems to be having trouble with both its disk and network components. I don't have another spare 4GB drive to swap in, and it's the long weekend in Canada. :( Could a marginally bad drive cause all these problems?

--
Brian Tao (BT300, taob@io.org, taob@ican.net)
Senior Systems and Network Administrator, Internet Canada Corp.
"Though this be madness, yet there is method in't"