Date:      Sun, 13 Oct 1996 23:20:14 -0400 (EDT)
From:      Brian Tao <taob@io.org>
To:        FREEBSD-SCSI-L <freebsd-scsi@freebsd.org>
Subject:   Wonky controller or drive?
Message-ID:  <Pine.NEB.3.92.961013224954.12078B-100000@zap.io.org>

    I added a new 4GB drive to our Web/FTP server three days ago
(Thursday morning, Oct 10), and I've been seeing regular panics and
crashes since then.  The new drive is the obvious suspect, but the
kernel messages mostly seem to suggest otherwise.

    The server has an NCR 53c810 SCSI controller and an SMC 10/100
Mbps Ethernet controller.  The new drive in question is a Quantum
Atlas at ncr0:1:0 (sd1).  There is also a 1GB Seagate Medalist (sd0),
two 4GB Quantum Grand Prix drives, and three other 4GB Quantum Atlas
drives.  All the drives have ARRE and AWRE turned on.

FreeBSD 2.2-960501-SNAP #0: Thu Jun  6 22:56:14 EDT 1996
taob@cabal.io.org:/usr/local/src/2.2-960501-SNAP/sys/compile/WWW

de0 <Digital DC21140 Fast Ethernet> rev 18 int a irq 12 on pci0:9
de0: DC21140 [10-100Mb/s] pass 1.2 Ethernet address 00:00:c0:39:41:c8
de0: enabling 10baseT UTP port
ncr0 <ncr 53c810 scsi> rev 2 int a irq 10 on pci0:11
ncr0 waiting for scsi devices to settle
(ncr0:0:0): "SEAGATE ST51080N 0913" type 0 fixed SCSI 2
sd0(ncr0:0:0): Direct-Access
sd0(ncr0:0:0): FAST SCSI-2 100ns (10 Mb/sec) offset 8.
1030MB (2109840 512 byte sectors)
sd0(ncr0:0:0): with 4826 cyls, 4 heads, and an average 109 sectors/track
(ncr0:1:0): "Quantum XP34300 81HB" type 0 fixed SCSI 2
sd1(ncr0:1:0): Direct-Access
sd1(ncr0:1:0): FAST SCSI-2 100ns (10 Mb/sec) offset 8.
4101MB (8399520 512 byte sectors)
sd1(ncr0:1:0): with 3907 cyls, 20 heads, and an average 107 sectors/track
[...]


    The first crash came Thursday evening.  It looks like the
controller itself failed, but the kernel panic that followed
immediately after seemed to happen in _tcp_fasttimo.  What does "CCB
already dequeued" mean?

ncr0: SCSI phase error fixup: CCB already dequeued (0xf3452800)
ncr0: restart (ncr dead ?).
sd5(ncr0:5:0): error code 114
, retries:3
sd0(ncr0:0:0): UNIT ATTENTION asc:29,0
sd0(ncr0:0:0):  Power on, reset, or bus device reset occurred
 0
current process          = Idle
interrupt mask           =
panic: page fault
syncing disks...
Fatal trap 12: page fault while in kernel mode
fault virtual address    = 0x8c875
fault code               = supervisor read, page not present
instruction pointer      = 0x8:0xf0146613
stack pointer            = 0x10:0xf01caccc
frame pointer            = 0x10:0xf01cacd4
code segment             = base 0x0, limit 0xfffff, type 0x1b
                 = DPL 0, pres 1, def32 1, gran 1
processor eflags = interrupt enabled, resume, IOPL = 0
current process          = Idle
interrupt mask           =
panic: page fault
Automatic reboot in 15 seconds - press a key on the console to abort
Rebooting...


    I didn't get any panic messages for the second crash, but it
looked like the VM system was unable to read pages back into physical
memory:

ncr0: SCSI phase error fixup: CCB already dequeued (0xf3452600)
ncr0: restart (ncr dead ?).
sd1(ncr0:1:0): error code 0
, retries:2
sd0(ncr0:0:0): UNIT ATTENTION asc:29,0
sd0(ncr0:0:0):  Power on, reset, or bus device reset occurred
, retries:3
sd3(ncr0:3:0): UNIT ATTENTION asc:29,0
sd3(ncr0:3:0):  Power on, reset, or bus device reset occurred sks:80,0
, retries:3
sd1(ncr0:1:0): FAST SCSI-2 100ns (10 Mb/sec) offset 8.
ncr0: restart (ncr dead ?).
sd3(ncr0:3:0): FAST SCSI-2 100ns (10 Mb/sec) offset 8.
sd0(ncr0:0:0): FAST SCSI-2 100ns (10 Mb/sec) offset 8.
sd0(ncr0:0:0): UNIT ATTENTION asc:29,0
sd0(ncr0:0:0):  Power on, reset, or bus device reset occurred
, retries:1
sd1(ncr0:1:0): FAST SCSI-2 100ns (10 Mb/sec) offset 8.
sd5(ncr0:5:0): FAST SCSI-2 100ns (10 Mb/sec) offset 8.
sd6(ncr0:6:0): FAST SCSI-2 100ns (10 Mb/sec) offset 8.
spec_getpages: I/O read error
vm_fault: pager input (probably hardware) error, PID 12849 failure
pid 12849 (imagemap), uid 9: exited on signal 11
spec_getpages: I/O read error
vm_fault: pager input (probably hardware) error, PID 12878 failure
pid 12878 (imagemap), uid 9: exited on signal 11
spec_getpages: I/O read error
vm_fault: pager input (probably hardware) error, PID 12923 failure
pid 12923 (imagemap), uid 9: exited on signal 11
spec_getpages: I/O read error
vm_fault: pager input (probably hardware) error, PID 12936 failure
pid 12936 (imagemap), uid 9: exited on signal 11


    A few more instances of "Power on, reset, or bus device reset
occurred" appeared, as well as a couple of "Unrecovered read errors"
on the new Quantum, even though ARRE is enabled on it.  The third
crash in two days was actually in _tulip_rx_intr, if you believe the
instruction pointer info in the panic message:

ncr0: SCSI phase error fixup: CCB already dequeued (0xf3452400)
ncr0: restart (ncr dead ?).
panic: free: multiple frees
syncing disks...
Fatal trap 12: page fault while in kernel mode
fault virtual address    = 0x39c00000
fault code               = supervisor read, page not present
instruction pointer      = 0x8:0xf017961b
stack pointer            = 0x10:0xefbff9c4
frame pointer            = 0x10:0xefbff9f0
code segment             = base 0x0, limit 0xfffff, type 0x1b
                 = DPL 0, pres 1, def32 1, gran 1
processor eflags = interrupt enabled, resume, IOPL = 0
current process          = 107 (nfsd)
interrupt mask           = net
panic: page fault
Automatic reboot in 15 seconds - press a key on the console to abort
Rebooting...
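
    For anyone wanting to check an instruction pointer against a
kernel symbol table, here is a minimal sketch of the lookup.  It
assumes input in the "address type name" form that `nm -n` prints;
the symbol names and addresses in SAMPLE_NM are made up for
illustration and are not taken from this kernel.

```python
import bisect


def parse_symbols(nm_output):
    """Parse 'address type name' lines into a sorted list of (address, name)."""
    symbols = []
    for line in nm_output.splitlines():
        parts = line.split()
        if len(parts) != 3:
            continue  # skip blank lines and undefined symbols (no address)
        addr, _kind, name = parts
        symbols.append((int(addr, 16), name))
    symbols.sort()
    return symbols


def enclosing_symbol(symbols, ip):
    """Return (name, offset) for the last symbol at or below ip, else None."""
    addrs = [addr for addr, _ in symbols]
    i = bisect.bisect_right(addrs, ip) - 1
    if i < 0:
        return None  # ip lies below every known symbol
    addr, name = symbols[i]
    return name, ip - addr


# Hypothetical symbol table: names and addresses are invented for this example.
SAMPLE_NM = """\
f0100000 T _example_start
f0100400 T _example_intr
f0100900 T _example_timeout
"""

symbols = parse_symbols(SAMPLE_NM)
name, offset = enclosing_symbol(symbols, 0xF0100613)
print(f"{name}+{offset:#x}")
```

    The nearest symbol at or below the address is only a best guess
at the faulting function; a large offset suggests the address falls
in code the symbol table does not describe.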


    That has happened twice so far, in _tulip_rx_intr.  The most
recent crash was definitely related to the new Quantum:

assertion "cp" failed: file "../../pci/ncr.c", line 5543
sd1(ncr0:1:0): COMMAND FAILED (4 28) @f3452800.
assertion "cp" failed: file "../../pci/ncr.c", line 5543
sd1(ncr0:1:0): COMMAND FAILED (4 28) @f3452800.
assertion "cp" failed: file "../../pci/ncr.c", line 5543
sd1(ncr0:1:0): COMMAND FAILED (4 28) @f3452800.
assertion "cp" failed: file "../../pci/ncr.c", line 5543
sd1(ncr0:1:0): COMMAND FAILED (4 28) @f3452800.
assertion "cp" failed: file "../../pci/ncr.c", line 5543
sd1(ncr0:1:0): COMMAND FAILED (4 28) @f3452800.
assertion "cp" failed: file "../../pci/ncr.c", line 5543
sd1(ncr0:1:0): COMMAND FAILED (4 28) @f3452800.
10 10 10 10 10 10 10 10 10 10 10 10 10 10 10 10 10 10 10 10 giving up
Automatic reboot in 15 seconds - press a key on the console to abort
Rebooting...
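
    To see whether the errors cluster on one drive or spread across
the whole bus, the console messages can be tallied per device.  The
following is a rough sketch only: the keyword list and the regular
expression are my own assumptions about the message layout, based on
the lines quoted in this message, and SAMPLE_LOG is a small excerpt
of them.

```python
import re
from collections import Counter

# Matches lines such as "sd1(ncr0:1:0): COMMAND FAILED (4 28) @f3452800."
DEVICE_RE = re.compile(r"^(sd\d+)\(ncr\d+:\d+:\d+\):\s*(.*)$")

# Assumed set of message fragments worth counting; extend as needed.
KEYWORDS = ("UNIT ATTENTION", "COMMAND FAILED", "error code")


def tally(log_text):
    """Count (device, keyword) pairs found in the console log text."""
    counts = Counter()
    for line in log_text.splitlines():
        match = DEVICE_RE.match(line.strip())
        if not match:
            continue  # not a per-device message
        device, message = match.groups()
        for keyword in KEYWORDS:
            if keyword in message:
                counts[(device, keyword)] += 1
    return counts


# Excerpt of lines quoted earlier in this message.
SAMPLE_LOG = """\
sd5(ncr0:5:0): error code 114
sd0(ncr0:0:0): UNIT ATTENTION asc:29,0
sd1(ncr0:1:0): error code 0
sd3(ncr0:3:0): UNIT ATTENTION asc:29,0
sd1(ncr0:1:0): COMMAND FAILED (4 28) @f3452800.
sd1(ncr0:1:0): FAST SCSI-2 100ns (10 Mb/sec) offset 8.
"""

for (device, keyword), count in sorted(tally(SAMPLE_LOG).items()):
    print(device, keyword, count)
```

    Errors on several targets after each "ncr0: restart" would point
at the bus or controller; errors confined to sd1 would point at the
new drive.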


    So my question is: bad controller or bad drive?  This server,
which was very stable before I put in the new drive, seems to be
having trouble with both its disk and network components.  I don't
have another spare 4GB drive to swap in, and it's the long weekend in
Canada.  :(  Could a marginally bad drive cause all these problems?
--
Brian Tao (BT300, taob@io.org, taob@ican.net)
Senior Systems and Network Administrator, Internet Canada Corp.
"Though this be madness, yet there is method in't"
