From owner-freebsd-scsi  Mon Oct 14 08:02:50 1996
Return-Path: owner-freebsd-scsi
Received: (from root@localhost)
          by freefall.freebsd.org (8.7.5/8.7.3) id IAA02213
          for freebsd-scsi-outgoing; Mon, 14 Oct 1996 08:02:50 -0700 (PDT)
Received: from Octopussy (Octopussy.MI.Uni-Koeln.DE [134.95.212.20])
          by freefall.freebsd.org (8.7.5/8.7.3) with SMTP id IAA02205
          for <freebsd-scsi@freebsd.org>; Mon, 14 Oct 1996 08:02:33 -0700 (PDT)
Received: from x14.mi.uni-koeln.de (annexr3-13.slip.Uni-Koeln.DE) by Octopussy with SMTP id AA22289
  (5.67b/IDA-1.5 for <freebsd-scsi@freebsd.org>); Mon, 14 Oct 1996 17:02:07 +0200
Received: (from se@localhost) by x14.mi.uni-koeln.de (8.7.6/8.6.9) id RAA00816; Mon, 14 Oct 1996 17:01:53 +0200 (MET DST)
Message-Id: <199610141501.RAA00816@x14.mi.uni-koeln.de>
Date: Mon, 14 Oct 1996 17:01:53 +0200
From: se@zpr.uni-koeln.de (Stefan Esser)
To: taob@io.org (Brian Tao)
Cc: freebsd-scsi@freebsd.org (FREEBSD-SCSI-L)
Subject: Re: Wonky controller or drive?
In-Reply-To: <Pine.NEB.3.92.961013224954.12078B-100000@zap.io.org>; from Brian Tao on Oct 13, 1996 23:20:14 -0400
References: <Pine.NEB.3.92.961013224954.12078B-100000@zap.io.org>
X-Mailer: Mutt 0.45
Mime-Version: 1.0
Sender: owner-freebsd-scsi@freebsd.org
X-Loop: FreeBSD.org
Precedence: bulk

Brian Tao writes:
>     I added a new 4GB drive into our Web/FTP server three days ago
> (Thursday morning, Oct 10), and I've been seeing regular panics and
> crashes since then.  The kernel messages seem to suggest mostly
> otherwise.
> 
>     The server has an NCR 53c810 SCSI controller and an SMC 10/100
> Mbps Ethernet controller.  The new drive in question is a Quantum
> Atlas, at ncr0:1:0.  There is also a 1GB Seagate Medallist (sd0), two
> 4GB Quantum Grand Prix drives and three other 4GB Quantum Atlas
> drives.  All the drives have ARRE and AWRE turned on.

Hmm, adding the 7th drive caused problems ???

I guess this is the largest disk capacity that 
ever got connected to a single 53c810 ... :)

In order to understand what's wrong, I'd like
to know whether these drives are internal or
in an external case (with their own power
supplies), and the total length of the SCSI
bus cable (not only to external boxes, but
also within them).

I expect this to be caused by either a cable
that is too long (for the transfer rate) or a
problem with the power supplies. The 4GB drives
need some 15W each under load, but temporary
peaks may be much higher and may in fact occur
on multiple drives simultaneously, depending on
how you spread the file systems. The power
will be drawn continuously on the 5V line, but
there may be significant current peaks on the
12V line. I'd suggest a power supply that
delivers 2A at 12V per 4GB drive. If all of
them are connected to a single PS, then a
total of 8A at 12V might be sufficient.
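
To put rough numbers on the 12V budget, here is a minimal sketch in C.
It assumes the 2A peak per 4GB drive mentioned above; the function
names are hypothetical and only serve as illustration, and real drives
should be checked against their data sheets.

```c
/* Rough 12V budget check. All figures are assumptions, not measurements. */

/* Worst case: every drive draws its peak 12V current at the same time. */
double worst_case_12v_amps(int drives, double peak_amps_per_drive)
{
    return drives * peak_amps_per_drive;
}

/* Returns 1 if the supply's 12V rating covers that worst case, else 0. */
int supply_covers_worst_case(double supply_amps, int drives,
                             double peak_amps_per_drive)
{
    return supply_amps >= worst_case_12v_amps(drives, peak_amps_per_drive);
}
```

With the six 4GB drives listed, worst_case_12v_amps(6, 2.0) gives 12A,
so an 8A supply is only enough as long as no more than four of them
hit their peak at the same moment.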

> (ncr0:0:0): "SEAGATE ST51080N 0913" type 0 fixed SCSI 2
> (ncr0:1:0): "Quantum XP34300 81HB" type 0 fixed SCSI 2

>     The first crash came Thursday evening.  It looks like the
> controller itself failed, but the kernel panic that followed
> immediately after seemed to happen in _tcp_fasttimo.  What does "CCB
> already dequeued" mean?
> 
> ncr0: SCSI phase error fixup: CCB already dequeued (0xf3452800)
> ncr0: restart (ncr dead ?).
> sd5(ncr0:5:0): error code 114
> , retries:3

No, that's probably not a controller failure,
but a lost SCSI ACK.

The "CCB already dequeued" message is a secondary
effect, after some code at interrupt level cancelled
a SCSI command that was taking too long.

>     I didn't get any panic messages for the second crash, but it
> looked like the VM system was unable to read pages back into physical
> memory:
> 
> ncr0: SCSI phase error fixup: CCB already dequeued (0xf3452600)
> ncr0: restart (ncr dead ?).

Same problem as above: The SCSI bus appears to 
be locked, and no device makes any progress
anymore ...

>     A few more instances of "Power on, reset, or bus device reset
> occurred" appeared, as well as a couple of "Unrecovered read errors"
> on the new Quantum, despite having ARRE enabled.  The third crash in

These read errors most probably indicate
that the data was not transferred to
the NCR completely.

> two days was actually in _tulip_rx_intr, if you believe the
> instruction pointer info in the panic message:
> 
> ncr0: SCSI phase error fixup: CCB already dequeued (0xf3452400)
> ncr0: restart (ncr dead ?).
> panic: free: multiple frees

This might have been a random coincidence.
But I'll check whether I can find anything
that might explain this.

> syncing disks...
> Fatal trap 12: page fault while in kernel mode

I do not think that this is directly related
to the NCR driver ...
It rather looks as if some kernel data structures
got corrupted.

> fault virtual address    = 0x39c00000
> fault code               = supervisor read, page not present
> instruction pointer      = 0x8:0xf017961b
> stack pointer            = 0x10:0xefbff9c4
> frame pointer            = 0x10:0xefbff9f0
> code segment             = base 0x0, limit 0xfffff, type 0x1b
>                  = DPL 0, pres 1, def32 1, gran 1
> processor eflags = interrupt enabled, resume, IOPL = 0
> current process          = 107 (nfsd)
> interrupt mask           = net

>     That happened twice so far, in _tulip_rx_intr.  The most recent
> crash was definitely related to the new Quantum:
> 
> assertion "cp" failed: file "../../pci/ncr.c", line 5543
> sd1(ncr0:1:0): COMMAND FAILED (4 28) @f3452800.

A SCSI error occurred, but no command control block
could be identified for the current command. This
may happen in special circumstances, if some command
gets terminated. I'll think about a more descriptive
error message in order to understand what actually
happened.

>     So my question is, bad controller or bad drive?  This server,
> which was very stable before I put in the new drive, seems to be
> having trouble with both its disk and network components?  I don't
> have another spare 4GB drive to swap in, and it's the long weekend in
> Canada.  :(  Could a marginally bad drive cause all these problems?

Well, my first guess would be the SCSI cable being 
too long (or not good enough) or the peak load on the
power supply being too high.

You can check the former by using only slow transfers
(async. or at most 5MB/s sync). If the power supply
is at its limit, then you should be able to cause 
failures by increasing the seek rate (i.e. do random
seeks with little data actually being transferred).
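
A seek test of this kind could look roughly like the following sketch
in C. It is not an existing tool: the function name, the 512-byte
sector size and the simple pseudo-random generator are assumptions
made for illustration. It only reads, so it should be safe to point
at a disk device, but try it on a plain file first.

```c
#include <fcntl.h>
#include <sys/types.h>
#include <unistd.h>

#define SECTOR 512   /* assumed sector size */

/*
 * Hypothetical helper: do `count` one-sector reads at random,
 * sector-aligned offsets within the first `span` bytes of `path`.
 * Returns the number of seeks/reads that failed, or -1 if the file
 * cannot be opened or `span` is smaller than one sector.
 */
long random_seek_reads(const char *path, off_t span, long count,
                       unsigned seed)
{
    char buf[SECTOR];
    unsigned long long state = seed;   /* state of a simple 64-bit LCG */
    off_t sectors = span / SECTOR;
    long failures = 0;
    int fd;

    if (sectors < 1)
        return -1;
    fd = open(path, O_RDONLY);
    if (fd < 0)
        return -1;

    for (long i = 0; i < count; i++) {
        off_t where;

        /* Advance the generator and use its upper bits. */
        state = state * 6364136223846793005ULL + 1442695040888963407ULL;
        where = (off_t)((state >> 33) % (unsigned long long)sectors) * SECTOR;

        /* Many seeks, very little data: one sector per request. */
        if (lseek(fd, where, SEEK_SET) == (off_t)-1 ||
            read(fd, buf, SECTOR) != SECTOR)
            failures++;
    }
    close(fd);
    return failures;
}
```

Running one such loop per drive at the same time, each over the whole
capacity of its disk, should keep all the actuators busy while moving
hardly any data over the bus. If errors show up under this load but
not during large sequential transfers, the power supply is the more
likely suspect.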

Regards, Stefan