From owner-freebsd-scsi Mon Oct 14 08:02:50 1996
Return-Path: owner-freebsd-scsi
Received: (from root@localhost) by freefall.freebsd.org (8.7.5/8.7.3) id IAA02213 for freebsd-scsi-outgoing; Mon, 14 Oct 1996 08:02:50 -0700 (PDT)
Received: from Octopussy (Octopussy.MI.Uni-Koeln.DE [134.95.212.20]) by freefall.freebsd.org (8.7.5/8.7.3) with SMTP id IAA02205 for ; Mon, 14 Oct 1996 08:02:33 -0700 (PDT)
Received: from x14.mi.uni-koeln.de (annexr3-13.slip.Uni-Koeln.DE) by Octopussy with SMTP id AA22289 (5.67b/IDA-1.5 for ); Mon, 14 Oct 1996 17:02:07 +0200
Received: (from se@localhost) by x14.mi.uni-koeln.de (8.7.6/8.6.9) id RAA00816; Mon, 14 Oct 1996 17:01:53 +0200 (MET DST)
Message-Id: <199610141501.RAA00816@x14.mi.uni-koeln.de>
Date: Mon, 14 Oct 1996 17:01:53 +0200
From: se@zpr.uni-koeln.de (Stefan Esser)
To: taob@io.org (Brian Tao)
Cc: freebsd-scsi@freebsd.org (FREEBSD-SCSI-L)
Subject: Re: Wonky controller or drive?
In-Reply-To: ; from Brian Tao on Oct 13, 1996 23:20:14 -0400
References:
X-Mailer: Mutt 0.45
Mime-Version: 1.0
Sender: owner-freebsd-scsi@freebsd.org
X-Loop: FreeBSD.org
Precedence: bulk

Brian Tao writes:
> I added a new 4GB drive into our Web/FTP server three days ago
> (Thursday morning, Oct 10), and I've been seeing regular panics and
> crashes since then. The kernel messages seem to suggest mostly
> otherwise.
>
> The server has an NCR 53c810 SCSI controller and an SMC 10/100
> Mbps Ethernet controller. The new drive in question is a Quantum
> Atlas, at ncr0:1:0. There is also a 1GB Seagate Medallist (sd0), two
> 4GB Quantum Grand Prix drives and three other 4GB Quantum Atlas
> drives. All the drives have ARRE and AWRE turned on.

Hmm, adding the 7th drive caused problems ???
I guess this is the largest disk capacity that ever
got connected to a single 53c810 ...
:)

In order to understand what's wrong, I'd like to know whether these
drives are internal or in an external case (and with their own power
supplies), and the length of the SCSI bus cable (not only to external
boxes, but also within them).

I expect this to be caused by either a cable that is too long (for the
transfer rate) or a problem with the power supplies.

The 4GB drives need some 15W each under load, but temporary peaks may
be much higher and may in fact occur on multiple drives simultaneously,
depending on how you spread the file systems. The power will be drawn
continuously on the 5V line, but there may be significant current peaks
on the 12V line. I'd suggest a power supply that delivers 2A at 12V
per 4GB drive. If all of them are connected to a single PS, then a
total of 8A at 12V might be sufficient.

> (ncr0:0:0): "SEAGATE ST51080N 0913" type 0 fixed SCSI 2
> (ncr0:1:0): "Quantum XP34300 81HB" type 0 fixed SCSI 2

> The first crash came Thursday evening. It looks like the
> controller itself failed, but the kernel panic that followed
> immediately after seemed to happen in _tcp_fasttimo. What does "CCB
> already dequeued" mean?
>
> ncr0: SCSI phase error fixup: CCB already dequeued (0xf3452800)
> ncr0: restart (ncr dead ?).
> sd5(ncr0:5:0): error code 114
> , retries:3

No, that's probably not a controller failure, but a lost SCSI ACK.
The "CCB already dequeued" message is a secondary effect, after some
code at interrupt level cancelled a SCSI command that was taking too
long.

> I didn't get any panic messages for the second crash, but it
> looked like the VM system was unable to read pages back into physical
> memory:
>
> ncr0: SCSI phase error fixup: CCB already dequeued (0xf3452600)
> ncr0: restart (ncr dead ?).

Same problem as above: The SCSI bus appears to be locked, and no
device makes any progress anymore ...
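
As a rough aid for the 12V estimate given earlier in this reply, the
arithmetic can be sketched as follows. This is only a sketch under
stated assumptions: the 2A-per-drive figure is the suggestion from the
text, the count of six 4GB drives is taken from Brian's description,
and the function names are made up for illustration. The 1GB Seagate
is left out because no figure for it appears in this thread.

```python
# Rough 12V budget check for the 4GB drives described in this thread.
# Assumption: 2 A at 12 V per 4GB drive (the per-drive figure suggested
# in the text). All names below are hypothetical, for illustration only.

AMPS_12V_PER_4GB_DRIVE = 2.0  # suggested per-drive figure
NUM_4GB_DRIVES = 6            # 1 new Atlas + 2 Grand Prix + 3 other Atlas


def worst_case_12v_amps(num_drives, amps_per_drive=AMPS_12V_PER_4GB_DRIVE):
    """Current needed if every drive hits its 12V peak at the same moment."""
    return num_drives * amps_per_drive


def drives_covered_at_peak(supply_amps, amps_per_drive=AMPS_12V_PER_4GB_DRIVE):
    """How many simultaneous per-drive peaks a supply of this size can cover."""
    return int(supply_amps // amps_per_drive)


if __name__ == "__main__":
    need = worst_case_12v_amps(NUM_4GB_DRIVES)
    print(f"all {NUM_4GB_DRIVES} drives peaking at once: {need:.1f} A at 12 V")
    print(f"an 8 A supply covers {drives_covered_at_peak(8.0)} simultaneous peaks")
```

Under these assumptions an 8A supply covers four drives peaking at the
same time, so it is sufficient only as long as the peaks of all six
drives do not coincide.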
> A few more instances of "Power on, reset, or bus device reset
> occurred" appeared, as well as a couple of "Unrecovered read errors"
> on the new Quantum, despite having ARRE enabled. The third crash in

These read errors most probably indicate that the data has not been
transferred to the NCR completely.

> two days was actually in _tulip_rx_intr, if you believe the
> instruction pointer info in the panic message:
>
> ncr0: SCSI phase error fixup: CCB already dequeued (0xf3452400)
> ncr0: restart (ncr dead ?).
> panic: free: multiple frees

This might have been a random coincidence.
But I'll check whether I find anything that might explain this.

> syncing disks...
> Fatal trap 12: page fault while in kernel mode

I do not think that this is directly related to the NCR driver ...
It rather looks like some kernel data structures got corrupted.

> fault virtual address = 0x39c00000
> fault code = supervisor read, page not present
> instruction pointer = 0x8:0xf017961b
> stack pointer = 0x10:0xefbff9c4
> frame pointer = 0x10:0xefbff9f0
> code segment = base 0x0, limit 0xfffff, type 0x1b
> = DPL 0, pres 1, def32 1, gran 1
> processor eflags = interrupt enabled, resume, IOPL = 0
> current process = 107 (nfsd)
> interrupt mask = net

> That happened twice so far, in _tulip_rx_intr. The most recent
> crash was definitely related to the new Quantum:
>
> assertion "cp" failed: file "../../pci/ncr.c", line 5543
> sd1(ncr0:1:0): COMMAND FAILED (4 28) @f3452800.

A SCSI error occurred, but no command control block could be
identified for the current command. This may happen in special
circumstances if some command gets terminated. I'll think about a
more descriptive error message in order to understand what actually
happened.

> So my question is, bad controller or bad drive? This server,
> which was very stable before I put in the new drive, seems to be
> having trouble with both its disk and network components?
> I don't have another spare 4GB drive to swap in, and it's the long
> weekend in Canada. :( Could a marginally bad drive cause all these
> problems?

Well, my first guess would be the SCSI cable being too long (or not
good enough) or the peak load on the power supply being too high.

You can check the former by using only slow transfers (async. or at
most 5MB/s sync).

If the power supply is at its limit, then you should be able to cause
failures by increasing the seek rate (i.e. do random seeks with little
data actually being transferred).

Regards, Stefan
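
The random-seek test suggested above could be sketched roughly as
follows. This is a minimal sketch, not a tested diagnostic tool: the
function name, its parameters and the 512-byte block size are
assumptions made for illustration. It opens the target read-only and
reads one small block at each of many random offsets, so the drive
does a lot of head movement while very little data crosses the bus.

```python
import os
import random
import time


def random_seek_test(path, seeks=1000, block_size=512, size=None, seed=None):
    """Read one small block at `seeks` random offsets of `path`.

    Returns (ok, errors, seconds). Hypothetical helper for illustration.
    `size` may be given explicitly, since seeking to the end of a device
    node does not report its capacity on every system.
    """
    rng = random.Random(seed)
    fd = os.open(path, os.O_RDONLY)  # read-only: never writes to the target
    try:
        if size is None:
            size = os.lseek(fd, 0, os.SEEK_END)
        blocks = size // block_size
        if blocks < 1:
            raise ValueError("target is smaller than one block")
        ok = errors = 0
        start = time.monotonic()
        for _ in range(seeks):
            # Pick a block-aligned offset anywhere on the target.
            offset = rng.randrange(blocks) * block_size
            try:
                os.lseek(fd, offset, os.SEEK_SET)
                data = os.read(fd, block_size)
            except OSError:
                errors += 1
                continue
            if len(data) == block_size:
                ok += 1
            else:
                errors += 1  # short read counts as a failure
        return ok, errors, time.monotonic() - start
    finally:
        os.close(fd)
```

To load the power supply, one would run a call like
`random_seek_test(path_to_disk, seeks=100000)` against several drives
at the same time and watch the console for the errors quoted above.
Reads that are served from a cache cause no seeks, so the target
should be much larger than main memory, or a raw device.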