From owner-freebsd-questions  Tue Nov 23 19:58:45 1999
Delivered-To: freebsd-questions@freebsd.org
Received: from blockhead.mincom.com (blockhead1.mincom.com [203.55.175.241])
	by hub.freebsd.org (Postfix) with ESMTP id 2273714C8A
	for <questions@freebsd.org>; Tue, 23 Nov 1999 19:58:32 -0800 (PST)
	(envelope-from philh@mincom.com)
Received: (from uucp@localhost)
	by blockhead.mincom.com (8.9.3/8.9.3) id NAA95425
	for <questions@freebsd.org>; Wed, 24 Nov 1999 13:57:39 +1000 (EST)
	(envelope-from philh@mincom.com)
Received: from porthole.mincom.oz.au(172.17.100.2)
 via SMTP by blockhead.mincom.oz.au, id smtpdy95419; Wed Nov 24 13:57:33 1999
Received: (from philh@localhost)
	by porthole.mincom.oz.au (8.8.8/8.8.5) id NAA21842
	for questions@freebsd.org; Wed, 24 Nov 1999 13:57:32 +1000 (EST)
Date: Wed, 24 Nov 1999 13:57:32 +1000
From: Phil Homewood <philh@mincom.com>
To: questions@freebsd.org
Subject: AHC parity errors, timeouts - -STABLE
Message-ID: <19991124135732.E23235@mincom.com>
Mime-Version: 1.0
Content-Type: text/plain; charset=us-ascii
X-Mailer: Mutt 0.95.5i
Sender: owner-freebsd-questions@FreeBSD.ORG
Precedence: bulk
X-Loop: FreeBSD.ORG

Anyone know of any changes to the SCSI/CAM/AHC code in -STABLE
in the last two weeks that may cause devices or the bus to go
out to lunch?

I'm in the middle of deploying a swag of near-identical boxes,
and the latest one died last night with the console displaying:

ahc0: Data Parity Error Detected during address or write data phase
(da0:ahc0:0:0:0): SCB 0x3 - timed out while idle, LASTPHASE == 0x1,
SEQADDR == 0x8
(da0:ahc0:0:0:0): SCB 3: Immediate reset.  Flags = 0x4040
(da0:ahc0:0:0:0): no longer in timeout, status = 34b
ahc0: Issued Channel A Bus Reset. 64 SCBs aborted

This is repeatable under heavy I/O (find / -print, make buildworld).
After the error appears, the machine locks solid - needs a
poke in the eye to recover.

This machine was only installed (3.3-RELEASE from CD) 5 days ago,
and has already been through the cvsup-buildworld-portsinstall
phase once without a glitch; the problems started the day after the
-STABLE buildworld. I don't see any obvious changes (at least
pci/ahc_pci.c hasn't been touched), but my understanding of which
bits of code come into the picture is a little limited.

My bet is on a flaky disk, but I'd like to hear if anyone else is
suffering this or has any ideas before I send the disk back. :-)

dmesg relevant bits:

ahc0: <Adaptec 2940 Ultra2 SCSI adapter> rev 0x00 int a irq 15 on pci0.19.0
ahc0: aic7890/91 Wide Channel A, SCSI Id=7, 16/255 SCBs
...
da0: <QUANTUM ATLAS IV 9 WLS 0808> Fixed Direct Access SCSI-3 device 
da0: 80.000MB/s transfers (40.000MHz, offset 31, 16bit), Tagged Queueing Enabled
da0: 8761MB (17942584 512 byte sectors: 255H 63S/T 1116C)
da1 at ahc0 bus 0 target 1 lun 0
da1: <QUANTUM ATLAS IV 9 WLS 0808> Fixed Direct Access SCSI-3 device 
da1: 80.000MB/s transfers (40.000MHz, offset 31, 16bit), Tagged Queueing Enabled
da1: 8761MB (17942584 512 byte sectors: 255H 63S/T 1116C)
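
As a sanity check on the probe output above, the reported figures can be
reproduced with a little arithmetic. This is only a minimal Python sketch:
the function names are made up for illustration, and it assumes that "MB"
in the size line means 2^20 bytes and that the burst rate is simply the
sync rate times the bus width in bytes.

```python
# Figures copied from the dmesg lines above.
SECTORS = 17942584
SECTOR_BYTES = 512
HEADS, SECTORS_PER_TRACK = 255, 63
SYNC_MHZ = 40.0
BUS_WIDTH_BITS = 16


def size_mb(sectors, sector_bytes=SECTOR_BYTES):
    """Capacity in whole MB, assuming 1 MB = 2**20 bytes."""
    return sectors * sector_bytes // (1024 * 1024)


def cylinders(sectors, heads=HEADS, spt=SECTORS_PER_TRACK):
    """Whole cylinders for the reported heads/sectors-per-track geometry."""
    return sectors // (heads * spt)


def burst_mb_per_s(mhz, width_bits):
    """Assumed burst rate: one transfer of width_bits per sync clock."""
    return mhz * (width_bits // 8)


if __name__ == "__main__":
    print(size_mb(SECTORS), "MB")
    print(cylinders(SECTORS), "cylinders")
    print(burst_mb_per_s(SYNC_MHZ, BUS_WIDTH_BITS), "MB/s")
```

Under those assumptions the three values agree with the 8761MB, 1116C and
80.000MB/s that the kernel printed, so the drives negotiated what they
should have.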

"Auto-terminate" is enabled on the card. The drives are connected to the
U2/LVD/SE connector; the other connectors on the card are unused.
There is a terminator on the cable end. The cable is not too long (around
1200 mm by my reckoning, proper twisted LVD cabling).

Ideas, anyone?
-- 
Phil Homewood             DNRC          email: philh@mincom.com
Postmaster and BOFH
Mincom Ltd                              phone:  +61-7-3303-3524 
Brisbane, QLD Australia                 fax:    +61-7-3303-3269

