From owner-freebsd-stable  Wed Apr  4 11:49:12 2001
Delivered-To: freebsd-stable@freebsd.org
Received: from boromir.vpop.net (dns1.vpop.net [206.117.147.2])
	by hub.freebsd.org (Postfix) with ESMTP
	id 2255F37B71F; Wed,  4 Apr 2001 11:48:56 -0700 (PDT)
	(envelope-from mreimer@vpop.net)
Received: from vpop.net ([209.102.16.48])
	by boromir.vpop.net (8.11.2/8.11.2) with ESMTP id f34Imln72065;
	Wed, 4 Apr 2001 11:48:47 -0700 (PDT)
	(envelope-from mreimer@vpop.net)
Message-ID: <3ACB6C90.750FF100@vpop.net>
Date: Wed, 04 Apr 2001 11:48:48 -0700
From: Matthew Reimer <mreimer@vpop.net>
Organization: VPOP Technologies, Inc.
X-Mailer: Mozilla 4.76 [en] (X11; U; Linux 2.2.12 i386)
X-Accept-Language: en
MIME-Version: 1.0
To: stable@freebsd.org, scsi@freebsd.org
Subject: Possibly SCSI-related hang
Content-Type: text/plain; charset=us-ascii
Content-Transfer-Encoding: 7bit
Sender: owner-freebsd-stable@FreeBSD.ORG
Precedence: bulk
X-Loop: FreeBSD.ORG

One of our boxes has been hanging, apparently in connection with some
SCSI parity errors which appear in the logs before the hangs. This
machine is running -stable from around March 4, with two disks da1 and
da2 comprising a vinum mirror. The parity errors all pertain to da2,
which is on its own controller:

ahc0: <Adaptec aic7896/97 Ultra2 SCSI adapter> port 0x1400-0x14ff mem
0xf4100000-0xf4100fff irq 11 at device 12.0 on pci0
aic7896/97: Wide Channel A, SCSI Id=7, 32/255 SCBs
ahc1: <Adaptec aic7896/97 Ultra2 SCSI adapter> port 0x1800-0x18ff mem
0xf4101000-0xf4101fff irq 11 at device 12.1 on pci0
aic7896/97: Wide Channel B, SCSI Id=7, 32/255 SCBs
fxp0: <Intel Pro 10/100B/100+ Ethernet> port 0x1080-0x10bf mem
0xf4000000-0xf40fffff,0xf4102000-0xf4102fff irq 10 at device 14.0 on
pci0
fxp0: Ethernet address 00:a0:c9:fb:26:e6
ahc2: <Adaptec 2940 SCSI adapter> port 0x2000-0x20ff mem
0xf4103000-0xf4103fff irq 5 at device 16.0 on pci0
aic7870: Wide Channel A, SCSI Id=7, 16/255 SCBs
...
sa0 at ahc2 bus 0 target 0 lun 0
sa0: <Quantum DLT4000 D069> Removable Sequential Access SCSI-2 device 
sa0: 10.000MB/s transfers (10.000MHz, offset 15)
da1 at ahc0 bus 0 target 1 lun 0
da1: <IBM DMVS36V 0100> Fixed Direct Access SCSI-3 device 
da1: 40.000MB/s transfers (20.000MHz, offset 31, 16bit), Tagged Queueing
Enabled
da1: 35003MB (71687340 512 byte sectors: 255H 63S/T 4462C)
da0 at ahc0 bus 0 target 0 lun 0
da0: <IBM DNES-309170W SA30> Fixed Direct Access SCSI-3 device 
da0: 40.000MB/s transfers (20.000MHz, offset 31, 16bit), Tagged Queueing
Enabled
da0: 8748MB (17916240 512 byte sectors: 255H 63S/T 1115C)
ch0 at ahc2 bus 0 target 1 lun 0
ch0: <HP C6280-4000 2.04> Removable Changer SCSI-2 device 
ch0: 3.300MB/s transfers
ch0: 8 slots, 1 drive, 1 picker, 0 portals
da2 at ahc1 bus 0 target 2 lun 0
da2: <IBM DMVS36V 0100> Fixed Direct Access SCSI-3 device 
da2: 40.000MB/s transfers (20.000MHz, offset 31, 16bit), Tagged Queueing
Enabled
da2: 35003MB (71687340 512 byte sectors: 255H 63S/T 4462C)


When the box hangs, we are able to break into the debugger; below are
two traces I got while the machine was hung (trace, continue, trace):

db> trace
siointr1(c164d400,ce08df64,c020ef46,c164d400,ce080010) at siointr1+0xf2
siointr(c164d400,ce080010,1887,2,2) at siointr+0xb
Xfastintr4(c163f400,81) at Xfastintr4+0x16
ahc_platform_intr(c163f400,6c0820,4faf8,bfbffdc0,bfbfc78c) at
ahc_platform_intr+
0x11e
intr_mux(c0a35960,0,2f,2f,2f) at intr_mux+0x1d
Xresume11() at Xresume11+0x2b
--- interrupt, eip = 0x2828836f, esp = 0xce08dfe0, ebp = 0xbfbfc78c ---

db> trace
siointr1(c164d400,ce08df18,c020ef46,c164d400,10) at siointr1+0xf2
siointr(c164d400,10,183e,27,0) at siointr+0xb
Xfastintr4(c163f400,ce08df48) at Xfastintr4+0x16
ahc_handle_seqint(c163f400,81) at ahc_handle_seqint+0x19
ahc_platform_intr(c163f400,6c0820,4faf8,bfbffdc0,bfbfc78c) at
ahc_platform_intr+
0x11e
intr_mux(c0a35960,0,2f,2f,2f) at intr_mux+0x1d
Xresume11() at Xresume11+0x2b
--- interrupt, eip = 0x2828836f, esp = 0xce08dfe0, ebp = 0xbfbfc78c 

Are these traces consistent with the recently fixed vinum deadlock?

Below are the errors that appeared in the logs before the hangs:

Apr  2 09:06:39 ring /kernel: (da2:ahc1:0:2:0): WRITE(06). CDB: a 5 1 88
10 0 
Apr  2 09:06:39 ring /kernel: (da2:ahc1:0:2:0): ABORTED COMMAND asc:47,0
Apr  2 09:06:39 ring /kernel: (da2:ahc1:0:2:0): SCSI parity error
Apr  2 09:06:41 ring /kernel: (da2:ahc1:0:2:0): WRITE(10). CDB: 2a 0 2
4b 3 b8 0 0 10 0 
Apr  2 09:06:41 ring /kernel: (da2:ahc1:0:2:0): ABORTED COMMAND asc:47,0
Apr  2 09:06:41 ring /kernel: (da2:ahc1:0:2:0): SCSI parity error
Apr  2 09:06:42 ring /kernel: (da2:ahc1:0:2:0): WRITE(10). CDB: 2a 0 0
bd 1 a8 0 0 10 0 
Apr  2 09:06:42 ring /kernel: (da2:ahc1:0:2:0): ABORTED COMMAND asc:47,0
Apr  2 09:06:42 ring /kernel: (da2:ahc1:0:2:0): SCSI parity error
Apr  2 09:07:10 ring /kernel: (da2:ahc1:0:2:0): WRITE(10). CDB: 2a 0 0
c2 1 88 0 0 10 0 
Apr  2 09:07:10 ring /kernel: (da2:ahc1:0:2:0): ABORTED COMMAND asc:47,0
Apr  2 09:07:10 ring /kernel: (da2:ahc1:0:2:0): SCSI parity error
Apr  2 09:07:11 ring /kernel: (da2:ahc1:0:2:0): WRITE(10). CDB: 2a 0 1
28 1 e8 0 0 10 0 
Apr  2 09:07:11 ring /kernel: (da2:ahc1:0:2:0): ABORTED COMMAND asc:47,0
Apr  2 09:07:11 ring /kernel: (da2:ahc1:0:2:0): SCSI parity error
Apr  2 09:07:13 ring /kernel: (da2:ahc1:0:2:0): WRITE(10). CDB: 2a 0 0
bd 3b 38 0 0 80 0 
Apr  2 09:07:13 ring /kernel: (da2:ahc1:0:2:0): ABORTED COMMAND asc:47,0
Apr  2 09:07:13 ring /kernel: (da2:ahc1:0:2:0): SCSI parity error
Apr  2 09:07:24 ring /kernel: (da2:ahc1:0:2:0): WRITE(10). CDB: 2a 0 0
bd 3c 88 0 0 80 0 
Apr  2 09:07:24 ring /kernel: (da2:ahc1:0:2:0): ABORTED COMMAND asc:47,0
Apr  2 09:07:24 ring /kernel: (da2:ahc1:0:2:0): SCSI parity error

<hang, reboot>

Apr  2 09:50:03 ring /kernel: (da2:ahc1:0:2:0): WRITE(10). CDB: 2a 0 0
38 1 88 0 0 10 0 
Apr  2 09:50:03 ring /kernel: (da2:ahc1:0:2:0): ABORTED COMMAND asc:47,0
Apr  2 09:50:03 ring /kernel: (da2:ahc1:0:2:0): SCSI parity error
Apr  2 09:56:04 ring /kernel: (da2:ahc1:0:2:0): parity error detected in
Data-in phase. SEQADDR(0x8a) SCSI
Apr  2 09:56:05 ring /kernel: (da2:ahc1:0:2:0): READ(10). CDB: 28 0 0 bc
4a 82 0 0 4 0 
Apr  2 09:56:05 ring /kernel: (da2:ahc1:0:2:0): ABORTED COMMAND asc:48,0
Apr  2 09:56:05 ring /kernel: (da2:ahc1:0:2:0): Initiator detected error
message received
Apr  2 10:01:04 ring /kernel: (da2:ahc1:0:2:0): parity error detected in
Data-in phase. SEQADDR(0x89) SCSI
Apr  2 10:01:04 ring /kernel: (da2:ahc1:0:2:0): READ(10). CDB: 28 0 2 e8
3e 48 0 0 80 0 
Apr  2 10:01:04 ring /kernel: (da2:ahc1:0:2:0): ABORTED COMMAND asc:48,0
Apr  2 10:01:04 ring /kernel: (da2:ahc1:0:2:0): Initiator detected error
message received

<hang, reboot>

Apr  2 12:19:40 ring /kernel: (da2:ahc1:0:2:0): WRITE(10). CDB: 2a 0 3
89 1 d8 0 0 10 0 
Apr  2 12:19:40 ring /kernel: (da2:ahc1:0:2:0): ABORTED COMMAND asc:47,0
Apr  2 12:19:40 ring /kernel: (da2:ahc1:0:2:0): SCSI parity error
Apr  2 12:25:03 ring /kernel: (da2:ahc1:0:2:0): WRITE(10). CDB: 2a 0 2
ce 1 e8 0 0 10 0 
Apr  2 12:25:03 ring /kernel: (da2:ahc1:0:2:0): ABORTED COMMAND asc:47,0
Apr  2 12:25:03 ring /kernel: (da2:ahc1:0:2:0): SCSI parity error
Apr  2 12:27:34 ring /kernel: (da2:ahc1:0:2:0): WRITE(10). CDB: 2a 0 0
bd 3e 48 0 0 80 0 
Apr  2 12:27:34 ring /kernel: (da2:ahc1:0:2:0): ABORTED COMMAND asc:47,0
Apr  2 12:27:34 ring /kernel: (da2:ahc1:0:2:0): SCSI parity error

At this point I told vinum not to use da2 any more (vinum stop -f
fatmirror.p1.s0); predictably, since then no parity errors have occurred
and the kernel has not hung.

I ran "dd bs=65536 if=/dev/da2e of=/dev/null" which completed
successfully, reading all 36G with no errors.

Questions:

1. Should the kernel hang because of parity errors?

2. What is causing the hang: a bug in the SCSI code, or vinum, or ...?

3. Do parity errors indicate SCSI bus problems?

4. Might these parity errors indicate that da2 is bad?

5. Is it possible for a bad SCSI device to affect other devices on the
bus? About eight months ago da2 used to be on the same controller with
da0 and da1, but da0 and da1 started having strange problems
("swap_pager: indefinite wait buffer" on da0 where swap is, and "timed
out while idle" on da1). Installing a new LVD cable didn't help, but
moving da2 to its own bus did (but now we get parity errors on da2's
bus).

Thanks for any help you can give.

Matt

To Unsubscribe: send mail to majordomo@FreeBSD.org
with "unsubscribe freebsd-stable" in the body of the message