From owner-freebsd-stable Wed Apr 4 11:49:12 2001 Delivered-To: freebsd-stable@freebsd.org Received: from boromir.vpop.net (dns1.vpop.net [206.117.147.2]) by hub.freebsd.org (Postfix) with ESMTP id 2255F37B71F; Wed, 4 Apr 2001 11:48:56 -0700 (PDT) (envelope-from mreimer@vpop.net) Received: from vpop.net ([209.102.16.48]) by boromir.vpop.net (8.11.2/8.11.2) with ESMTP id f34Imln72065; Wed, 4 Apr 2001 11:48:47 -0700 (PDT) (envelope-from mreimer@vpop.net) Message-ID: <3ACB6C90.750FF100@vpop.net> Date: Wed, 04 Apr 2001 11:48:48 -0700 From: Matthew Reimer Organization: VPOP Technologies, Inc. X-Mailer: Mozilla 4.76 [en] (X11; U; Linux 2.2.12 i386) X-Accept-Language: en MIME-Version: 1.0 To: stable@freebsd.org, scsi@freebsd.org Subject: Possibly SCSI-related hang Content-Type: text/plain; charset=us-ascii Content-Transfer-Encoding: 7bit Sender: owner-freebsd-stable@FreeBSD.ORG Precedence: bulk X-Loop: FreeBSD.ORG One of our boxes has been hanging, apparently in connection with some SCSI parity errors which appear in the logs before the hangs. This machine is running -stable from around March 4, with two disks da1 and da2 comprising a vinum mirror. The parity errors all pertain to da2, which is on its own controller: ahc0: port 0x1400-0x14ff mem 0xf4100000-0xf4100fff irq 11 at device 12.0 on pci0 aic7896/97: Wide Channel A, SCSI Id=7, 32/255 SCBs ahc1: port 0x1800-0x18ff mem 0xf4101000-0xf4101fff irq 11 at device 12.1 on pci0 aic7896/97: Wide Channel B, SCSI Id=7, 32/255 SCBs fxp0: port 0x1080-0x10bf mem 0xf4000000-0xf40fffff,0xf4102000-0xf4102fff irq 10 at device 14.0 on pci0 fxp0: Ethernet address 00:a0:c9:fb:26:e6 ahc2: port 0x2000-0x20ff mem 0xf4103000-0xf4103fff irq 5 at device 16.0 on pci0 aic7870: Wide Channel A, SCSI Id=7, 16/255 SCBs ... sa0 at ahc2 bus 0 target 0 lun 0 sa0: Removable Sequential Access SCSI-2 device sa0: 10.000MB/s transfers (10.000MHz, offset 15) da1 at ahc0 bus 0 target 1 lun 0 da1: Fixed Direct Access SCSI-3 device da1: 40.000MB/s transfers (20.000MHz, offset 31, 16bit), Tagged Queueing Enabled da1: 35003MB (71687340 512 byte sectors: 255H 63S/T 4462C) da0 at ahc0 bus 0 target 0 lun 0 da0: Fixed Direct Access SCSI-3 device da0: 40.000MB/s transfers (20.000MHz, offset 31, 16bit), Tagged Queueing Enabled da0: 8748MB (17916240 512 byte sectors: 255H 63S/T 1115C) ch0 at ahc2 bus 0 target 1 lun 0 ch0: Removable Changer SCSI-2 device ch0: 3.300MB/s transfers ch0: 8 slots, 1 drive, 1 picker, 0 portals da2 at ahc1 bus 0 target 2 lun 0 da2: Fixed Direct Access SCSI-3 device da2: 40.000MB/s transfers (20.000MHz, offset 31, 16bit), Tagged Queueing Enabled da2: 35003MB (71687340 512 byte sectors: 255H 63S/T 4462C) When the box hangs, we are able to break into the debugger; below are two traces I got while the machine was hung (trace, continue, trace): db> trace siointr1(c164d400,ce08df64,c020ef46,c164d400,ce080010) at siointr1+0xf2 siointr(c164d400,ce080010,1887,2,2) at siointr+0xb Xfastintr4(c163f400,81) at Xfastintr4+0x16 ahc_platform_intr(c163f400,6c0820,4faf8,bfbffdc0,bfbfc78c) at ahc_platform_intr+ 0x11e intr_mux(c0a35960,0,2f,2f,2f) at intr_mux+0x1d Xresume11() at Xresume11+0x2b --- interrupt, eip = 0x2828836f, esp = 0xce08dfe0, ebp = 0xbfbfc78c --- db> trace siointr1(c164d400,ce08df18,c020ef46,c164d400,10) at siointr1+0xf2 siointr(c164d400,10,183e,27,0) at siointr+0xb Xfastintr4(c163f400,ce08df48) at Xfastintr4+0x16 ahc_handle_seqint(c163f400,81) at ahc_handle_seqint+0x19 ahc_platform_intr(c163f400,6c0820,4faf8,bfbffdc0,bfbfc78c) at ahc_platform_intr+ 0x11e intr_mux(c0a35960,0,2f,2f,2f) at intr_mux+0x1d Xresume11() at Xresume11+0x2b --- interrupt, eip = 0x2828836f, esp = 0xce08dfe0, ebp = 0xbfbfc78c Are these traces consistent with the recently fixed vinum deadlock? Below are the errors that appeared in the logs before the hangs: Apr 2 09:06:39 ring /kernel: (da2:ahc1:0:2:0): WRITE(06). CDB: a 5 1 88 10 0 Apr 2 09:06:39 ring /kernel: (da2:ahc1:0:2:0): ABORTED COMMAND asc:47,0 Apr 2 09:06:39 ring /kernel: (da2:ahc1:0:2:0): SCSI parity error Apr 2 09:06:41 ring /kernel: (da2:ahc1:0:2:0): WRITE(10). CDB: 2a 0 2 4b 3 b8 0 0 10 0 Apr 2 09:06:41 ring /kernel: (da2:ahc1:0:2:0): ABORTED COMMAND asc:47,0 Apr 2 09:06:41 ring /kernel: (da2:ahc1:0:2:0): SCSI parity error Apr 2 09:06:42 ring /kernel: (da2:ahc1:0:2:0): WRITE(10). CDB: 2a 0 0 bd 1 a8 0 0 10 0 Apr 2 09:06:42 ring /kernel: (da2:ahc1:0:2:0): ABORTED COMMAND asc:47,0 Apr 2 09:06:42 ring /kernel: (da2:ahc1:0:2:0): SCSI parity error Apr 2 09:07:10 ring /kernel: (da2:ahc1:0:2:0): WRITE(10). CDB: 2a 0 0 c2 1 88 0 0 10 0 Apr 2 09:07:10 ring /kernel: (da2:ahc1:0:2:0): ABORTED COMMAND asc:47,0 Apr 2 09:07:10 ring /kernel: (da2:ahc1:0:2:0): SCSI parity error Apr 2 09:07:11 ring /kernel: (da2:ahc1:0:2:0): WRITE(10). CDB: 2a 0 1 28 1 e8 0 0 10 0 Apr 2 09:07:11 ring /kernel: (da2:ahc1:0:2:0): ABORTED COMMAND asc:47,0 Apr 2 09:07:11 ring /kernel: (da2:ahc1:0:2:0): SCSI parity error Apr 2 09:07:13 ring /kernel: (da2:ahc1:0:2:0): WRITE(10). CDB: 2a 0 0 bd 3b 38 0 0 80 0 Apr 2 09:07:13 ring /kernel: (da2:ahc1:0:2:0): ABORTED COMMAND asc:47,0 Apr 2 09:07:13 ring /kernel: (da2:ahc1:0:2:0): SCSI parity error Apr 2 09:07:24 ring /kernel: (da2:ahc1:0:2:0): WRITE(10). CDB: 2a 0 0 bd 3c 88 0 0 80 0 Apr 2 09:07:24 ring /kernel: (da2:ahc1:0:2:0): ABORTED COMMAND asc:47,0 Apr 2 09:07:24 ring /kernel: (da2:ahc1:0:2:0): SCSI parity error Apr 2 09:50:03 ring /kernel: (da2:ahc1:0:2:0): WRITE(10). CDB: 2a 0 0 38 1 88 0 0 10 0 Apr 2 09:50:03 ring /kernel: (da2:ahc1:0:2:0): ABORTED COMMAND asc:47,0 Apr 2 09:50:03 ring /kernel: (da2:ahc1:0:2:0): SCSI parity error Apr 2 09:56:04 ring /kernel: (da2:ahc1:0:2:0): parity error detected in Data-in phase. SEQADDR(0x8a) SCSI Apr 2 09:56:05 ring /kernel: (da2:ahc1:0:2:0): READ(10). CDB: 28 0 0 bc 4a 82 0 0 4 0 Apr 2 09:56:05 ring /kernel: (da2:ahc1:0:2:0): ABORTED COMMAND asc:48,0 Apr 2 09:56:05 ring /kernel: (da2:ahc1:0:2:0): Initiator detected error message received Apr 2 10:01:04 ring /kernel: (da2:ahc1:0:2:0): parity error detected in Data-in phase. SEQADDR(0x89) SCSI Apr 2 10:01:04 ring /kernel: (da2:ahc1:0:2:0): READ(10). CDB: 28 0 2 e8 3e 48 0 0 80 0 Apr 2 10:01:04 ring /kernel: (da2:ahc1:0:2:0): ABORTED COMMAND asc:48,0 Apr 2 10:01:04 ring /kernel: (da2:ahc1:0:2:0): Initiator detected error message received Apr 2 12:19:40 ring /kernel: (da2:ahc1:0:2:0): WRITE(10). CDB: 2a 0 3 89 1 d8 0 0 10 0 Apr 2 12:19:40 ring /kernel: (da2:ahc1:0:2:0): ABORTED COMMAND asc:47,0 Apr 2 12:19:40 ring /kernel: (da2:ahc1:0:2:0): SCSI parity error Apr 2 12:25:03 ring /kernel: (da2:ahc1:0:2:0): WRITE(10). CDB: 2a 0 2 ce 1 e8 0 0 10 0 Apr 2 12:25:03 ring /kernel: (da2:ahc1:0:2:0): ABORTED COMMAND asc:47,0 Apr 2 12:25:03 ring /kernel: (da2:ahc1:0:2:0): SCSI parity error Apr 2 12:27:34 ring /kernel: (da2:ahc1:0:2:0): WRITE(10). CDB: 2a 0 0 bd 3e 48 0 0 80 0 Apr 2 12:27:34 ring /kernel: (da2:ahc1:0:2:0): ABORTED COMMAND asc:47,0 Apr 2 12:27:34 ring /kernel: (da2:ahc1:0:2:0): SCSI parity error At this point I told vinum not to use da2 any more (vinum stop -f fatmirror.p1.s0); predictably, since then no parity errors have occurred and the kernel has not hung. I ran "dd bs=65536 if=/dev/da2e of=/dev/null" which completed successfully, reading all 36G with no errors. Questions: 1. Should the kernel hang because of parity errors? 2. What is causing the hang: a bug in the SCSI code, or vinum, or ...? 3. Do parity errors indicate SCSI bus problems? 4. Might these parity errors indicate that da2 is bad? 5. Is it possible for a bad SCSI device to affect other devices on the bus? About eight months ago da2 used to be on the same controller with da0 and da1, but da0 and da1 started having strange problems ("swap_pager: indefinite wait buffer" on da0 where swap is, and "timed out while idle" on da1). Installing a new LVD cable didn't help, but moving da2 to its own bus did (but now we get parity errors on da2's bus). Thanks for any help you can give. Matt To Unsubscribe: send mail to majordomo@FreeBSD.org with "unsubscribe freebsd-stable" in the body of the message