From owner-freebsd-bugs  Mon Jun 12 15:10:13 2000
Date: Mon, 12 Jun 2000 15:10:06 -0700 (PDT)
Message-Id: <200006122210.PAA24298@freefall.freebsd.org>
To: freebsd-bugs@FreeBSD.org
Cc: 
From: "Kenneth D. Merry" <ken@kdm.org>
Subject: Re: i386/19226: SCSI timeouts during heavy load
Reply-To: "Kenneth D. Merry" <ken@kdm.org>
Sender: owner-freebsd-bugs@FreeBSD.ORG
Precedence: bulk
X-Loop: FreeBSD.org

The following reply was made to PR i386/19226; it has been noted by GNATS.

From: "Kenneth D. Merry" <ken@kdm.org>
To: gij@jk.priv.no
Cc: FreeBSD-gnats-submit@FreeBSD.ORG
Subject: Re: i386/19226: SCSI timeouts during heavy load
Date: Mon, 12 Jun 2000 15:59:26 -0600

 [ Please make sure to CC any response to freebsd-gnats-submit@FreeBSD.ORG
 so your response makes it into the gnats database. ]
 
 On Mon, Jun 12, 2000 at 21:37:17 +0000, gij@jk.priv.no wrote:
 > 
 > >Number:         19226
 > >Category:       i386
 > >Synopsis:       SCSI timeouts during heavy load
 > >Confidential:   no
 > >Severity:       serious
 > >Priority:       high
 > >Responsible:    freebsd-bugs
 > >State:          open
 > >Quarter:        
 > >Keywords:       
 > >Date-Required:
 > >Class:          sw-bug
 > >Submitter-Id:   current-users
 > >Arrival-Date:   Mon Jun 12 14:40:01 PDT 2000
 > >Closed-Date:
 > >Last-Modified:
 > >Originator:     Geir Inge Jensen
 > >Release:        FreeBSD 4.0-STABLE i386
 > >Organization:
 > None, only personal opinions expressed.
 > >Environment:
 > 
 > Dell PowerEdge 2450 Dual 600MHz. Dell PowerVault 200S. Two AHA29160 
 > SCSI cards, both connected to the PowerVault.
 > 
 > 3 internal IBM DMVS 18GB disks. 8 external disks in the PowerVault
 > (same disks).
 > 
 > Relevant dmesg output:
 
 [ ... ]
 
 It would probably have been helpful to include the dmesg output from the
 disks as well, to get a better idea of the configuration.
 
 You've got two SCSI busses connected to the *same* array?  Is this
 controller a CMD OEM controller by any chance?
 
 > acd0: CDROM <TOSHIBA CD-ROM XM-7002B> at ata0-master using PIO4
 > pass2 at ahc2 bus 0 target 6 lun 0
 > pass2: <DELL 2x2 U2W SCSI BP 1.15> Fixed Processor SCSI-2 device
 > pass2: 3.300MB/s transfers
 > pass7 at ahc0 bus 0 target 15 lun 0
 > pass7: <Dell 8 BAY U2W CU 0203> Removable Processor SCSI-3 device
 > pass7: 3.300MB/s transfers
 > pass12 at ahc1 bus 0 target 15 lun 0
 > pass12: <Dell 8 BAY U2W CU 0203> Removable Processor SCSI-3 device
 > pass12: 3.300MB/s transfers
 > pass14 at ahc3 bus 0 target 6 lun 0
 > pass14: <DELL 2x2 U2W SCSI BP 1.15> Fixed Processor SCSI-2 device
 > pass14: 3.300MB/s transfers
 > 
 > >Description:
 > 
 > After a while, during heavy disk I/O, the following appears:
 > 
 > (da2:ahc0:0:0:0): SCB 0x33 - timed out while idle, SEQADDR == 0x157
 > (da2:ahc0:0:0:0): Queuing a BDR SCB
 > (da6:ahc1:0:8:0): SCB 0x7c - timed out while idle, SEQADDR == 0x157
 > (da6:ahc1:0:8:0): Queuing a BDR SCB
 > (da2:ahc0:0:0:0): SCB 0x33 - timed out while idle, SEQADDR == 0x157
 > (da2:ahc0:0:0:0): no longer in timeout, status = 34b
 > ahc0: Issued Channel A Bus Reset. 7 SCBs aborted
 > (da6:ahc1:0:8:0): SCB 0x7c - timed out while idle, SEQADDR == 0x157
 > (da6:ahc1:0:8:0): no longer in timeout, status = 34b
 > ahc1: Issued Channel A Bus Reset. 7 SCBs aborted
 > 
 > And so on. At this time, you don't have any contact with the PowerVault. 
 > Of course, the ccd freaks out with this:
 > 
 > ccd0: error 5 on component 0 block 80 (ccd block 64)
 
 The timeout messages indicate that, from the system's perspective, the
 array hasn't completed a read or write request in 60 seconds.  So we
 reset it in an attempt to wake it up.
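To see at a glance whether the timeouts hit one bus or several, the messages can be tallied per controller. The following is a minimal sketch, not part of the original report: the function name `count_timeouts` is hypothetical, and it assumes the log lines keep the `(daN:ahcN:...)` prefix shown in the PR.

```shell
#!/bin/sh
# count_timeouts: read kernel log text on stdin and print the number of
# "timed out" messages per controller (hypothetical helper, for illustration).
count_timeouts() {
    # Lines look like: (da2:ahc0:0:0:0): SCB 0x33 - timed out while idle, ...
    # With ":" as the field separator, field 2 is the controller name (ahc0).
    awk -F: '/timed out/ { n[$2]++ } END { for (c in n) print c, n[c] }' | sort
}
```

It could be fed from `dmesg` or a saved log file, for example `dmesg | count_timeouts`. Applied to the ten log lines quoted in the PR, it reports two timeouts each for ahc0 and ahc1.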
 
 > Notice that the error occurs on both buses at the same time! It can
 > take several hours before this happens. But we can reproduce it with
 > some patience and heavy load. The SCB's differ slightly from occasion
 > to occasion.
 
 > This is what we have tried to pinpoint the cause:
 > 
 >  - Replace all scsi cables.
 >  - Terminate the buses in the BIOS.
 >  - Replace the AHA29160's with other AHA29160's.
 >  - Replace the AHA29160's with AHA2940U2W's.
 >  - Replace the internal PCI bus the cards plug into (PCI tray).
 >  - Replace the ES Expander Modules in the PowerVault.
 >  - Replace the PowerVault.
 >  - Replace the PowerVault with a known good (and older revision) PowerVault 
 >    (we have several of these running on Dell PowerEdge 4350 with 
 >     3.3-STABLE on them). These older systems run fine.
 >  - Test with 4.0-STABLE UP kernel.
 >  - Test with 5.0-CURRENT UP kernel.
 >  - Keep both external SCSI cards, but use only one of them.
 >  - Remove one of the external SCSI cards, and use the internal 7899, 
 >    channel B, as well against the PowerVault (ie. two buses against it).
 >  - Running RedHat 6.2 with 2.2.14-5 kernel on the same system.
 >   
 > None of the above actions cured it. After some hours, it fails. Note that
 > the old PowerVault we tested from earlier systems contained other disks 
 > (Seagate and Quantum), which work fine under 3.3-STABLE.
 
 That's quite a lot of diagnosis.  Much better than most people who just say
 "it's broken". :)
 
 > From this testing, we have these conclusions:
 > 
 >  - There is nothing wrong with the PowerVault and the disk drives.
 >  - There is nothing wrong with the SCSI cards.
 > 
 > We also have some success stories:
 > 
 >  - Run the PowerVault from a single PCI card (ie. remove the other).
 >  - Run the PowerVault only from the internal 7899, channel B.
 
 In this configuration, did you have any other SCSI bus connected to the
 PowerVault?
 
 >  - linux-2.2.14-6.1.1 kernel (provided by Dell) with original HW setup.
 >  - linux-2.2.15 kernel with original HW setup.
 > 
 > To me, it sounds like a PCI problem (or maybe in the RCC LE chip). It
 > could also be a problem in the AIC7xxx driver, but it even failed with
 > the AHA2940U2W cards (which work fine in our 3.3 systems). But I am
 > only guessing here. However, Linux has obviously found a fix. 
 
 I kinda wonder if this RAID array may be a CMD OEM or something.
 
 CMD controllers have trouble when you have multiple LUNs on the same
 controller in use.  The symptoms are very similar to what you're
 describing.
 
 The two 'solutions' for a CMD controller are:
  - only use one LUN
  - disable tagged queueing for both LUNs (you can do this either from CMD's
    setup utility or from FreeBSD with camcontrol, or by putting a quirk
    entry in the transport layer.)
 
 > >How-To-Repeat:
 > 
 > Access every disk in the system, and produce a lot of I/O. I open all
 > disk devices in raw mode and do a lot of random seeks and reads.
 > However, we have experienced this error on mostly idle machines also.
 
 Except for the idle part, this sounds kinda like the CMD problem.
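For anyone trying to reproduce the load described under How-To-Repeat, random single-block reads can be approximated with `dd`. This is only a sketch under stated assumptions, not the submitter's test program (which was not posted): the function name `random_reads` and its arguments are hypothetical, the number of blocks on the device has to be supplied by the caller, and it reads one device at a time rather than all disks in parallel.

```shell
#!/bin/sh
# random_reads PATH NBLOCKS NREADS [BLOCKSIZE] [SEED]
# Read NREADS single blocks at random offsets in [0, NBLOCKS) from PATH.
# Hypothetical helper for illustration; PATH may be a file or a disk device.
random_reads() {
    path=$1; nblocks=$2; nreads=$3; bs=${4:-512}; seed=${5:-1}
    ok=0; failed=0
    # Generate all offsets in one awk call so a single seed covers the run.
    for off in $(awk -v n="$nreads" -v max="$nblocks" -v seed="$seed" \
        'BEGIN { srand(seed); for (i = 0; i < n; i++) print int(rand() * max) }'); do
        # Seek to the chosen block and read exactly one block, discarding it.
        if dd if="$path" of=/dev/null bs="$bs" skip="$off" count=1 2>/dev/null; then
            ok=$((ok + 1))
        else
            failed=$((failed + 1))
        fi
    done
    echo "$ok ok, $failed failed"
}
```

Running one instance per disk in the background, each with a different seed, would come closer to the "access every disk" load in the report.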
 
 One thing to try is disabling tagged queueing on both ports of the array.
 For example, to disable tagged queueing for the disk da20:
 
 camcontrol negotiate da20 -v -T disable -a
 
 Then try running your tests again, and see if the problem happens again.
 If so, it may be that the array has problems with tagged queueing on
 multiple LUNs, like the CMD array controllers.
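Since the array is seen as several disks on two buses, the camcontrol command above has to be repeated for each of them. A small wrapper along these lines may help. It is a sketch only: the function name `disable_tagged_queueing` and the `DRY_RUN` variable are hypothetical, and the device names in the usage line are placeholders to be replaced with the da devices that actually belong to the array.

```shell
#!/bin/sh
# disable_tagged_queueing DEV...
# Apply the camcontrol command from the message to each listed disk.
# DRY_RUN defaults to 1 here, which only prints the commands; set
# DRY_RUN=0 to actually run them (hypothetical convention for this sketch).
disable_tagged_queueing() {
    for dev in "$@"; do
        if [ "${DRY_RUN:-1}" = "1" ]; then
            echo "camcontrol negotiate $dev -v -T disable -a"
        else
            camcontrol negotiate "$dev" -v -T disable -a
        fi
    done
}
```

For example, `disable_tagged_queueing da3 da4` prints one command per disk for review, and `DRY_RUN=0 disable_tagged_queueing da3 da4` would run them.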
 
 Ken
 -- 
 Kenneth Merry
 ken@kdm.org
 

