Date:      Sat, 6 Dec 1997 14:23:05 -0600
From:      Tim Tsai <tim@futuresouth.com>
To:        "Bryn Wm. Moslow" <bryn@nwlink.com>
Cc:        freebsd-isp@FreeBSD.ORG, jayk@nwlink.com
Subject:   Re: Adaptec 2940/Seagate Failures
Message-ID:  <19971206142305.28932@futuresouth.com>
In-Reply-To: <Pine.GSO.3.95.971205120024.28222B-100000@utah>; from Bryn Wm. Moslow on Fri, Dec 05, 1997 at 02:42:08PM -0800
References:  <Pine.GSO.3.95.971205120024.28222B-100000@utah>

Bryn, we recently went through a similar set of problems on our news
server and the following steps have solved the problem for us.  We had
nearly identical error messages as you, by the way.

Our news server started with 9 SCSI hard disks (a combination of IBM and
Quantum drives) across 3 controllers (2 Adaptec 2940UW and an NCR).
While this setup would crash every now and then, it was tolerable and we
only had to manually recover it once every month or so.  The machine
initially had 2.2-current but I upgraded to 3.0-current hoping it would
fix the periodic problem.  It didn't help at all.  When I had to pull
the NCR controller out for another machine in an emergency everything
started to fall apart.  We would get the same type of messages as you
did, requiring a manual "fsck" each time it crashed.  It was rare that it
ran for more than 3 days without crashing, usually during news.daily or
network backup, or both.  So this was obviously I/O load related.

After checking cables, termination, software, etc. we finally came to an
arrangement that has been super stable for us:

1) Enable SCAM support on all controllers.  I don't know whether this
helped anything or not but somebody on the list advocated it and it
hasn't hurt us so I left it.
2) Disable all the AHC_* options.  For our hardware setup and load these
options haven't been necessary.  I am sure we lose some performance but it
is not noticeable.  When I am bored I may enable them again just to see
what happens.
3) I ran find and looked for all the files outside of /dev that either
do not belong to a user (-nouser), do not belong to a group (-nogroup),
or are character/block special devices (-type c -or -type b).  I found
that fsck doesn't always clean up the file system as it should and leaves
some of these special case files behind (usually with a crazy user id like
6123462).  When these files are subsequently accessed, the machine crashes
(obviously).  You should first get a list of these files and their inode
numbers, take the machine to single user mode, run "clri" on these inode
numbers, and then run fsck again.
4) Upgrade your motherboard to the latest BIOS.
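
Step 3 above can be sketched roughly as follows.  This is only an
illustration, not the exact command we ran: the function name
list_suspect_files is made up for this example, and it assumes a find
that supports -xdev, -path and -prune (current FreeBSD and GNU find do).
It only lists candidates; it does not modify anything.

```shell
#!/bin/sh
# list_suspect_files ROOT
# Hypothetical helper: print "ls -lid" output (inode number first) for
# every entry under ROOT that has no known owner, no known group, or is
# a character/block special device.  /dev itself is skipped, and -xdev
# keeps find from crossing into other mounted file systems.
list_suspect_files() {
    root=${1:-/}
    find "$root" -xdev \
        -path /dev -prune -o \
        \( -nouser -o -nogroup -o -type c -o -type b \) \
        -exec ls -lid {} +
}

# Suggested follow-up, done by hand in single user mode (the device name
# is a placeholder; substitute the file system the inodes live on):
#   clri <special-device> <inode-number> ...
#   fsck <special-device>
```

For example, "list_suspect_files /var > /tmp/suspects.txt" saves the list
for one file system so the inode numbers are at hand once the machine is
in single user mode.  Check each entry before clearing it, since clri
destroys the inode outright.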

This completely cleared up our problems.  I feel that if you do at least
#1 and #3 it should fix your problems, barring any hardware issues.

BTW, we are still running mixed wide/narrow drives on the same controller.
We've never had a problem with that, although the narrow drives go to
the narrow connector on the controller and the wide drives to the wide
connector, without any kind of adapters.

Hope that helps,

Tim

On Fri, Dec 05, 1997 at 02:42:08PM -0800, Bryn Wm. Moslow wrote:
> From: "Bryn Wm. Moslow" <bryn@nwlink.com>
> To: freebsd-isp@FreeBSD.ORG
> cc: jayk@nwlink.com
> Subject: Adaptec 2940/Seagate Failures
> Message-ID: <Pine.GSO.3.95.971205120024.28222B-100000@utah>
> MIME-Version: 1.0
> Content-Type: TEXT/PLAIN; charset=US-ASCII
> Sender: owner-freebsd-isp@FreeBSD.ORG
> X-Loop: FreeBSD.org
> Precedence: bulk

> Hello, sorry that this is a bit windy but I'm desperate:
> 
> I'm still having big problems with FreeBSD, the Adaptec 2940UW, and
> Seagate Drives. When the system gets heavily loaded (i.e. 65-75 sendmail
> processes, 20 or so poppers,) often it comes to a complete stop and sure
> enough I can get to the console and discover just about the same thing
> every time (which is at the bottom of this message along with my dmesg
> output for informational purposes.) I've tried both FreeBSD 2.2.2 and
> 2.2.5, sendmail 8.8.5, 8.8.6, 8.8.7, 8.8.8, qpopper 2.3, 2.4. I would like
> to note that disk I/O was much smoother and system load was significantly
> lower with 2.2.5 but it was only two hours under load before it pooped the
> first time as opposed to a couple days under 2.2.2. In fact, it was odd
> because the load was 0.7 and iostat was about 3800sps on sd2 average when
> the most recent death (see 'the hell' below) occurred.
> 
> I've been reading the long debate about the 2940 and FreeBSD for some time
> and just today went through my whole archive of freebsd-isp and noted some
> things. What especially stands out is the number of people saying that
> they have no problem, "it works great," and then noting that they're not
> really using it or only have a tape attached, etc., literally in the same
> breath.  The people who DO seem to be having problems are running the 2940
> under heavy load conditions and having to power cycle servers at horrible
> times like myself. If you have an archive of freebsd-isp do a search on
> "Adaptec 2940" and you'll see what I'm talking about. Just an observation
> and opinion: I'm not trying to PO anyone but I think there has to be more
> attention paid to the stability of the SCSI subsystem, specifically under
> heavy loads. Once again: I love you all, I love Chuck, please help me ;).
> 
> Notes:  
> 
> - I've used every combination of AHC_TAGENABLE, AHC_SCBPAGING_ENABLE, and
> AHC_ALLOW_MEMIO in the kernel possible and each one in cooperation with
> the others or on its own eventually brings down the system. 
> 
> - We've broken out the mail spool for local mail to a directory structure
> based on the first letters of username such as: /var/mail/u/us/username.
> This has helped overall but iostat still hits the roof when people get
> lots of mail (lists, spam) and pop3 is yanking down large mail files.
> 
> - Per advice from other FreeBSD users and non-FreeBSD users and an
> electrical engineer, the narrow drive is on a separate controller from
> the wide drives.
> 
> - The drives are all internal, the bus is terminated and the cable is only
> 0.5m. 
> 
> - The controller is in Ultra mode and I would like to keep it there if
> possible. I've tried it without to no avail anyway. I'm quite sure this
> should not be a problem as I have a BSDI 3.0 box running news that does at
> least ten times the I/O per day at a higher load on two Adaptec 2940's and
> 10 drives in Ultra mode with a 4-disk ccd (sp0 in BSDI) with the same CPU,
> but I want to believe in the power of FreeBSD. :) 
> 
> The hell: (This particular kernel was with AHC_SCBPAGING_ENABLE but I get 
> similar results with the other options and ultimately a bus failure
> and/or lockup and/or panic. I don't have as much trouble with 
> no extra ahc options but the system gets VERY s-l-o-w under load.)
>
> SSTAT1 = 0x3
> sd2(ahc1:2:0): abort message in message buffer
> sd2(ahc1:2:0): SCB 0x0 - timed out in message in phase, SCSISIGI == 0xf6
> SEQADDR = 0xd1 SCSISEQ = 0x12 SSTAT0 = 0x2 SSTAT1 = 0x3
> sd2(ahc1:2:0): no longer in timeout
> ahc1: Issued Channel A Bus Reset. 4 SCBs aborted
> sd2(ahc1:2:0): SCB 0x0 - timed out in message in phase, SCSISIGI == 0xe6
> SEQADDR = 0xd1 SCSISEQ = 0x12 SSTAT0 = 0x2 SSTAT1 = 0x3
> sd2(ahc1:2:0): abort message in message buffer
> sd2(ahc1:2:0): SCB 0x0 - timed out in message in phase, SCSISIGI == 0xf6
> SEQADDR = 0xd1 SCSISEQ = 0x12 SSTAT0 = 0x2 SSTAT1 = 0x3
> sd2(ahc1:2:0): no longer in timeout
> ahc1: Issued Channel A Bus Reset. 4 SCBs aborted
> sd2(ahc1:2:0): SCB 0x0 - timed out in message in phase, SCSISIGI == 0xe6
> SEQADDR = 0xd1 SCSISEQ = 0x12 SSTAT0 = 0x2 SSTAT1 = 0x3
> sd2(ahc1:2:0): abort message in message buffer
> sd2(ahc1:2:0): SCB 0x0 - timed out in message in phase, SCSISIGI == 0xf6
> SEQADDR = 0xd1 SCSISEQ = 0x12 SSTAT0 = 0x2 SSTAT1 = 0x3
> sd2(ahc1:2:0): no longer in timeout
> ahc1: Issued Channel A Bus Reset. 4 SCBs aborted
> sd2(ahc1:2:0): SCB 0x0 - timed out in message in phase, SCSISIGI == 0xe6
> SEQADDR = 0xd1 SCSISEQ = 0x12 SSTAT0 = 0x2 SSTAT1 = 0x3
> sd2(ahc1:2:0): abort message in message buffer
> sd2(ahc1:2:0): SCB 0x0 - timed out in message in phase, SCSISIGI == 0xf6
> SEQADDR = 0xd1 SCSISEQ = 0x12 SSTAT0 = 0x2 SSTAT1 = 0x3
> sd2(ahc1:2:0): no longer in timeout
> ahc1: Issued \M^?\^OA Bus Reset. 4 SCBs aborted
> sd2(ahc1:2:0): SCB 0x0 - timed out in message in phase, SCSISIGI == 0xe6
> SEQADDR = 0xd1 SCSISEQ = 0x12 SSTAT0 = 0x2 SSTAT1 = 0x3
> sd2(ahc1:2:0): abort message in message buffer
> sd2(ahc1:2:0): SCB 0x0 - timed out in message in phase, SCSISIGI == 0xf6
> SEQADDR = 0xd1 SCSISEQ = 0x12 SSTAT0 = 0x2 SSTAT1 = 0x3
> sd2(ahc1:2:0): no longer in timeout
> ahc1: Issued Channel A Bus Reset. 4 SCBs aborted
> sd2(ahc1:2:0): SCB 0x0 - timed out in message in phase, SCSISIGI == 0xe6
> SEQADDR = 0xd1 SCSISEQ = 0x12 SSTAT0 = 0x2 SSTAT1 = 0x3
> sd2(ahc1:2:0): abort message in message buffer
> sd2(ahc1:2:0): SCB 0x0 - timed out in message in phase, SCSISIGI == 0xf6
> SEQADDR = 0xd1 SCSISEQ = 0x12 SSTAT0 = 0x2 SSTAT1 = 0x3
> sd2(ahc1:2:0): no longer in timeout
> ahc1: Issued Channel A Bus Reset. 4 SCBs aborted
> sd2(ahc1:2:0): SCB 0x0 - timed out in message in phase, SCSISIGI == 0xe6
> SEQADDR = 0xd1 SCSISEQ = 0x12 SSTAT0 = 0x2 SSTAT1 = 0x3
> sd2(ahc1:2:0): abort message in message buffer
> sd2(ahc1:2:0): SCB 0x0 - timed out in message in phase, SCSISIGI == 0xf6
> SEQADDR = 0xd1 SCSISEQ = 0x12 SSTAT0 = 0x2 SSTAT1 = 0x3
> sd2(ahc1:2:0): no longer in timeout
> ahc1: Issued Channel A Bus Reset. 4 SCBs aborted
> sd2(ahc1:2:0): SCB 0x0 - timed out in message in phase, SCSISIGI == 0xe6
> SEQADDR = 0xd1 SCSISEQ = 0x12 SSTAT0 = 0x2 SSTAT1 = 0x3
> sd2(ahc1:2:0): abort message in message buffer
> sd2(ahc1:2:0): SCB 0x0 - timed out in message in phase, SCSISIGI == 0xf6
> SEQADDR = 0xd1 SCSISEQ = 0x12 SSTAT0 = 0x2 SSTAT1 = 0x3
> sd2(ahc1:2:0): no longer in timeout
> ahc1: Issued Channel A Bus Reset. 4 SCBs aborted
> sd2(ahc1:2:0): SCB 0x0 - timed out in message in phase, SCSISIGI == 0xe6
> SEQADDR = 0xd1 SCSISEQ = 0x12 SSTAT0 = 0x2 SSTAT1 = 0x3
> sd2(ahc1:2:0): abort message in message buffer
> sd2(ahc1:2:0): SCB 0x0 - timed out in message in phase, SCSISIGI == 0xf6
> SEQADDR = 0xd1 SCSISEQ = 0x12 SSTAT0 = 0x2 SSTAT1 = 0x3
> sd2(ahc1:2:0): no longer in timeout
> ahc1: Issued Channel A Bus Reset. 4 SCBs aborted
> sd2(ahc1:2:0): SCB 0x0 - timed out in message in phase, SCSISIGI == 0xe6
> SEQADDR = 0xd1 SCSISEQ = 0x12 SSTAT0 = 0x2 SSTAT1 = 0x3
> sd2(ahc1:2:0): abort message in message buffer
> sd2(ahc1:2:0): SCB 0x0 - timed out in message in phase, SCSISIGI == 0xf6
> SEQADDR = 0xd1 SCSISEQ = 0x12 SSTAT0 = 0x2 SSTAT1 = 0x3
> sd2(ahc1:2:0): no longer in timeout
> ahc1: Issued Channel A Bus Reset. 4 SCBs aborted
> sd2(ahc1:2:0): SCB 0x0 - timed out in message in phase, SCSISIGI == 0xe6
> SEQADDR = 0xd1 SCSISEQ = 0x12 SSTAT0 = 0x2 SSTAT1 = 0x3
> sd2(ahc1:2:0): abort message in message buffer
> sd2(ahc1:2:0): SCB 0x0 - timed out in message in phase, SCSISIGI == 0xf6
> SEQADDR = 0xd1 SCSISEQ = 0x12 SSTAT0 = 0x2 SSTAT1 = 0x3
> sd2(ahc1:2:0): no longer in timeout
> ahc1: Issued Channel A Bus Reset. 4 SCBs aborted
> sd2(ahc1:2:0): SCB 0x0 - timed out in message in phase, SCSISIGI == 0xe6
> SEQADDR = 0xd1 SCSISEQ = 0x12 SSTAT0 = 0x2 SSTAT1 = 0x3
> sd2(ahc1:2:0): abort message in message buffer
> sd2(ahc1:2:0): SCB 0x0 - timed out in message in phase, SCSISIGI == 0xf6
> SEQADDR = 0xd1 SCSISEQ = 0x12 SSTAT0 = 0x2 SSTAT1 = 0x3
> sd2(ahc1:2:0): no longer in timeout
> ahc1: Issued Channel A Bus Reset. 4 SCBs aborted
> sd2(ahc1:2:0): SCB 0x0 - timed out in message in phase, SCSISIGI == 0xe6
> SEQADDR = 0xd1 SCSISEQ = 0x12 SSTAT0 = 0x2 SSTAT1 = 0x3
> sd2(ahc1:2:0): abort message in message buffer
> sd2(ahc1:2:0): SCB 0x0 - timed out in message in phase, SCSISIGI == 0xf6
> SEQADDR = 0xd1 SCSISEQ = 0x12 SSTAT0 = 0x2 SSTAT1 = 0x3
> sd2(ahc1:2:0): no longer in timeout
> ahc1: Issued Channel A Bus Reset. 4 SCBs aborted
> sd2(ahc1:2:0): SCB 0x0 - timed out in message in phase, SCSISIGI == 0xe6
> SEQADDR = 0xd1 SCSISEQ = 0x12 SSTAT0 = 0x2 SSTAT1 = 0x3
> sd2(ahc1:2:0): abort message in message buffer
> sd2(ahc1:2:0): SCB 0x0 - timed out in message in phase, SCSISIGI == 0xf6
> SEQADDR = 0xd1 SCSISEQ = 0x12 SSTAT0 = 0x2 SSTAT1 = 0x3
> sd2(ahc1:2:0): no longer in timeout
> ahc1: Issued Channel A Bus Reset. 4 SCBs aborted
> sd2(ahc1:2:0): SCB 0x0 - timed out in message in phase, SCSISIGI == 0xe6
> SEQADDR = 0xd1 SCSISEQ = 0x12 SSTAT0 = 0x2 SSTAT1 = 0x3
> sd2(ahc1:2:0): abort message in message buffer
> sd2(ahc1:2:0): SCB 0x0 - timed out in message in phase, SCSISIGI == 0xf6
> SEQADDR = 0xd1 SCSISEQ = 0x12 SSTAT0 = 0x2 SSTAT1 = 0x3
> sd2(ahc1:2:0): no longer in timeout
> ahc1: Issued Channel A Bus Reset. 4 SCBs aborted
> sd2(ahc1:2:0): SCB 0x0 - timed out in message in phase, SCSISIGI == 0xe6
> SEQADDR = 0xd1 SCSISEQ = 0x12 SSTAT0 = 0x2 SSTAT1 = 0x3
> sd2(ahc1:2:0): abort message in message buffer
> sd2(ahc1:2:0): SCB 0x0 - timed out in message in phase, SCSISIGI == 0xf6
> SEQADDR = 0xd1 SCSISEQ = 0x12 SSTAT0 = 0x2 SSTAT1 = 0x3
> sd2(ahc1:2:0): no longer in timeout
> ahc1: Issued Channel A Bus Reset. 4 SCBs aborted
> 
> dmesg output: (No ahc kernel options)
> 
> FreeBSD 2.2.2-RELEASE #0: Wed Nov 19 13:31:09 PST 1997
>     bryn@alabama.nwlink.com:/usr/src/sys/compile/ALABAMA
> CPU: Pentium Pro (199.43-MHz 686-class CPU)
>   Origin = "GenuineIntel"  Id = 0x619  Stepping=9
>   Features=0xfbff<FPU,VME,DE,PSE,TSC,MSR,PAE,MCE,CX8,APIC,<b11>,MTRR,PGE,MCA,CMOV>
> real memory  = 268435456 (262144K bytes)
> avail memory = 257245184 (251216K bytes)
> Probing for devices on PCI bus 0:
> chip0 <Intel 82440FX (Natoma) PCI and memory controller> rev 2 on pci0:0
> chip1 <Intel 82371SB PCI-ISA bridge> rev 1 on pci0:1:0
> chip2 <Intel 82371SB IDE interface> rev 0 on pci0:1:1
> vx0 <3COM 3C905 Fast Etherlink XL PCI> rev 0 int a irq 12 on pci0:9
> mii[*mii*] address 00:60:08:0a:42:32
> ahc0 <Adaptec 2940 Ultra SCSI host adapter> rev 0 int a irq 10 on pci0:10
> ahc0: aic7880 Wide Channel, SCSI Id=7, 16 SCBs
> ahc0 waiting for scsi devices to settle
> (ahc0:0:0): "SEAGATE ST52160N 0285" type 0 fixed SCSI 2
> sd0(ahc0:0:0): Direct-Access 2069MB (4238282 512 byte sectors)
> vga0 <VGA-compatible display device> rev 0 on pci0:11
> ahc1 <Adaptec 2940 Ultra SCSI host adapter> rev 0 int a irq 11 on pci0:12
> ahc1: aic7880 Wide Channel, SCSI Id=7, 16 SCBs
> ahc1 waiting for scsi devices to settle
> (ahc1:1:0): "SEAGATE ST34572W 0718" type 0 fixed SCSI 2
> sd1(ahc1:1:0): Direct-Access 4340MB (8888924 512 byte sectors)
> (ahc1:2:0): "SEAGATE ST34572W 0784" type 0 fixed SCSI 2
> sd2(ahc1:2:0): Direct-Access 4340MB (8888924 512 byte sectors)
> Probing for devices on the ISA bus:
> sc0 at 0x60-0x6f irq 1 on motherboard
> sc0: VGA color <16 virtual consoles, flags=0x0>
> sio0 at 0x3f8-0x3ff irq 4 on isa
> sio0: type 16550A
> sio1 at 0x2f8-0x2ff irq 3 on isa
> sio1: type 16550A
> lpt0 at 0x378-0x37f irq 7 on isa
> lpt0: Interrupt-driven port
> lp0: TCP/IP capable interface
> fdc0 at 0x3f0-0x3f7 irq 6 drq 2 on isa
> fdc0: NEC 72065B
> fd0: 1.44MB 3.5in
> npx0 flags 0x1 on motherboard
> npx0: INT 16 interface
> WARNING: / was not properly dismounted.
> 
> Thanks for your time,
> Bryn


