From owner-freebsd-current Tue Sep 23 21:57:26 1997
Return-Path:
Received: (from root@localhost) by hub.freebsd.org (8.8.7/8.8.7) id VAA00560 for current-outgoing; Tue, 23 Sep 1997 21:57:26 -0700 (PDT)
Received: from nemesis.lonestar.org ([204.178.74.200]) by hub.freebsd.org (8.8.7/8.8.7) with SMTP id VAA00554 for ; Tue, 23 Sep 1997 21:57:21 -0700 (PDT)
Received: by nemesis.lonestar.org (Smail3.1.27.1 #22) id m0xDjU6-000twnC; Tue, 23 Sep 97 23:55 CDT
Message-Id:
Date: Tue, 23 Sep 97 23:55 CDT
To: current@freebsd.org
From: uhclem@nemesis.lonestar.org (Frank Durda IV)
Sent: Tue Sep 23 1997, 23:55:34 CDT
Subject: Is the SCSI "timed out while idle" bug supposed to be fixed?
Cc: uhclem.ds3@nemesis.lonestar.org
Sender: owner-freebsd-current@freebsd.org
X-Loop: FreeBSD.org
Precedence: bulk

With all of the recent tinkering with the kernel callout architecture, I have to ask whether everybody believes the "timed out while idle" class of SCSI errors has been fixed. The reason I ask is that I am still getting them, every few hours, on three different boxes, using the 2.2.2-RELENG-970921 kernel tree.

The most recent instance (copied off the screen a few minutes ago, since the system usually dies for good when this happens):

sd3(ahc0:3:0) SCB 0x3 - timed out while idle, LASTPHASE == 0x01, SCSISIGI == 0x00,
sd3(ahc0:3:0) SEQADDR=0x6 SCSISEQ=0x12 SSTAT=0x05 SSTAT1=0xA
sd3(ahc0:3:0) Queuing an Abort SCB
sd3(ahc0:3:0) Abort Message Sent
sd3(ahc0:3:0) SCB 3 - Abort Completed
sd3(ahc0:3:0) no longer in timeout
----and the system dies here----

The system is still pingable, sometimes fools SNMP monitors, and will even open a telnet session that then hangs, but the system is not usable and there is no subsequent disk activity. I've been getting these errors for two months now, and have updated the kernel a couple of times when I saw changes that purported to fix the problem go into the tree.
The volume of "events" has gone down slightly on the newer kernels, and sometimes the system is able to recover and keep going (although apps start dumping core, indicating to me that things didn't really come back sane), but the errors have not gone away.

The hardware in question is three different Pentium 133 systems, all three with Intel motherboards of the same make. All systems have 128Meg of RAM and a cache module. The kernel is as distributed, except for the inclusion of the CCD driver and the insertion of full-duplex ethernet code for the DE driver (amazing this isn't in the distribution yet). Most non-present peripherals have their drivers omitted from the kernel config. (Config file available on request.)

All three systems have four SCSI drives. With one exception on one system, all four drives are Quantum Fireball 4.3GB drives. SCSI controllers are Adaptec 2940U or 2940s. Drives are configured as logical units 0-3, and there are no tapes, CD-ROMs, or other peripherals, apart from one 1.44Meg floppy. The one odd system has a Seagate Barracuda for drive 0; drives 1 through 3 are always Quantum 4.3GBs.

Of the drives the SCSI driver decides to blame when an event occurs, 100% of the time it is a Quantum, and almost always it complains about a problem with drive two or three. If you take drives two and three and put them at different positions on the SCSI cable (correcting termination), the next time you get an error, it will still complain about drive two or three. If you replace drives two and three with brand-new drives, you will eventually get errors again, blaming drive two or three, mainly drive three. Once in a while, it spits out "not ready" errors for all four drives before reporting the "timed out while idle" error.

We bought 16 new Fireballs and have gradually gone through all of them, swapping out drives that reported these and other read-type errors. The problems sometimes go away on that drive number, or they may come right back. There is no obvious pattern.
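For anyone comparing configs, the kernel additions mentioned above amount to only a few lines. This is a sketch of what the relevant entries look like in a 2.2-era config, not our actual config file (which is available on request); unit numbers and the ccd count are illustrative:

```conf
# Adaptec 2940/2940U host adapter and the four SCSI disks
controller      ahc0
controller      scbus0
device          sd0

# DEC 21140 fast ethernet (DE driver, with the full-duplex patch applied)
device          de0

# Concatenated disk driver used for the /news stripe set
pseudo-device   ccd     4
```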
The only application is Diablo. There are no users. We have run Diablo 9, 10, and 11 on these boxes (with local enhancements we wrote to implement cancel flood feeds and a better history database), and the volume of errors did not really change as we moved from one version to the next.

Drives 1-3 form a CCD stripe set (12GB) for /news and its spool directories. Drive 0 normally contains a 150M root, 260M swap, and the rest for /usr.

Each box is connected with one or two 21140 ether cards, running in FDX mode (using the driver directly from the author to get FDX). These are connected to a full-duplex switch, which connects over a 100Mbit FDX link to a 7513 router. The 7513 shows each box doing about 4Mbit/sec of data. Each box has at least five full news feeds and a half-dozen partial feeds. There is also a vertical feed to another box that combines the sum from each box, plus cross-feeds between the first two boxes, and that sum is passed on to another box running inn, all over FDX links. These systems are beating on the spool drives very hard.

The boxes run anywhere from 20 minutes to two days before encountering a SCSI error, with the average life around two hours (although when awakened in the dead of night, it seems more frequent than that). Sometimes we can't even get through the cleanup-after-crash before getting another error, a time when there is minimal load on the system. NONE of the messages end up getting logged, unless it is one of those VERY RARE events where the system somewhat recovers and keeps running. Once in a while, they are accompanied by a panic which does not get logged.

To attempt to isolate/eliminate all these problems, we have:

o Replaced SCSI controller boards, not just moved them from one box to the other.
o Tried disabling the drive termination and using a real terminating pack.
o Tried running the 2940Us in non-Ultra mode.
o Replaced all SCSI cables with Ultra-compliant cables (<1.5M).
o Replaced drives repeatedly. Also reformatted several and re-tried them.
o Tried running drives in ASYNC mode, the slowest mode available.
o Removed half of main memory and, after another error, swapped the other 64Meg in, in case of a memory flaw. Note all memory is parity memory.
o Removed the cache module and ran without it.
o Since we found the four chips on the Fireballs live up to the name and become too hot to touch during active operation, we installed 100CFM fans blowing directly across the drive boards at a distance of 3".
o Concerned about power glitches, we added as much as 500 watts of power supply to one unit; the other units were increased to 300 and 400 watts. These supplies are connected to a 50KVA UPS system.
o Operated drives in horizontal, vertical, and upside-down positions.
o Measured vibration. Nowhere near the 1G operating limit.
o Put a logic analyzer on the SCSI bus, looking for glitches and noise. Found some glitches, which were resolved by replacing cables some weeks back. Also found the adapter (perhaps at the driver's direction) apparently issues resets from time to time, some occurring at the failure point. At other times, signals were well within noise and voltage specs.
o Monitored +5 and +12 for excessive jitter or undervolts. Replaced two supplies that had more than 0.06 volt ripple, well under the allowed spec. Supplies currently being used have under 0.035 volt ripple when the drives are active.
o Since the driver also sometimes spits out one or more Drive Not Ready errors at or near the "timed out while idle" error, tried wire-wrapping the drive selects on the drives, in case the jumper blocks were not making good contact.

Two months of isolating and trials, and it still fails.

Now, all of that said, if I build a box with two Seagate drives (a 2GB and a 9GB were all I had handy), I don't get any more errors, at least not over the one-week period this was tried. Now, there are a lot of other variables here, and I plan to re-test this config with the same two drives using the 970921 kernel, but it was an interesting data point.
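For reference, the CCD stripe set described earlier (drives 1-3 concatenated into 12GB for /news) is set up the standard way with ccdconfig. This is a hedged sketch of that arrangement, not our exact commands; the interleave factor and the `e` partition letters are assumed example values:

```sh
# Stripe sd1-sd3 into one ccd volume (interleave of 128 sectors
# and partition letters are illustrative, not necessarily what we use)
ccdconfig ccd0 128 0 /dev/sd1e /dev/sd2e /dev/sd3e

# Build a filesystem on the stripe set and mount it for the news spool
newfs /dev/rccd0c
mount /dev/ccd0c /news
```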
Also, I can take these Quantum drives that get these errors and stick them in a DECstation running NetBSD, and I don't get any errors, but I can't load them down the same way for various reasons.

It seems unlikely that Quantum managed to send us 16 crappy drives that have a built-in personal grudge against FreeBSD. Granted, out of the 16 we have had two completely fail to date, and those won't format under the Adaptec BIOS, but even so, the fact that the drives work on different platforms suggests that there might be some timing issue within the SCSI driver that hasn't been stamped out yet. The continuing appearance of the illogical-sounding "timed out while idle" message makes us think there is still a driver problem that Quantum drives tickle, perhaps more frequently than other brands.

Quantum support doesn't seem to be aware of any problems like this, at least none that the low-level tech I reached would admit to. There doesn't appear to be any newer firmware for these drives either. I didn't get much of a warm fuzzy from these guys, probably because I wasn't running Windows '95, and that seems to be all they know about.

I'll be happy to file this as a send-pr, but if the SCSI driver is known to still be broken on this point, I won't waste the effort. Any feedback would be appreciated.

Frank Durda IV - only these addresses work:| | or | These Anti-spam addresses expire Oct. 15th |