From owner-freebsd-current Tue Sep 23 21:57:26 1997
Return-Path:
Received: (from root@localhost) by hub.freebsd.org (8.8.7/8.8.7) id VAA00560 for current-outgoing; Tue, 23 Sep 1997 21:57:26 -0700 (PDT)
Received: from nemesis.lonestar.org ([204.178.74.200]) by hub.freebsd.org (8.8.7/8.8.7) with SMTP id VAA00554 for ; Tue, 23 Sep 1997 21:57:21 -0700 (PDT)
Received: by nemesis.lonestar.org (Smail3.1.27.1 #22) id m0xDjU6-000twnC; Tue, 23 Sep 97 23:55 CDT
Message-Id:
Date: Tue, 23 Sep 97 23:55 CDT
To: current@freebsd.org
From: uhclem@nemesis.lonestar.org (Frank Durda IV)
Sent: Tue Sep 23 1997, 23:55:34 CDT
Subject: Is the SCSI "timed out while idle" bug supposed to be fixed?
Cc: uhclem.ds3@nemesis.lonestar.org
Sender: owner-freebsd-current@freebsd.org
X-Loop: FreeBSD.org
Precedence: bulk

With all of the recent tinkering with the kernel callout architecture, I have to ask whether everybody believes the "timed out while idle" class of SCSI errors has been fixed. The reason I ask is that I am still getting them, every few hours, on three different boxes, using the 2.2.2-RELENG-970921 kernel tree.

The most recent instance (copied off the screen a few minutes ago, since the system usually dies for good when this happens):

sd3(ahc0:3:0) SCB 0x3 - timed out while idle, LASTPHASE == 0x01, SCSISIGI == 0x00,
sd3(ahc0:3:0) SEQADDR=0x6 SCSISEQ=0x12 SSTAT=0x05 SSTAT1=0xA
sd3(ahc0:3:0) Queuing an Abort SCB
sd3(ahc0:3:0) Abort Message Sent
sd3(ahc0:3:0) SCB 3 - Abort Completed
sd3(ahc0:3:0) no longer in timeout
----and the system dies here----

The system is still pingable, sometimes fools SNMP monitors, and will even open a telnet session that then hangs, but the system is not usable and there is no subsequent disk activity. I've been getting these errors for two months now, and have updated the kernel a couple of times when I saw changes that purported to fix the problem go into the tree.
The volume of "events" has gone down slightly on the newer kernels, and sometimes the system is able to recover and keep going (although apps start dumping core, indicating to me that things didn't really come back sane), but the errors have not gone away.

The hardware in question is three different Pentium 133 systems, all three with Intel motherboards of the same make. All systems have 128Meg of RAM and a cache module. The kernel is as distributed, except for the inclusion of the CCD driver and the insertion of full-duplex ethernet code for the DE driver (amazing this isn't in the distribution yet). Most non-present peripherals have their drivers omitted from the kernel config. (Config file available on request.)

All three systems have four SCSI drives. With one exception on one system, all four drives are Quantum Fireball 4.3GB drives. SCSI controllers are Adaptec 2940U or 2940s. Drives are configured as logical units 0-3, and there are no tapes, CD-ROMs, or other peripherals, apart from one 1.44Meg floppy. The one odd system has a Seagate Barracuda for drive 0; drives 1 through 3 are always Quantum 4.3GBs.

Of the drives the SCSI driver decides to blame when an event occurs, 100% of the time it is a Quantum, and almost always it complains about a problem with drive two or three. If you take drives two and three and put them at different positions on the SCSI cable (correcting termination), the next time you get an error, it will still complain about drive two or three. If you replace drives two and three with brand-new drives, you will eventually get errors again, blaming drive two or three, mainly drive three. Once in a while, it spits out "not ready" errors for all four drives before reporting the "timed out while idle" error.

We bought 16 new Fireballs and have gradually gone through all of them, swapping out drives that reported these and other read-type errors. The problems sometimes go away on that drive number, or they may come right back. There is no obvious pattern.
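For anyone comparing configs, the kernel additions mentioned above amount to only a few lines. This is a sketch of what the relevant entries look like in a 2.2-era config, not our actual config file (which is available on request); unit numbers and the ccd count are illustrative:

```conf
# Adaptec 2940/2940U host adapter and the four SCSI disks
controller      ahc0
controller      scbus0
device          sd0

# DEC 21140 fast ethernet (DE driver, with the full-duplex patch applied)
device          de0

# Concatenated disk driver used for the /news stripe set
pseudo-device   ccd     4
```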
The only application is Diablo. There are no users. We have run Diablo 9, 10, and 11 on these boxes (with local enhancements we wrote to implement cancel flood feeds and a better history database), and the volume of errors did not really change as we moved from one version to the next.

Drives 1-3 form a CCD stripe set (12GB) for /news and its spool directories. Drive 0 normally contains a 150M root, 260M swap, and the rest for /usr.

Each box is connected with one or two 21140 ether cards, running in FDX mode (using the driver directly from the author to get FDX). These are connected to a full-duplex switch, which connects over a 100Mbit FDX link to a 7513 router. The 7513 shows each box doing about 4Mbit/sec of data. Each box has at least five full news feeds and a half-dozen partial feeds. There is also a vertical feed to another box that combines the sum from each box, plus cross-feeds between the first two boxes, and that sum is passed on to another box running inn, all over FDX links. These systems are beating on the spool drives very hard.

The boxes run anywhere from 20 minutes to two days before encountering a SCSI error, with the average life around two hours (although when awakened in the dead of night, it seems more frequent than that). Sometimes we can't even get through the cleanup-after-crash before getting another error, a time when there is minimal load on the system. NONE of the messages end up getting logged, unless it is one of those VERY RARE events where the system somewhat recovers and keeps running. Once in a while, they are accompanied by a panic which does not get logged.

To attempt to isolate/eliminate all these problems, we have:

o Replaced SCSI controller boards, not just moved them from one box to the other.
o Tried disabling the drive termination and using a real terminating pack.
o Tried running the 2940Us in non-Ultra mode.
o Replaced all SCSI cables with Ultra-compliant cables (<1.5M).
o Replaced drives repeatedly. Also reformatted several and re-tried them.
o Tried running drives in ASYNC mode, the slowest mode available.
o Removed half of main memory and, after another error, swapped the other 64Meg in, in case of a memory flaw. Note all memory is parity memory.
o Removed the cache module and ran without it.
o Since we found the four chips on the Fireballs live up to the name and become too hot to touch during active operation, we installed 100CFM fans blowing directly across the drive boards at a distance of 3".
o Concerned about power glitches, we added as much as 500 watts of power supply to one unit; the other units were increased to 300 and 400 watts. These supplies are connected to a 50KVA UPS system.
o Operated drives in horizontal, vertical, and upside-down positions.
o Measured vibration. Nowhere near the 1G operating limit.
o Put a logic analyzer on the SCSI bus, looking for glitches and noise. Found some glitches, which were resolved by replacing cables some weeks back. Also found the adapter (perhaps at the driver's direction) apparently issues resets from time to time, some occurring at the failure point. At other times, signals were well within noise and voltage specs.
o Monitored +5 and +12 for excessive jitter or undervolts. Replaced two supplies that had more than 0.06 volt ripple, well under the allowed spec. Supplies currently being used have under 0.035 volt ripple when the drives are active.
o Since the driver also sometimes spits out one or more Drive Not Ready errors at or near the "timed out while idle" error, tried wire-wrapping the drive selects on the drives, in case the jumper blocks were not making good contact.

Two months of isolating and trials, and it still fails.

Now, all of that said, if I build a box with two Seagate drives (a 2GB and a 9GB were all I had handy), I don't get any more errors, at least not over the one-week period this was tried. Now, there are a lot of other variables here, and I plan to re-test this config with the same two drives using the 970921 kernel, but it was an interesting data point.
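For reference, the CCD stripe set described earlier (drives 1-3 concatenated into 12GB for /news) is set up the standard way with ccdconfig. This is a hedged sketch of that arrangement, not our exact commands; the interleave factor and the `e` partition letters are assumed example values:

```sh
# Stripe sd1-sd3 into one ccd volume (interleave of 128 sectors
# and partition letters are illustrative, not necessarily what we use)
ccdconfig ccd0 128 0 /dev/sd1e /dev/sd2e /dev/sd3e

# Build a filesystem on the stripe set and mount it for the news spool
newfs /dev/rccd0c
mount /dev/ccd0c /news
```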
Also, I can take these Quantum drives that get these errors and stick them in a DECstation running NetBSD, and I don't get any errors, but I can't load them down the same way for various reasons.

It seems unlikely that Quantum managed to send us 16 crappy drives that have a built-in personal grudge against FreeBSD. Granted, out of the 16 we have had two completely fail to date, and those won't format under the Adaptec BIOS, but even so, the fact that the drives work on different platforms suggests that there might be some timing issue within the SCSI driver that hasn't been stamped out yet. The continuing appearance of the illogical-sounding "timed out while idle" message makes us think there is still a driver problem that Quantum drives tickle, perhaps more frequently than other brands.

Quantum support doesn't seem to be aware of any problems like this, at least none that the low-level tech I reached would admit to. There doesn't appear to be any newer firmware for these drives either. I didn't get much of a warm fuzzy from these guys, probably because I wasn't running Windows '95, and that seems to be all they know about.

I'll be happy to file this as a send-pr, but if the SCSI driver is known to still be broken on this point, I won't waste the effort. Any feedback would be appreciated.

Frank Durda IV - only these addresses work:| | or | These Anti-spam addresses expire Oct. 15th |