Date: Sat, 4 Oct 97 17:04 CDT
From: uhclem@nemesis.lonestar.org (Frank Durda IV)
To: uhclem.bsd@FreeBSD.ORG, gibbs@FreeBSD.ORG, freebsd-bugs@FreeBSD.ORG
Subject: Re: kern/4686
Message-ID: <m0xHcIt-000twhC@nemesis.lonestar.org>

[0]Neither the aic7xxx driver nor the generic SCSI layer ever
[0]manually remap sectors on a drive. I'd be interested in knowing
[0]what code in the aic7xxx driver you modified to remove block
[0]reassignment, as that feature doesn't exist.

Ah, serves me right for believing someone else without examining their
hack in detail. All they did was cause the kernel to panic when the
message about "reassigning" was displayed, probably assuming that further
on down in the code, the reassignment was being explicitly done by our
driver. (The PR was filed on the driver exactly as it stands in
2.2.5-BETA, with no modifications of any kind.)

I must say that the current error message the driver emits certainly
implies that reassignment is being performed by *somebody*. Since not
all drives do this automatically, I assumed the driver was always doing
the work, using the explicit reassign commands that are available.
My apologies.

Perhaps the kernel error message should be changed to *suggest* that
a block reassignment should be done.

Almost as proof of what you claim, I got two errors back-to-back this
morning on the Barracuda, both in the same block. If reassignment were
being performed, this would imply that the reassigned spare was also bad.
With no reassignment being performed, it just hit the same bad block
again.

[0]As to the aic7xxx/SCSI layer incorrectly reporting media errors,
[0]this just isn't possible. The aic7xxx driver simply returns the
[0]sense information that the drive provides, and in this case, it
[0]is telling us that it believes its media is bad.

Ok, is there an offset difference caused by the partitioning or slicing
of the drive? Some of the errors would make sense if that was the case.

[0]So, if the aic7xxx driver isn't remapping your sectors, who is?

No one, I understand you clearly. Must beat on person who gave me bad
info.

[0]Why don't these "bad blocks" show up during a "media verify" operation?

Ah, but if the Barracudas have it off by default, this would not be
explained. It would also not explain a few cases where the Quantums
did find blocks exactly at the same block number reported by the driver
and the BIOS scan program, assuming the Quantums are doing reassignment
by default. I think something may still be odd here, although it is
possible the Adaptec scan program is lying or using an offset (seems
unlikely).

[0]As to your point about bus resets being dangerous if multiple targets are
[0]active, this really isn't the case. Bus resets are used to recover devices
[0]that don't seem to be responding and the driver/generic SCSI layer can deal
[0]with the consequences of a bus reset if the code believes it is necessary.

I am quite familiar with this point (I wrote one of the first, if not
the first, SCSI drivers for a multitasking platform (68000 XENIX) back in
1984 for the old Bernoulli drives). However, I'm also aware that SCSI
devices doing write operations are allowed to abort uncleanly in response
to the RST signal. This has been in there since the original 1984 SCSI
specification drafts. It was put in there to allow tape devices to be
aborted during write operations that were repeatedly failing, i.e. running
through half the length of a 9-track tape writing permanent gaps over
and over as the drive looked for a good read-back of data just written,
which I have seen happen.

The same "blow off what you are doing" exists in disk devices. You can
end up with incompletely written blocks. The Iomegas used to do it,
leaving sectors half written with CRC/ECC data from the previous occupant.
I know for a fact that the old CDC SCSI hard disk drives did this too.

Granted that SCSI-II provides "soft reset", but the old style (now called
"hard reset" in SCSI-II section 6.2.2 if I recall correctly) allows for
the blowing-away of operations in progress.
Then there is the additional problem of some vendors just not implementing
the "soft reset" correctly, so they toss their unwritten write cache.
Very nasty.

Consequently, the big hammer (RST signal) really should not be used while
any of the drives are performing write operations, if at all possible.
It should be the last resort. Even if you try to avoid using it during
write operations you initiated, there remain problems because drives
that are performing automatic error correction, block reassignment and
delayed cached writes are performing hidden write operations which can
also be aborted "at a bad time" by a bus RESET.

[0]One thing I do know about the FireBall ST is that the currently
[0]shipping firmware can become unstable under certain, rapid, seek
[0]patterns. I doubt that a 1542 is capable of generating the load
[0]required to see this problem.

Makes sense, but we see the problems on the Seagate 9GB Barracuda
drives too. All the errors in the PR were from one of the Barracudas
under load. It is unlikely Seagate and Quantum share much firmware.

The question of why things eventually go into a loop displaying messages
for blocks all over the disk faster than you can read them remains.
Unfortunately, the system always dies when this happens and leaves me no
logs to send. Only the intermittent message failure (as included in the
PR) leaves me with any logged evidence.

[0]You might try contacting Quantum Tech support to see if you can
[0]obtain new firmware in advance.

Indeed, we will try beating higher in the Quantum food chain on Monday.
Everybody we spoke with to date claimed there were no problems with the
drive firmware, there is no newer firmware, your OS is the problem,
why aren't you running Windows '95, etc. Sigh.

If possible, please send me (via EMAIL) your firm's name and the name of
any technical contact you had with Quantum, just so I can point them to
a part of the company that has heard of problems with the ST drives.
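
The "RST as a last resort" argument above can be sketched as an escalation
ladder. This is only an illustration, not code from the aic7xxx driver or
the generic SCSI layer: the names Recovery and next_recovery_step, and the
idea of a grace period for writes on other targets, are hypothetical. It
also cannot see the hidden drive-internal writes mentioned above, so it
narrows the window for damage rather than closing it.

```python
from enum import Enum


class Recovery(Enum):
    ABORT_COMMAND = 1    # ABORT message for the stuck command only
    RESET_TARGET = 2     # BUS DEVICE RESET message, one target only
    WAIT_FOR_WRITES = 3  # hypothetical grace period for other targets
    RESET_BUS = 4        # assert RST: hits every device on the bus


def next_recovery_step(attempts, writes_pending_elsewhere, waited_already):
    """Pick the mildest recovery action that has not been tried yet.

    attempts: set of Recovery values already tried on the stuck target.
    writes_pending_elsewhere: True if the host has writes outstanding on
        other targets (drive-internal writes are invisible to the host).
    waited_already: True once the grace period for those writes is over.
    """
    if Recovery.ABORT_COMMAND not in attempts:
        return Recovery.ABORT_COMMAND
    if Recovery.RESET_TARGET not in attempts:
        return Recovery.RESET_TARGET
    # Only the bus-wide reset is left; hold off once if it would
    # interrupt writes the host knows about on other targets.
    if writes_pending_elsewhere and not waited_already:
        return Recovery.WAIT_FOR_WRITES
    return Recovery.RESET_BUS
```

A caller would keep invoking next_recovery_step() after each failed
attempt, adding the action it just tried to the attempts set, and would
only ever reach RESET_BUS after the two target-scoped actions have failed.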

Thanks for the reply.

Frank Durda IV - only these addresses work:|"The Knights who say "LETNi"
<uhclem.bsd%nemesis.lonestar.org>          | demand...  A SEGMENT REGISTER!!!"
or <uhclem.bsd%uhclem%rwsystr.nkn.net>     |"A what?"
These Anti-spam addresses expire Nov. 15th |"LETNi! LETNi! LETNi!"  - 1983