Date:      Sat, 4 Oct 97 17:04 CDT
From:      uhclem@nemesis.lonestar.org (Frank Durda IV)
To:        uhclem.bsd@FreeBSD.ORG, gibbs@FreeBSD.ORG, freebsd-bugs@FreeBSD.ORG
Subject:   Re: kern/4686  
Message-ID:  <m0xHcIt-000twhC@nemesis.lonestar.org>

[0]Neither the aic7xxx driver nor the generic SCSI layer ever
[0]manually remap sectors on a drive.  I'd be interested in knowing
[0]what code in the aic7xxx driver you modified to remove block
[0]reassignment, as that feature doesn't exist.

Ah, serves me right for believing someone else without examining their
hack in detail.  All they did was cause the kernel to panic when the
message about "reassigning" was displayed, probably assuming that further
on down in the code, the reassignment was being explicitly done by
our driver.

(The PR was filed on the driver exactly as it stands in 2.2.5-BETA,
 with no modifications of any kind.)

I must say that the current error message the driver emits certainly implies
that reassignment is being performed by *somebody*.  Since not all drives do
this automatically, I assumed the driver was always doing the work, using
the explicit reassign commands that are available.  My apologies.

Perhaps the kernel error message should be changed to *suggest* that a
block reassignment should be done.

Almost as proof of what you claim, I got two errors back-to-back this
morning on the Barracuda, both in the same block.  If reassignment
were being performed, this would imply that the reassigned spare was also
bad.  With no reassignment being performed, it just hit the same bad
block again.


[0]As to the aic7xxx/SCSI layer incorrectly reporting media errors,
[0]this just isn't possible.  The aic7xxx driver simply returns the
[0]sense information that the drive provides, and in this case, it
[0]is telling us that it believes its media is bad.

Ok, is there an offset difference caused by the partitioning or slicing
of the drive?  Some of the errors would make sense if that was the case.
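
A minimal sketch of the kind of offset in question, assuming the driver
reports a block number relative to the start of a slice or partition while
a BIOS-level scan reports an absolute block number.  The function name and
all numbers are hypothetical, for illustration only; this is not taken from
the driver.

```python
def absolute_block(slice_start, partition_offset, relative_block):
    """Translate a partition-relative block number to an absolute one.

    slice_start      -- first block of the slice on the disk (hypothetical)
    partition_offset -- first block of the partition within the slice
    relative_block   -- block number counted from the partition start
    """
    if min(slice_start, partition_offset, relative_block) < 0:
        raise ValueError("block numbers must be non-negative")
    return slice_start + partition_offset + relative_block


# Hypothetical layout: slice begins at block 63, partition 1000 blocks in.
reported_by_driver = 5000  # assumed here to be partition-relative
print(absolute_block(63, 1000, reported_by_driver))
```

If two tools disagreed by a constant like this, their reported block
numbers would line up once the offset was added.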


[0]So, if the aic7xxx driver isn't remapping your sectors, who is?

No one, I understand you clearly.  Must beat on person who gave me bad
info.


[0]Why don't these "bad blocks" show up during a "media verify" operation?

Ah, but if the Barracudas have automatic reassignment off by default, this
would not be explained.  It would also not explain a few cases where the
Quantums did find bad blocks at exactly the same block number reported by
the driver and the BIOS scan program, assuming the Quantums are doing
reassignment by default.  I think something may still be odd here, although
it is possible the Adaptec scan program is lying or using an offset (seems
unlikely).


[0]As to your point about bus resets being dangerous if multiple targets are
[0]active, this really isn't the case.  Bus resets are used to recover devices
[0]that don't seem to be responding and the driver/generic SCSI layer can deal
[0]with the consequences of a bus reset if the code believes it is necessary.

I am quite familiar with this point (I wrote one of the first, if not the
first, SCSI drivers for a multitasking platform (68000 XENIX) back in 1984
for the old Bernoulli drives).  However, I'm also aware that SCSI devices
doing write operations are allowed to abort uncleanly in response to the RST
signal.  This has been in there since the original 1984 SCSI specification
drafts.  It was put in there to allow tape devices to be aborted during write
operations that were repeatedly failing, i.e. running through half the length
of a 9-track tape writing permanent gaps over and over while looking
for a good read-back of data just written, which I have seen happen.

The same "blow off what you are doing" exists in disk devices.  You can end
up with incompletely written blocks.  The Iomegas used to do it, leaving
sectors half written with CRC/ECC data from the previous occupant.  I know
for a fact that the old CDC SCSI hard disk drives did this too.  Granted
that SCSI-II provides "soft reset", but the old style (now called "hard
reset" in SCSI-II section 6.2.2 if I recall correctly) allows for the
blowing-away of operations in progress.

Then there is the additional problem of some vendors just not implementing
the "soft reset" correctly and they toss their unwritten write cache.
Very nasty.

Consequently, the big hammer (RST signal) really should not be used while
any of the drives are performing write operations, if at all possible.
It should be the last resort.  Even if you try to avoid using it during
write operations you initiated, there remain problems because drives
that are performing automatic error correction, block reassignment and
delayed cached writes are performing hidden write operations which can also
be aborted "at a bad time" by a bus RESET.
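
The "last resort" ordering argued for above can be sketched roughly as
follows.  This is only an illustration of the policy, not how the aic7xxx
driver or the generic SCSI layer behaves; the Target fields, the function
name and the action strings are all hypothetical.

```python
from dataclasses import dataclass


@dataclass
class Target:
    """Hypothetical view of one SCSI target as the host sees it."""
    scsi_id: int
    writes_in_flight: int = 0  # writes the host issued, not yet completed
    may_write_in_background: bool = False  # e.g. write cache or auto-reassign


def choose_recovery(targets, hung_id, gentler_attempts_failed):
    """Pick the least destructive recovery step for a hung target."""
    if not gentler_attempts_failed:
        # Try measures aimed at the hung target alone before anything else.
        return "reset-target-only"
    others = [t for t in targets if t.scsi_id != hung_id]
    at_risk = [t.scsi_id for t in others
               if t.writes_in_flight > 0 or t.may_write_in_background]
    if at_risk:
        # Let visible writes drain before asserting RST.  Hidden writes
        # cannot be observed by the host, so this only reduces the risk.
        return "quiesce-then-bus-reset"
    return "bus-reset"
```

The point of the ordering is simply that a bus-wide reset is chosen only
after the per-target options are exhausted, and that the host cannot fully
rule out hidden writes on the other drives.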


[0]One thing I do know about the FireBall ST is that the currently
[0]shipping firmware can become unstable under certain, rapid, seek
[0]patterns.  I doubt that a 1542 is capable of generating the load
[0]required to see this problem.

Makes sense, but we also see the problems on the Seagate 9GB Barracuda
drives.  All the errors in the PR were from one of the Barracudas under load.
It is unlikely that Seagate and Quantum share much firmware.

The question of why things eventually go into a loop displaying
messages for blocks all over the disk faster than you can read them
remains.  Unfortunately, the system always dies when this happens and
leaves me no logs to send.  Only the intermittent message failure (as
included in the PR) leaves me with any logged evidence.


[0]You might try contacting Quantum Tech support to see if you can
[0]obtain new firmware in advance.

Indeed, we will try beating higher in the Quantum food chain on Monday.
Everybody we spoke with to date claimed there were no problems with
the drive firmware, there is no newer firmware, your OS is the problem,
why aren't you running Windows '95, etc.   Sigh.   

If possible, please send me (via EMAIL) your firm's name and the name of
any technical contact you had with Quantum, just so I can point them to
a part of the company that has heard of problems with the ST drives.

Thanks for the reply.


Frank Durda IV - only these addresses work:|"The Knights who say "LETNi"
   <uhclem.bsd%nemesis.lonestar.org>       | demand... A SEGMENT REGISTER!!!"
or <uhclem.bsd%uhclem%rwsystr.nkn.net>     |"A what?"
These Anti-spam addresses expire Nov. 15th |"LETNi! LETNi! LETNi!" - 1983



