Date: Wed, 25 Feb 1998 15:10:42 -0600
From: Karl Denninger <karl@mcs.net>
To: "Justin T. Gibbs" <gibbs@plutotech.com>
Cc: hackers@FreeBSD.ORG
Subject: Re: SCSI Sense ASC 11, ASCQ 0x0c - Unrecovered read errors
Message-ID: <19980225151042.07314@mcs.net>
In-Reply-To: <199802252057.NAA24440@pluto.plutotech.com>; from Justin T. Gibbs on Wed, Feb 25, 1998 at 01:54:08PM -0700
References: <19980224105842.07731@mcs.net> <199802252057.NAA24440@pluto.plutotech.com>

On Wed, Feb 25, 1998 at 01:54:08PM -0700, Justin T. Gibbs wrote:
>
> In article <19980224105842.07731@mcs.net> you wrote:
> > Hi folks,
> >
> > I have a question...
> >
> > Right now, as the driver stands, if you get a sense return on a disk of
> > 0x11,0x0c ("Unrecovered read error - recommend rewrite the block"), the
> > driver does not attempt to do anything about it.
> >
> > Why?
> >
> > You're screwed in this case - the data is gone. But, some RAID controllers
> > (notably the CMD adapters) will *FIX* such an error if you write back to the
> > block.
>
> Is the data really gone? Isn't that for the user to decide? I've known
> disks to report temporary media errors that "disappear" after they are
> moved, the temperature changes, or the moon goes full.
>
> > Here's the scenario:
> >
> > 1) You have a failure on a data drive. It gets reported back with
> > sense ASC 0x11, ASCQ 0x0c.
> >
> > 2) The driver does not attempt to do anything other than report the
> > error.
>
> I don't believe that it is the driver's responsibility to take action
> in this case.
>
> > This sounds like bogus behavior to me. Here's why:
> >
> > You've ALREADY lost the data.
>
> This is arguable.
>
> > There is no harm in trying to
> > "fix it". Thus, why not do the following:
>
> How does the driver know what it means to fix it? If the bad block
> is in the MBR, the system may well have this information somewhere
> in core to restore the data. If the data is in the filesystem,
> writing one pattern might cause the FS to crash the kernel or
> confuse fsck, while another may minimize damage. If the "client"
> of the driver is not going to get the data it expects, an error
> should be returned, period.
>
> > a) Attempt a forced reassign of the block.
> > b) If that FAILS, write zeros into the block.
> >
> > Why do these things you ask? Simple:
> >
> > 1) The error, if repeated (or even singly) may cause a panic. If it's
> > in a swap area, for example, you're screwed - you're probably
> > reading back a page of an executable from the paging space, and if
> > it's corrupted you're going down.
>
> The swap pager should terminate the program(s) needing that block
> if it receives an I/O error. This should not panic the system.
>
> If, on the other hand, you remap the block, and silently return garbage
> data, you may well cause behavior that is unrecoverable.
>
> > 2) If it's a data file you MIGHT die. There's no way to know.
>
> And the FS may be able to clean up its data structures to minimize
> the effect of a missing/corrupted block of data if you tell it that
> the read operation failed. If you remap it and return garbage, who
> knows what will happen.
>
> > 3) IF YOU DON'T "FIX" IT, YOU WILL GET KILLED EVENTUALLY.
>
> This need not be the case.
>
> > With a regular disk, (a) above will succeed. You may still crash, but at
> > least you should come back up. If the data was a file, it's gone anyway -
> > likewise for a directory. There is no harm in trying to prevent FUTURE
> > errors at that point.
>
> I have no problem with the client of the data taking some action to clear
> an I/O error. There may even need to be an additional API to do this,
> but the disk driver does not have sufficient information to make the
> decision on how to perform that recovery. The only safe thing is to
> report the error until some external action is taken.
>
> If the system is not properly dealing with EIO conditions, that is
> certainly a bug, but your suggested fix is not a correct solution.
This condition, right now, causes a *HANG* if it happens on *DATA FILES*.
Not a panic, not an error return, not termination of the offending process.
A hard system crash which requires a RESET to recover from.
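
To make the two positions in this thread concrete, here is a minimal sketch
in C. It is not the FreeBSD driver's code: every name in it (handle_read_failure,
POLICY_REPORT_ONLY, and so on) is hypothetical, and it assumes the failed read
still returns EIO to the caller under both policies, which the thread does not
settle for the auto-repair case.

```c
#include <errno.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical names throughout; not an actual driver interface. */
#define ASC_UNRECOVERED_READ   0x11  /* additional sense code from the thread */
#define ASCQ_RECOMMEND_REWRITE 0x0c  /* "recommend rewrite the block" */

enum recovery_step { STEP_NONE, STEP_REASSIGN, STEP_WRITE_ZEROS };
enum policy { POLICY_REPORT_ONLY, POLICY_AUTO_REPAIR };

struct outcome {
    enum recovery_step step;  /* what the driver would do to the block */
    int error;                /* errno value handed back to the client */
};

/* True only for the ASC/ASCQ pair under discussion. */
static bool is_unrecovered_read_rewrite(uint8_t asc, uint8_t ascq)
{
    return asc == ASC_UNRECOVERED_READ && ascq == ASCQ_RECOMMEND_REWRITE;
}

/*
 * POLICY_REPORT_ONLY: report the error and leave the block alone; any
 * repair is up to the client that owns the data.
 * POLICY_AUTO_REPAIR: (a) try a forced reassign, (b) if that failed,
 * write zeros into the block.
 * Assumption: the read is reported as EIO in both cases.
 */
static struct outcome handle_read_failure(enum policy p, uint8_t asc,
                                          uint8_t ascq, bool reassign_failed)
{
    struct outcome out = { STEP_NONE, EIO };

    if (p == POLICY_AUTO_REPAIR && is_unrecovered_read_rewrite(asc, ascq))
        out.step = reassign_failed ? STEP_WRITE_ZEROS : STEP_REASSIGN;
    return out;
}
```

In this sketch the only difference between the two policies is whether the
driver touches the block on its own; what the caller sees is the same.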
--
Karl Denninger (karl@MCS.Net)| MCSNet - Serving Chicagoland and Wisconsin
http://www.mcs.net/ | T1's from $600 monthly to FULL DS-3 Service
| NEW! K56Flex support on ALL modems
Voice: [+1 312 803-MCS1 x219]| EXCLUSIVE NEW FEATURE ON ALL PERSONAL ACCOUNTS
Fax: [+1 312 803-4929] | *SPAMBLOCK* Technology now included at no cost
