From owner-freebsd-hackers  Wed Feb 25 13:11:08 1998
Return-Path: <owner-freebsd-hackers@FreeBSD.ORG>
Received: (from majordom@localhost)
          by hub.freebsd.org (8.8.8/8.8.8) id NAA18984
          for freebsd-hackers-outgoing; Wed, 25 Feb 1998 13:11:08 -0800 (PST)
          (envelope-from owner-freebsd-hackers@FreeBSD.ORG)
Received: from Kitten.mcs.com (Kitten.mcs.com [192.160.127.90])
          by hub.freebsd.org (8.8.8/8.8.8) with ESMTP id NAA18906
          for <hackers@freebsd.org>; Wed, 25 Feb 1998 13:10:54 -0800 (PST)
          (envelope-from karl@Mars.mcs.net)
Received: from Mars.mcs.net (karl@Mars.mcs.net [192.160.127.85]) by Kitten.mcs.com (8.8.7/8.8.2) with ESMTP id PAA27295; Wed, 25 Feb 1998 15:10:43 -0600 (CST)
Received: (from karl@localhost) by Mars.mcs.net (8.8.7/8.8.2) id PAA19930; Wed, 25 Feb 1998 15:10:42 -0600 (CST)
Message-ID: <19980225151042.07314@mcs.net>
Date: Wed, 25 Feb 1998 15:10:42 -0600
From: Karl Denninger  <karl@mcs.net>
To: "Justin T. Gibbs" <gibbs@plutotech.com>
Cc: hackers@FreeBSD.ORG
Subject: Re: SCSI Sense ASC 11, ASCQ 0x0c - Unrecovered read errors
References: <19980224105842.07731@mcs.net> <199802252057.NAA24440@pluto.plutotech.com>
Mime-Version: 1.0
Content-Type: text/plain; charset=us-ascii
X-Mailer: Mutt 0.84
In-Reply-To: <199802252057.NAA24440@pluto.plutotech.com>; from Justin T. Gibbs on Wed, Feb 25, 1998 at 01:54:08PM -0700
Sender: owner-freebsd-hackers@FreeBSD.ORG
Precedence: bulk
X-Loop: FreeBSD.ORG

On Wed, Feb 25, 1998 at 01:54:08PM -0700, Justin T. Gibbs wrote:
> 
> In article <19980224105842.07731@mcs.net> you wrote:
> > Hi folks,
> > 
> > I have a question...
> > 
> > Right now, as the driver stands, if you get a sense return on a disk of
> > 0x11,0x0c ("Unrecovered read error - recommend rewrite the block"), the
> > driver does not attempt to do anything about it.
> > 
> > Why?
> > 
> > You're screwed in this case - the data is gone.  But, some RAID controllers
> > (notably the CMD adapters) will *FIX* such an error if you write back to the
> > block.
> 
> Is the data really gone?  Isn't that for the user to decide?  I've known
> disks to report temporary media errors that "disappear" after they are
> moved, the temperature changes, or the moon goes full.
> 
> > Here's the scenario:
> > 
> > 1)	You have a failure on a data drive.  It gets reported back with
> > 	sense ASC 0x11, ASCQ 0x0c.
> > 
> > 2)	The driver does not attempt to do anything other than report the
> > 	error.  
> 
> I don't believe that it is the driver's responsibility to take action
> in this case.
> 
> > This sounds like bogus behavior to me.  Here's why:
> > 
> > 	You've ALREADY lost the data.
> 
> This is arguable.
> 
> > 	There is no harm in trying to 
> > 	"fix it".  Thus, why not do the following:
> 
> How does the driver know what it means to fix it?  If the bad block
> is in the MBR, the system may well have this information somewhere
> in core to restore the data.  If the data is in the filesystem,
> writing one pattern might cause the FS to crash the kernel or
> confuse fsck, while another may minimize damage.  If the "client"
> of the driver is not going to get the data it expects, an error
> should be returned, period.
> 
> > 	a)	Attempt a forced reassign of the block.
> > 	b)	If that FAILS, write zeros into the block.
> > 
> > Why do these things you ask?  Simple:
> > 
> > 1)	The error, if repeated (or even singly) may cause a panic.  If it's
> > 	in a swap area, for example, you're screwed - you're probably
> > 	reading back a page of an executable from the paging space, and if
> > 	it's corrupted you're going down.
> 
> The swap pager should terminate the program(s) needing that block
> if it receives an I/O error.  This should not panic the system.
> 
> If, on the other hand, you remap the block, and silently return garbage
> data, you may well cause behavior that is unrecoverable.
> 
> > 2)	If it's a data file you MIGHT die.  There's no way to know.
> 
> And the FS may be able to clean up its data structures to minimize
> the effect of a missing/corrupted block of data if you tell it that
> the read operation failed.  If you remap it and return garbage, who
> knows what will happen.
> 
> > 3)	IF YOU DON'T "FIX" IT, YOU WILL GET KILLED EVENTUALLY.
> 
> This need not be the case.
> 
> > With a regular disk, (a) above will succeed.  You may still crash, but at
> > least you should come back up.  If the data was a file, it's gone anyway -
> > likewise for a directory.  There is no harm in trying to prevent FUTURE
> > errors at that point.
> 
> I have no problem with the client of the data taking some action to clear
> an I/O error.  There may even need to be an additional API to do this,
> but the disk driver does not have sufficient information to make the
> decision on how to perform that recovery.  The only safe thing is to
> report the error until some external action is taken.
> 
> If the system is not properly dealing with EIO conditions, that is
> certainly a bug, but your suggested fix is not a correct solution.

This condition, right now, causes a *HANG* if it happens on *DATA FILES*.

Not a panic, not an error return, not termination of the offending process.
A hard system hang which requires a RESET to recover from.
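
For anyone following along, here is a rough sketch (Python, purely
illustrative) of the split being argued over above: the read path reports
EIO for sense 0x11/0x0c and does nothing else, while a separate routine that
the client chooses to call tries steps (a) and (b), reassign first and
zero-fill if that fails.  FakeDisk, recover_block and the 512-byte block
size are all made up for the example; none of this is the actual driver
code or any FreeBSD interface.

```python
import errno

ASC_UNRECOVERED_READ = 0x11    # sense values discussed in this thread
ASCQ_RECOMMEND_REWRITE = 0x0C
BLOCK_SIZE = 512               # assumed block size, for illustration only


class FakeDisk:
    """Toy stand-in for a disk; hypothetical, not a real driver interface."""

    def __init__(self, nblocks, bad=(), can_reassign=True):
        self.blocks = [bytes(BLOCK_SIZE)] * nblocks
        self.bad = set(bad)
        self.can_reassign = can_reassign

    def sense_for(self, lba):
        """Return (ASC, ASCQ) for a failing block, or None if it reads fine."""
        if lba in self.bad:
            return (ASC_UNRECOVERED_READ, ASCQ_RECOMMEND_REWRITE)
        return None

    def read(self, lba):
        # Policy from the quoted reply: report the error, do not repair.
        if self.sense_for(lba) is not None:
            raise OSError(errno.EIO, "unrecovered read error at block %d" % lba)
        return self.blocks[lba]

    def reassign(self, lba):
        """Pretend block reassignment; returns True on success."""
        if self.can_reassign:
            self.bad.discard(lba)
        return self.can_reassign

    def write(self, lba, data):
        self.blocks[lba] = data
        self.bad.discard(lba)  # models a controller that repairs on rewrite


def recover_block(disk, lba):
    """Client-initiated recovery: (a) reassign, else (b) write zeros."""
    if disk.reassign(lba):
        return "reassigned"
    disk.write(lba, bytes(BLOCK_SIZE))
    return "zero-filled"


if __name__ == "__main__":
    disk = FakeDisk(8, bad={3}, can_reassign=False)
    try:
        disk.read(3)
    except OSError as e:
        # The client sees EIO and decides for itself what to do next.
        print(e.errno == errno.EIO, recover_block(disk, 3))
```

With can_reassign=False the client falls through to the zero-fill step, and
a later read of that block succeeds.  Whether zeros are acceptable contents
is exactly the decision that would rest with the client here, not with the
read path.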

-- 
Karl Denninger (karl@MCS.Net)| MCSNet - Serving Chicagoland and Wisconsin
http://www.mcs.net/          | T1's from $600 monthly to FULL DS-3 Service
			     | NEW! K56Flex support on ALL modems
Voice: [+1 312 803-MCS1 x219]| EXCLUSIVE NEW FEATURE ON ALL PERSONAL ACCOUNTS
Fax:   [+1 312 803-4929]     | *SPAMBLOCK* Technology now included at no cost