From owner-freebsd-hackers Wed Feb 25 13:11:08 1998
Message-ID: <19980225151042.07314@mcs.net>
Date: Wed, 25 Feb 1998 15:10:42 -0600
From: Karl Denninger
To: "Justin T. Gibbs"
Cc: hackers@FreeBSD.ORG
Subject: Re: SCSI Sense ASC 11, ASCQ 0x0c - Unrecovered read errors
References: <19980224105842.07731@mcs.net> <199802252057.NAA24440@pluto.plutotech.com>
Mime-Version: 1.0
Content-Type: text/plain; charset=us-ascii
X-Mailer: Mutt 0.84
In-Reply-To: <199802252057.NAA24440@pluto.plutotech.com>; from Justin T. Gibbs on Wed, Feb 25, 1998 at 01:54:08PM -0700
Sender: owner-freebsd-hackers@FreeBSD.ORG
Precedence: bulk
X-Loop: FreeBSD.ORG

On Wed, Feb 25, 1998 at 01:54:08PM -0700, Justin T. Gibbs wrote:
>
> In article <19980224105842.07731@mcs.net> you wrote:
> > Hi folks,
> >
> > I have a question...
> >
> > Right now, as the driver stands, if you get a sense return on a disk of
> > 0x11,0x0c ("Unrecovered read error - recommend rewrite the block"), the
> > driver does not attempt to do anything about it.
> >
> > Why?
> >
> > You're screwed in this case - the data is gone. But, some RAID controllers
> > (notably the CMD adapters) will *FIX* such an error if you write back to the
> > block.
>
> Is the data really gone? Isn't that for the user to decide?
> I've known disks to report temporary media errors that "disappear"
> after they are moved, the temperature changes, or the moon goes full.
>
> > Here's the scenario:
> >
> > 1) You have a failure on a data drive. It gets reported back with
> >    sense ASC 0x11, ASCQ 0x0c.
> >
> > 2) The driver does not attempt to do anything other than report the
> >    error.
>
> I don't believe that it is the driver's responsibility to take action
> in this case.
>
> > This sounds like bogus behavior to me. Here's why:
> >
> > You've ALREADY lost the data.
>
> This is arguable.
>
> > There is no harm in trying to
> > "fix it". Thus, why not do the following:
>
> How does the driver know what it means to fix it? If the bad block
> is in the MBR, the system may well have this information somewhere
> in core to restore the data. If the data is in the filesystem,
> writing one pattern might cause the FS to crash the kernel or
> confuse fsck, while another may minimize damage. If the "client"
> of the driver is not going to get the data it expects, an error
> should be returned, period.
>
> > a) Attempt a forced reassign of the block.
> > b) If that FAILS, write zeros into the block.
> >
> > Why do these things you ask? Simple:
> >
> > 1) The error, if repeated (or even singly) may cause a panic. If it's
> >    in a swap area, for example, you're screwed - you're probably
> >    reading back a page of an executable from the paging space, and if
> >    it's corrupted you're going down.
>
> The swap pager should terminate the program(s) needing that block
> if it receives an I/O error. This should not panic the system.
>
> If, on the other hand, you remap the block, and silently return garbage
> data, you may well cause behavior that is unrecoverable.
>
> > 2) If it's a data file you MIGHT die. There's no way to know.
>
> And the FS may be able to clean up its data structures to minimize
> the effect of a missing/corrupted block of data if you tell it that
> the read operation failed.
> If you remap it and return garbage, who knows what will happen.
>
> > 3) IF YOU DON'T "FIX" IT, YOU WILL GET KILLED EVENTUALLY.
>
> This need not be the case.
>
> > With a regular disk, (a) above will succeed. You may still crash, but at
> > least you should come back up. If the data was a file, it's gone anyway -
> > likewise for a directory. There is no harm in trying to prevent FUTURE
> > errors at that point.
>
> I have no problem with the client of the data taking some action to clear
> an I/O error. There may even need to be an additional API to do this,
> but the disk driver does not have sufficient information to make the
> decision on how to perform that recovery. The only safe thing is to
> report the error until some external action is taken.
>
> If the system is not properly dealing with EIO conditions, that is
> certainly a bug, but your suggested fix is not a correct solution.

This condition, right now, causes a *HANG* if it happens on *DATA FILES*.

Not a panic, not an error return, not termination of the offending process.

A hard system crash which requires a RESET to recover from.

--
Karl Denninger (karl@MCS.Net)| MCSNet - Serving Chicagoland and Wisconsin
http://www.mcs.net/          | T1's from $600 monthly to FULL DS-3 Service
                             | NEW! K56Flex support on ALL modems
Voice: [+1 312 803-MCS1 x219]| EXCLUSIVE NEW FEATURE ON ALL PERSONAL ACCOUNTS
Fax:   [+1 312 803-4929]     | *SPAMBLOCK* Technology now included at no cost