From owner-freebsd-hackers  Tue Feb 24 08:59:03 1998
Return-Path: <owner-freebsd-hackers@FreeBSD.ORG>
Received: (from majordom@localhost)
          by hub.freebsd.org (8.8.8/8.8.8) id IAA00668
          for freebsd-hackers-outgoing; Tue, 24 Feb 1998 08:59:03 -0800 (PST)
          (envelope-from owner-freebsd-hackers@FreeBSD.ORG)
Received: from Kitten.mcs.com (Kitten.mcs.com [192.160.127.90])
          by hub.freebsd.org (8.8.8/8.8.8) with ESMTP id IAA00659
          for <hackers@freebsd.org>; Tue, 24 Feb 1998 08:58:53 -0800 (PST)
          (envelope-from karl@Mars.mcs.net)
Received: from Mars.mcs.net (karl@Mars.mcs.net [192.160.127.85]) by Kitten.mcs.com (8.8.7/8.8.2) with ESMTP id KAA25798 for <hackers@freebsd.org>; Tue, 24 Feb 1998 10:58:42 -0600 (CST)
Received: (from karl@localhost) by Mars.mcs.net (8.8.7/8.8.2) id KAA19022; Tue, 24 Feb 1998 10:58:42 -0600 (CST)
Message-ID: <19980224105842.07731@mcs.net>
Date: Tue, 24 Feb 1998 10:58:42 -0600
From: Karl Denninger  <karl@mcs.net>
To: hackers@FreeBSD.ORG
Subject: SCSI Sense ASC 11, ASCQ 0x0c - Unrecovered read errors
Mime-Version: 1.0
Content-Type: text/plain; charset=us-ascii
X-Mailer: Mutt 0.84
Sender: owner-freebsd-hackers@FreeBSD.ORG
Precedence: bulk
X-Loop: FreeBSD.ORG

Hi folks,

I have a question...

Right now, as the driver stands, if you get a sense return on a disk of
ASC 0x11, ASCQ 0x0c ("Unrecovered read error - recommend rewrite the data"),
the driver does not attempt to do anything about it.

Why?

You're screwed in this case - the data is gone.  But, some RAID controllers
(notably the CMD adapters) will *FIX* such an error if you write back to the
block.

Here's the scenario:

1)	You have a failure on a data drive.  It gets reported back with
	sense ASC 0x11, ASCQ 0x0c.

2)	The driver does not attempt to do anything other than report the
	error.  
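
To make the scenario concrete, here is a minimal sketch of how a driver
could recognize this condition in fixed-format sense data.  The byte
offsets (ASC at byte 12, ASCQ at byte 13, information field at bytes 3-6)
come from the SCSI fixed sense format; the function and macro names are
hypothetical and are not taken from the existing driver.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define ASC_UNRECOVERED_READ	0x11	/* additional sense code */
#define ASCQ_RECOMMEND_REWRITE	0x0c	/* additional sense code qualifier */

/*
 * Hypothetical helper: returns 1 if the fixed-format sense data reports
 * a current error with ASC 0x11 / ASCQ 0x0c, and 0 otherwise.
 */
static int
sense_recommends_rewrite(const uint8_t *sense, size_t len)
{
	assert(sense != NULL);

	if (len < 14)				/* need bytes 12 and 13 */
		return (0);
	if ((sense[0] & 0x7f) != 0x70)		/* current error, fixed format */
		return (0);
	return (sense[12] == ASC_UNRECOVERED_READ &&
	    sense[13] == ASCQ_RECOMMEND_REWRITE);
}

/*
 * Hypothetical helper: extracts the failing block address from the
 * information field.  Returns 1 and fills *lba only when the VALID bit
 * (bit 7 of byte 0) says the field is meaningful.
 */
static int
sense_failed_lba(const uint8_t *sense, size_t len, uint32_t *lba)
{
	assert(sense != NULL && lba != NULL);

	if (len < 7 || (sense[0] & 0x80) == 0)
		return (0);
	*lba = ((uint32_t)sense[3] << 24) | ((uint32_t)sense[4] << 16) |
	    ((uint32_t)sense[5] << 8) | (uint32_t)sense[6];
	return (1);
}
```

A driver would only consider the recovery steps when both helpers
succeed, since it needs to know which block to reassign or rewrite.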

This sounds like bogus behavior to me.  Here's why:

	You've ALREADY lost the data.  There is no harm in trying to 
	"fix it".  Thus, why not do the following:

	a)	Attempt a forced reassign of the block.
	b)	If that FAILS, write zeros into the block.
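
Steps (a) and (b) could be sketched roughly as follows.  This is only an
illustration of the proposed policy, not existing driver code: the
repair_ops callbacks, the 512-byte sector size, and the stub functions
are all assumptions made for the example.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define SECTOR_SIZE	512	/* assumed block size */

/* Hypothetical hooks a driver would supply; each returns 0 on success. */
struct repair_ops {
	int (*reassign)(void *dev, uint32_t lba);
	int (*write)(void *dev, uint32_t lba, const uint8_t *buf, size_t len);
};

enum repair_result { REPAIR_REASSIGNED, REPAIR_ZEROED, REPAIR_FAILED };

/*
 * The data at lba is already lost, so try to make the block readable
 * again: (a) force a reassign, (b) if that fails, write zeros.
 */
static enum repair_result
repair_unreadable_block(const struct repair_ops *ops, void *dev, uint32_t lba)
{
	static const uint8_t zeros[SECTOR_SIZE];	/* zero-initialized */

	assert(ops != NULL);

	if (ops->reassign(dev, lba) == 0)
		return (REPAIR_REASSIGNED);
	if (ops->write(dev, lba, zeros, sizeof(zeros)) == 0)
		return (REPAIR_ZEROED);
	return (REPAIR_FAILED);
}

/* Illustrative stubs: a plain disk accepts the reassign ... */
static int
stub_reassign_ok(void *dev, uint32_t lba)
{
	(void)dev; (void)lba;
	return (0);
}

/* ... while a RAID controller typically refuses it. */
static int
stub_reassign_refused(void *dev, uint32_t lba)
{
	(void)dev; (void)lba;
	return (-1);
}

static int
stub_write_ok(void *dev, uint32_t lba, const uint8_t *buf, size_t len)
{
	(void)dev; (void)lba; (void)buf; (void)len;
	return (0);
}
```

With the stubs above, a "disk" that accepts reassignment ends at
REPAIR_REASSIGNED, and a "controller" that refuses it falls through to
REPAIR_ZEROED.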

Why do these things, you ask?  Simple:

1)	The error, if repeated (or even just once), may cause a panic.  If
	it's in a swap area, for example, you're screwed - you're probably
	reading back a page of an executable from the paging space, and if
	it's corrupted you're going down.

2)	If it's a data file you MIGHT die.  There's no way to know.

3)	IF YOU DON'T "FIX" IT, YOU WILL GET KILLED EVENTUALLY.

With a regular disk, (a) above will succeed.  You may still crash, but at
least you should come back up.  If the data was a file, it's gone anyway -
likewise for a directory.  There is no harm in trying to prevent FUTURE
errors at that point.
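
For reference, the forced reassign in (a) would presumably be issued as
a SCSI REASSIGN BLOCKS command (opcode 0x07) carrying a short defect
list.  The sketch below builds the 6-byte CDB and the parameter data for
a single block; the layout follows the short defect-list format (4-byte
header, then 4-byte big-endian block addresses), and the function name
is hypothetical.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define REASSIGN_BLOCKS_OPCODE	0x07
#define REASSIGN_CDB_LEN	6
#define REASSIGN_PARAM_LEN	8	/* 4-byte header + one 4-byte LBA */

/*
 * Hypothetical helper: fills in the CDB and the defect list needed to
 * ask the target to reassign a single logical block.
 */
static void
build_reassign_one(uint32_t lba, uint8_t cdb[REASSIGN_CDB_LEN],
    uint8_t param[REASSIGN_PARAM_LEN])
{
	size_t i;

	assert(cdb != NULL && param != NULL);

	for (i = 0; i < REASSIGN_CDB_LEN; i++)
		cdb[i] = 0;
	cdb[0] = REASSIGN_BLOCKS_OPCODE;

	/* Defect list header: bytes 0-1 reserved, bytes 2-3 list length. */
	param[0] = 0;
	param[1] = 0;
	param[2] = 0;
	param[3] = 4;			/* one 4-byte descriptor follows */

	/* Defect descriptor: the failing LBA, most significant byte first. */
	param[4] = (uint8_t)(lba >> 24);
	param[5] = (uint8_t)(lba >> 16);
	param[6] = (uint8_t)(lba >> 8);
	param[7] = (uint8_t)lba;
}
```

The parameter data is sent to the target in the data-out phase of the
command; how the command is queued depends on the driver and is left out
here.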

If you have a RAID adapter, you got the error because BOTH the parity and
primary data were unreadable.  (a) will probably fail; most RAID controllers
refuse reassignment.  Writing zeros will be buffered in the controller 
though - now the error is "gone".  A rebuild (which should already be in
progress) now fixes the error *permanently*.

What am I missing here, and why isn't this done?

-- 
Karl Denninger (karl@MCS.Net)| MCSNet - Serving Chicagoland and Wisconsin
http://www.mcs.net/          | T1's from $600 monthly to FULL DS-3 Service
			     | NEW! K56Flex support on ALL modems
Voice: [+1 312 803-MCS1 x219]| EXCLUSIVE NEW FEATURE ON ALL PERSONAL ACCOUNTS
Fax:   [+1 312 803-4929]     | *SPAMBLOCK* Technology now included at no cost
