Message-ID: <19980224105842.07731@mcs.net>
Date: Tue, 24 Feb 1998 10:58:42 -0600
From: Karl Denninger
To: hackers@FreeBSD.ORG
Subject: SCSI Sense ASC 11, ASCQ 0x0c - Unrecovered read errors

Hi folks,

I have a question...

Right now, as the driver stands, if you get a sense return on a disk of
ASC 0x11, ASCQ 0x0c ("Unrecovered read error - recommend rewrite the
block"), the driver does not attempt to do anything about it.

Why?

You're screwed in this case - the data is gone. But some RAID controllers
(notably the CMD adapters) will *FIX* such an error if you write back to
the block.

Here's the scenario:

1) You have a failure on a data drive. It gets reported back with sense
   ASC 0x11, ASCQ 0x0c.

2) The driver does not attempt to do anything other than report the error.

This sounds like bogus behavior to me. Here's why:

You've ALREADY lost the data. There is no harm in trying to "fix it".

Thus, why not do the following:

a) Attempt a forced reassign of the block.

b) If that FAILS, write zeros into the block.

Why do these things, you ask?

Simple:

1) The error, if repeated (or even singly), may cause a panic. If it's in
   a swap area, for example, you're screwed - you're probably reading back
   a page of an executable from the paging space, and if it's corrupted
   you're going down.

2) If it's a data file you MIGHT die. There's no way to know.

3) IF YOU DON'T "FIX" IT, YOU WILL GET KILLED EVENTUALLY.

With a regular disk, (a) above will succeed. You may still crash, but at
least you should come back up. If the data was a file, it's gone anyway -
likewise for a directory. There is no harm in trying to prevent FUTURE
errors at that point.

If you have a RAID adapter, you got the error because BOTH the parity and
primary data were unreadable. (a) will probably fail; most RAID
controllers refuse reassignment. Writing zeros will be buffered in the
controller though - now the error is "gone". A rebuild (which should
already be in progress) now fixes the error *permanently*.

What am I missing here, and why isn't this done?

--
--
Karl Denninger (karl@MCS.Net)| MCSNet - Serving Chicagoland and Wisconsin
http://www.mcs.net/          | T1's from $600 monthly to FULL DS-3 Service
                             | NEW! K56Flex support on ALL modems
Voice: [+1 312 803-MCS1 x219]| EXCLUSIVE NEW FEATURE ON ALL PERSONAL ACCOUNTS
Fax:   [+1 312 803-4929]     | *SPAMBLOCK* Technology now included at no cost
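
The recovery policy proposed in (a) and (b) can be sketched in C roughly as
follows. This is only an illustration under stated assumptions, not FreeBSD
driver code: struct disk_ops, recover_unreadable_block(), the 512-byte
block size and the two simulated devices are all hypothetical names made up
for the example. A real driver would issue the actual SCSI REASSIGN BLOCKS
and WRITE commands where the callbacks are invoked.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

#define ASC_UNRECOVERED_READ   0x11  /* sense code discussed in the post */
#define ASCQ_RECOMMEND_REWRITE 0x0c  /* qualifier discussed in the post  */
#define BLOCK_SIZE             512   /* assumed block size               */

enum recovery_result {
    RECOVERY_NOT_ATTEMPTED,  /* sense data was not 0x11/0x0c          */
    RECOVERY_REASSIGNED,     /* step (a) worked                        */
    RECOVERY_ZEROED,         /* step (a) failed, step (b) worked       */
    RECOVERY_FAILED          /* neither step worked                    */
};

/* Hypothetical device hooks; each returns true on success. */
struct disk_ops {
    bool (*reassign_block)(void *dev, uint32_t lba);
    bool (*write_block)(void *dev, uint32_t lba,
                        const uint8_t *buf, size_t len);
};

/* The data is already lost, so try (a) reassign, then (b) write zeros. */
enum recovery_result
recover_unreadable_block(const struct disk_ops *ops, void *dev,
                         uint8_t asc, uint8_t ascq, uint32_t lba)
{
    static const uint8_t zeros[BLOCK_SIZE];  /* zero-initialized */

    if (asc != ASC_UNRECOVERED_READ || ascq != ASCQ_RECOMMEND_REWRITE)
        return RECOVERY_NOT_ATTEMPTED;

    if (ops->reassign_block(dev, lba))
        return RECOVERY_REASSIGNED;

    if (ops->write_block(dev, lba, zeros, sizeof zeros))
        return RECOVERY_ZEROED;

    return RECOVERY_FAILED;
}

/* Simulated devices, for illustration only. */
static bool sim_reassign_ok(void *dev, uint32_t lba)
{
    (void)dev; (void)lba;
    return true;   /* a plain disk accepts the reassignment */
}

static bool sim_reassign_refused(void *dev, uint32_t lba)
{
    (void)dev; (void)lba;
    return false;  /* a RAID controller refusing reassignment */
}

static bool sim_write_ok(void *dev, uint32_t lba,
                         const uint8_t *buf, size_t len)
{
    (void)dev; (void)lba; (void)buf; (void)len;
    return true;   /* the write is accepted (e.g. buffered by the controller) */
}

static const struct disk_ops plain_disk_ops = { sim_reassign_ok, sim_write_ok };
static const struct disk_ops raid_ops = { sim_reassign_refused, sim_write_ok };
```

With the simulated plain disk the function stops at step (a); with the
simulated RAID controller it falls through to step (b), which matches the
two cases described above. Any other sense code is left alone.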