From owner-freebsd-scsi  Sat Feb  1 07:20:55 1997
Return-Path: <owner-freebsd-scsi>
Received: (from root@localhost)
          by freefall.freebsd.org (8.8.5/8.8.5) id HAA26321
          for freebsd-scsi-outgoing; Sat, 1 Feb 1997 07:20:55 -0800 (PST)
Received: from sax.sax.de (sax.sax.de [193.175.26.33])
          by freefall.freebsd.org (8.8.5/8.8.5) with SMTP id HAA26311
          for <freebsd-scsi@freebsd.org>; Sat, 1 Feb 1997 07:20:48 -0800 (PST)
Received: (from uucp@localhost) by sax.sax.de (8.6.12/8.6.12-s1) with UUCP id QAA07013; Sat, 1 Feb 1997 16:20:42 +0100
Received: (from j@localhost) by uriah.heep.sax.de (8.8.5/8.6.9) id QAA06590; Sat, 1 Feb 1997 16:03:46 +0100 (MET)
Message-ID: <Mutt.19970201160346.j@uriah.heep.sax.de>
Date: Sat, 1 Feb 1997 16:03:46 +0100
From: j@uriah.heep.sax.de (J Wunsch)
To: Don.Lewis@tsc.tdk.com (Don Lewis)
Cc: freebsd-scsi@freebsd.org
Subject: Re: SCSI disk MEDIUM ERROR with a few twists
References: <199702011424.GAA28908@salsa.gv.tsc.tdk.com>
X-Mailer: Mutt 0.55-PL10
Mime-Version: 1.0
X-Phone: +49-351-2012 669
X-PGP-Fingerprint: DC 47 E6 E4 FF A6 E9 8F  93 21 E0 7D F9 12 D6 4E
Reply-To: joerg_wunsch@uriah.heep.sax.de (Joerg Wunsch)
In-Reply-To: <199702011424.GAA28908@salsa.gv.tsc.tdk.com>; from Don Lewis on Feb 1, 1997 06:24:59 -0800
Sender: owner-freebsd-scsi@freebsd.org
X-Loop: FreeBSD.org
Precedence: bulk

As Don Lewis wrote:

> } It could be the drive itself.
> 
> The MEDIUM ERROR itself and the falling offline a week or so later
> are definitely the fault of the drive.  That the error wasn't reported
> to userland lies somewhere between the driver and userland, inclusive.

See my other mail.  For buffered (filesystem) writes, it's no
surprise that the error goes unreported.  Reads, however, should
always report it.

> Jan 18 04:30:33 news /kernel: sd0(ahc0:0:0): MEDIUM ERROR info:14683a asc:11,0 Unrecovered read error field replaceable unit: ea sks:80,11

> Always the same info:#.

Which means: always the same block # (in hex).
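
A minimal sketch of that conversion, assuming the "info:<hex>" layout
seen in the quoted syslog line.  The helper name and the regular
expression are made up for illustration, and the 512-byte sector size
is an assumption, not something stated in the log:

```python
import re

# Hypothetical helper: assumes the "info:<hex>" field layout of the quoted
# syslog line; other drivers or kernel versions may print it differently.
def info_block(line):
    """Return the block number from the info field as an int, or None."""
    m = re.search(r"\binfo:([0-9a-fA-F]+)", line)
    return int(m.group(1), 16) if m else None

LINE = ("Jan 18 04:30:33 news /kernel: sd0(ahc0:0:0): MEDIUM ERROR "
        "info:14683a asc:11,0 Unrecovered read error "
        "field replaceable unit: ea sks:80,11")

block = info_block(LINE)
print(block)         # 1337402 (0x14683a in decimal)
print(block * 512)   # 684749824, byte offset assuming 512-byte sectors
```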

> I also can't quote messages from its death throes before it wedged,
> because this disk also contains /var and nothing was syslogged until
> after I got the machine running multi-user again.  I *think* the message
> was: "Logical unit is in process of becoming ready", but if so it was
> lying.

Btw., you should no longer see this error message now.  This case is
retried forever, until it either turns into a `real' error, or
eventually succeeds.

> It gave me at least two weeks warning last time.  If it gets sick again,
> then I can at least file a more complete report ;-)  Are there any
> experiments you want me to try?

Well, you could then find out why the read error isn't reported to
userland. :-)

> } Also, go through SCSI reformatting it.  This will cause the drive to
> } recreate the bad sector table as necessary.  You can even do this
> } without using the adapter BIOS, there's always /sbin/scsiformat for
> } this.
> 
> The painful part is that this is the root disk, and I'm pretty sure the
> 2.1.x fixit disk doesn't contain scsiformat.

scsiformat is simple:

	scsi -s 7200 -f /dev/rsdX.ctl -c "4 0 0 0 0 0"

(Put it into the background from the start if you prefer: once it is
running, you can't break it with ^Z.)
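
The six bytes passed with -c are a SCSI command descriptor block whose
first byte, 4, is the FORMAT UNIT opcode; -s should be the command
timeout in seconds, if the scsi(8) manual is remembered correctly.  A
small sketch of how that byte string is put together, with field names
taken from the SCSI-2 FORMAT UNIT layout (treat them as an assumption
and check the spec).  Both function names are hypothetical, and
nothing here talks to a drive:

```python
# Builds the 6-byte FORMAT UNIT CDB; layout assumed from SCSI-2:
#   byte 0: opcode 0x04
#   byte 1: FmtData (bit 4), CmpLst (bit 3), defect list format (bits 2-0)
#   byte 2: vendor specific
#   bytes 3-4: interleave (big-endian)
#   byte 5: control
def format_unit_cdb(fmt_data=False, cmp_lst=False, defect_fmt=0, interleave=0):
    byte1 = (0x10 if fmt_data else 0) | (0x08 if cmp_lst else 0) | (defect_fmt & 0x07)
    return bytes([0x04, byte1, 0x00,
                  (interleave >> 8) & 0xFF, interleave & 0xFF, 0x00])

def as_scsi_arg(cdb):
    """Render a CDB as space-separated hex bytes, as used after -c above."""
    return " ".join(format(b, "x") for b in cdb)

print(as_scsi_arg(format_unit_cdb()))   # 4 0 0 0 0 0
```

With all fields left at zero, no defect list is sent along and the
drive formats with its own defaults.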

>  Doesn't remapping the sector
> add the original to the drive's grown defect list?

Yes, but IMHO reformatting often does a more complete check, so if an
adjacent sector is flaky, it is more likely to be added to the list
as well.

We need a remapping tool as well.  Has anybody here ever dealt with
defect list management?  Since we already know the block number (from
the info field in the syslog message), it should be easy to add it to
the defect list.
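
One way such a tool might work is the SCSI REASSIGN BLOCKS command
(opcode 7); that choice is an assumption here, not something settled
in this thread.  The sketch below only builds the parameter list for
the block from the syslog message, following the SCSI-2 layout as
understood: a 4-byte header (two reserved bytes, then the list length
in bytes) followed by one 4-byte big-endian block address per defect.
The names are hypothetical, and nothing is sent to a drive:

```python
import struct

# Assumed SCSI-2 REASSIGN BLOCKS CDB: opcode 0x07, remaining bytes zero.
REASSIGN_BLOCKS_CDB = bytes([0x07, 0, 0, 0, 0, 0])

def reassign_blocks_payload(lbas):
    """Build the defect list: 4-byte header plus one 4-byte LBA per entry."""
    body = b"".join(struct.pack(">I", lba) for lba in lbas)
    header = struct.pack(">HH", 0, len(body))   # reserved, list length in bytes
    return header + body

payload = reassign_blocks_payload([0x14683a])   # block from the info field
print(payload.hex())   # 000000040014683a
```

Actually issuing the command would destroy whatever the drive can
still read from that block, so a real tool would want to try to
salvage the data first.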

-- 
cheers, J"org

joerg_wunsch@uriah.heep.sax.de -- http://www.sax.de/~joerg/ -- NIC: JW11-RIPE
Never trust an operating system you don't have sources for. ;-)