From owner-freebsd-scsi  Sat Feb  1 00:11:09 1997
Date: Sat, 1 Feb 1997 00:11:02 -0800 (PST)
From: Don Lewis <Don.Lewis@tsc.tdk.com>
Message-Id: <199702010811.AAA28411@salsa.gv.tsc.tdk.com>
To: freebsd-fs@freebsd.org, freebsd-scsi@freebsd.org
Subject: SCSI disk MEDIUM ERROR with a few twists

I was recently bitten by a disk that developed a bad sector and am somewhat
disturbed about a few things.  First the vital statistics:
	FreeBSD 2.1.6
	Adaptec 2940UW
	Seagate ST31051N (Hawk)
	AWRE and ARRE are both enabled

This machine is our news server.  The disk in question holds /, /usr, and
the partition where the history file lives.  The latter partition is the
one that developed the problem.  Unknown to me, the problem cropped up
a couple weeks ago, which brings us to the first problem:

	/etc/daily doesn't report this

but these lists probably aren't the right place to report that.
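
For what it's worth, the kind of check a nightly job could run is small.
Here is a minimal sketch, not what /etc/daily actually does.  It assumes
the kernel messages end up in a text log and contain the keywords
mentioned in this message; the exact message layout isn't shown here, so
it matches on substrings only.  The name find_disk_errors is made up for
illustration.

```python
import re

# Assumed keywords: the ones mentioned in this message.  The exact kernel
# message layout is not known here, so matching is by substring only.
PATTERN = re.compile(r"MEDIUM ERROR|FAILURE|not ready", re.IGNORECASE)


def find_disk_errors(lines):
    """Return the log lines that mention one of the disk error keywords."""
    return [line.rstrip("\n") for line in lines if PATTERN.search(line)]
```

Feeding it the lines of the system log and mailing any hits to root would
have flagged the January 16 event the night it happened.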

This problem was logged, all the way to the point where FAILURE was reported
once on January 16.  It occurred a bunch of times on January 18.  Things
were quiet until January 28, when I noticed that the machine wasn't feeding
any news.  I had a bunch of rlogin sessions open to the machine from the
machine in my office, and when I tried to run any commands it responded with
a message indicating some sort of I/O error.  When I checked the machine's
console, it was complaining about sd0 being not-ready.  It decided to try to
reboot when I typed on the keyboard, but hung because the disk wasn't ready.
I power cycled the machine, and it started to boot but fsck decided that
the one partition was hosed.  I ran fsck manually, and things looked pretty
grim.  Fsck complained about bad blocks, and the kernel complained about
MEDIUM ERRORs (but I didn't think to write down the block numbers).  Some
of the messages from fsck made it pretty obvious that a number of inodes
had been overwritten with total garbage (preposterous file sizes, block
numbers way out of range), and the block numbers in either the inode or
an indirect block for the newsgroups file had been overwritten with
similar trash as well.  I ran fsck a few times answering "yes" until
things were clean.  The second problem is:

	During this final failure, something overwrote some number
	of good blocks with garbage data.

It could be the filesystem, the SCSI driver, or the drive firmware.

I then dump'ed everything on the disk in preparation for replacing it
because I thought it was toast.  During the process of dumping the
news partition, I got a kernel complaint about a MEDIUM ERROR, but dump
didn't complain.  I also saved this partition using tar, and I got a
MEDIUM ERROR when it was copying the history.pag file, but tar didn't
complain.  This brings us to the third problem:

	It appears that these errors aren't reported to userland

I don't know whether the SCSI code isn't reporting this to the filesystem,
or the filesystem isn't reporting this to userland code, but neither dump
nor tar seemed to see a problem.  Also, innd didn't seem to see a problem
even though it appears to do the proper checking.
It just seemed to accept duplicate articles on occasion, which I ended
up reporting to inn-bugs.  I guess I'll have to retract that bug report.
I looked at the SCSI code in -current, and its error handling seemed to
be similar, so I hope y'all are interested.
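
To make "reported to userland" concrete: if the error made it up through
the filesystem, I would expect a read of the affected block to fail with
EIO.  The sketch below is not what dump, tar, or innd do.  It is only a
hypothetical scanner (scan_file is a made-up name) that reads a regular
file block by block and records the offsets where the read fails with
EIO.  The read parameter exists only so that a failure can be simulated.

```python
import errno
import os


def scan_file(path, block_size=8192, read=os.read):
    """Read path block by block; return (bytes_read, failed_offsets)."""
    failed = []
    total = 0
    offset = 0
    # st_size is only meaningful for regular files, not raw devices.
    size = os.stat(path).st_size
    fd = os.open(path, os.O_RDONLY)
    try:
        while offset < size:
            os.lseek(fd, offset, os.SEEK_SET)
            try:
                data = read(fd, block_size)
            except OSError as exc:
                if exc.errno != errno.EIO:
                    raise
                # The kernel reported an I/O error: note it and skip ahead.
                failed.append(offset)
                offset += block_size
                continue
            if not data:
                break
            total += len(data)
            offset += len(data)
    finally:
        os.close(fd)
    return total, failed
```

If a scan like this over history.pag came back with an empty failure list
while the kernel was logging a MEDIUM ERROR, that would confirm the error
is being dropped somewhere below the read call.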

Before replacing the drive, I decided to run the Adaptec disk verification.
It found a grand total of one bad sector and remapped it.  The only
remaining damage was that fsck had deleted my newsgroups file and
history.pag had one formerly bad sector.  Since the disk didn't appear
to be hopeless, I replaced the newsgroups file and rebuilt history.pag,
and things have been working flawlessly ever since.

			---  Truck