From owner-freebsd-fs  Sat Feb  1 00:11:11 1997
Return-Path: <owner-fs>
Received: (from root@localhost)
          by freefall.freebsd.org (8.8.5/8.8.5) id AAA10064
          for fs-outgoing; Sat, 1 Feb 1997 00:11:11 -0800 (PST)
Received: from gatekeeper.tsc.tdk.com (root@gatekeeper.tsc.tdk.com [207.113.159.21])
          by freefall.freebsd.org (8.8.5/8.8.5) with ESMTP id AAA10047;
          Sat, 1 Feb 1997 00:11:06 -0800 (PST)
Received: from sunrise.gv.tsc.tdk.com (root@sunrise.gv.tsc.tdk.com [192.168.241.191])
          by gatekeeper.tsc.tdk.com (8.8.4/8.8.4) with ESMTP
	  id AAA15729; Sat, 1 Feb 1997 00:11:05 -0800 (PST)
Received: from salsa.gv.tsc.tdk.com (salsa.gv.tsc.tdk.com [192.168.241.194])
          by sunrise.gv.tsc.tdk.com (8.8.4/8.8.4) with ESMTP
	  id AAA21337; Sat, 1 Feb 1997 00:11:04 -0800 (PST)
Received: (from gdonl@localhost)
          by salsa.gv.tsc.tdk.com (8.8.4/8.8.4)
	  id AAA28411; Sat, 1 Feb 1997 00:11:02 -0800 (PST)
Date: Sat, 1 Feb 1997 00:11:02 -0800 (PST)
From: Don Lewis <Don.Lewis@tsc.tdk.com>
Message-Id: <199702010811.AAA28411@salsa.gv.tsc.tdk.com>
To: freebsd-fs@freebsd.org, freebsd-scsi@freebsd.org
Subject: SCSI disk MEDIUM ERROR with a few twists
Sender: owner-fs@freebsd.org
X-Loop: FreeBSD.org
Precedence: bulk

I was recently bitten by a disk that developed a bad sector and am somewhat
disturbed about a few things.  First the vital statistics:
	FreeBSD 2.1.6
	Adaptec 2940UW
	Seagate ST31051N (Hawk)
	AWRE and ARRE are both enabled

This machine is our news server.  The disk in question holds /, /usr, and
the partition where the history file lives.  The latter partition is the
one that developed the problem.  Unknown to me, the problem cropped up
a couple weeks ago, which brings us to the first problem:

	/etc/daily doesn't report this

but these lists probably aren't the right place to report that.

This problem was logged, all the way to the point where FAILURE was reported
once on January 16.  It occurred a bunch of times on January 18.  Things
were quiet until January 28, when I noticed that the machine wasn't feeding
any news.  I had a bunch of rlogin sessions open to the machine from the
machine in my office, and when I tried to run any commands it responded with
a message indicating some sort of I/O error.  When I checked the machine's
console, it was complaining about sd0 being not-ready.  It decided to try to
reboot when I typed on the keyboard, but hung because the disk wasn't ready.
I power cycled the machine, and it started to boot but fsck decided that
the one partition was hosed.  I ran fsck manually, and things looked pretty
grim.  Fsck complained about bad blocks, and the kernel complained about
MEDIUM ERRORs (but I didn't think to write down the block numbers).  Some
of the messages from fsck made it pretty obvious that a number of inodes
had been overwritten with total garbage (preposterous file sizes, block
numbers way out of range), and the block numbers in either the inode or
an indirect block for the newsgroups file had been overwritten with
similar trash as well.  I ran fsck a few times answering "yes" until
things were clean.  The second problem is:

	During this final failure, something overwrote some number
	of good blocks with garbage data.

It could be the filesystem, the SCSI driver, or the drive firmware.

I then dump'ed everything on the disk in preparation for replacing it
because I thought it was toast.  During the process of dumping the
news partition, I got a kernel complaint about a MEDIUM ERROR, but dump
didn't complain.  I also saved this partition using tar, and I got a
MEDIUM ERROR when it was copying the history.pag file, but tar didn't
complain.  This brings us to the third problem:

	It appears that these errors aren't reported to userland

I don't know whether the SCSI code isn't reporting this to the filesystem,
or the filesystem isn't reporting this to userland code, but dump didn't
seem to see a problem, tar didn't seem to see a problem.  Also innd didn't
seem to see a problem even though it appears to do the proper checking.
It just seemed to accept duplicate articles on occasion, which I ended
up reporting to inn-bugs.  I guess I'll have to retract that bug report.
I looked at the SCSI code in -current, and its error handling seemed to
be similar, so I hope y'all are interested.

Before replacing the drive, I decided to run the Adaptec disk verification.
It found a grand total of one bad sector and remapped it.  The only
remaining damage was that fsck had deleted my newsgroups file and
history.pag had one formerly bad sector.  Since the disk didn't appear
to be hopeless, I replaced the newsgroups file and rebuilt history.pag,
and things have been working flawlessly ever since.

			---  Truck

From owner-freebsd-fs  Sat Feb  1 05:51:09 1997
Return-Path: <owner-fs>
Received: (from root@localhost)
          by freefall.freebsd.org (8.8.5/8.8.5) id FAA21445
          for fs-outgoing; Sat, 1 Feb 1997 05:51:09 -0800 (PST)
Received: from sax.sax.de (sax.sax.de [193.175.26.33])
          by freefall.freebsd.org (8.8.5/8.8.5) with SMTP id FAA21440;
          Sat, 1 Feb 1997 05:51:03 -0800 (PST)
Received: (from uucp@localhost) by sax.sax.de (8.6.12/8.6.12-s1) with UUCP id OAA05267; Sat, 1 Feb 1997 14:50:44 +0100
Received: (from j@localhost) by uriah.heep.sax.de (8.8.5/8.6.9) id OAA17384; Sat, 1 Feb 1997 14:29:17 +0100 (MET)
Message-ID: <Mutt.19970201142917.j@uriah.heep.sax.de>
Date: Sat, 1 Feb 1997 14:29:17 +0100
From: j@uriah.heep.sax.de (J Wunsch)
To: Don.Lewis@tsc.tdk.com (Don Lewis)
Cc: freebsd-fs@freebsd.org, freebsd-scsi@freebsd.org
Subject: Re: SCSI disk MEDIUM ERROR with a few twists
References: <199702010811.AAA28411@salsa.gv.tsc.tdk.com>
X-Mailer: Mutt 0.55-PL10
Mime-Version: 1.0
X-Phone: +49-351-2012 669
X-PGP-Fingerprint: DC 47 E6 E4 FF A6 E9 8F  93 21 E0 7D F9 12 D6 4E
Reply-To: joerg_wunsch@uriah.heep.sax.de (Joerg Wunsch)
In-Reply-To: <199702010811.AAA28411@salsa.gv.tsc.tdk.com>; from Don Lewis on Feb 1, 1997 00:11:02 -0800
Sender: owner-fs@freebsd.org
X-Loop: FreeBSD.org
Precedence: bulk

As Don Lewis wrote:

(It would help if you could structure your report better.  It's
very hard to browse through; the paragraphs are so densely filled
that it's hard to pick out the essence of your problem.)

> 	/etc/daily doesn't report this

(and others don't report this)

Of course.  That's because buffered writes cannot report media errors
to their caller.  The caller has already got an OK indication about
the write operation, when the device driver finally notices the write
error.  All the driver can do at this point is syslog the problem.
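
A minimal sketch of what this means for a userland program, in Python
for brevity.  A plain write() only hands the data to the buffer cache,
so the one place a deferred write error can still come back to the
caller is an explicit fsync().  The function name write_and_sync is
made up for illustration, and whether a given kernel actually returns
the error from fsync() is an assumption that should be checked per
system.

```python
import os


def write_and_sync(path, data):
    """Write data to path and force it to the device.

    Returns 0 on success, or the errno of the first failure. A media
    error on a buffered write would typically show up as EIO from
    fsync(), not from write().
    """
    fd = os.open(path, os.O_WRONLY | os.O_CREAT | os.O_TRUNC, 0o600)
    try:
        offset = 0
        while offset < len(data):
            # write() may accept fewer bytes than requested; loop until done.
            offset += os.write(fd, data[offset:])
        # Block until the kernel has pushed the buffers to the device.
        os.fsync(fd)
    except OSError as exc:
        return exc.errno
    finally:
        os.close(fd)
    return 0
```

A program that skips the fsync() step sees success from every write()
and has no way to learn about the error short of reading the syslog.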

You ought to check your syslog regularly.  The easiest way is to drop
it onto all your logged in terminals :) (seriously, i do).

> It could be the filesystem, the SCSI driver, or the drive firmware.

It could be the drive itself.

What MEDIUM ERRORs are these?  You forgot to quote the most important
thing, the driver message.

> I don't know whether the SCSI code isn't reporting this to the filesystem,
> or the filesystem isn't reporting this to userland code, but dump didn't
> seem to see a problem, tar didn't seem to see a problem.

It's interesting to know that dump didn't see the problem, since dump
operates on the raw device, where error reporting is possible.  Are
you sure they were _unrecovered_ medium errors, i.e. the kernel didn't
successfully retry them?  Again, please *quote* the error messages,
instead of assuming we know them.
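
For comparison, a rough sketch of how a reader that goes block by block
can notice an unrecovered read error: the failed read comes back as an
OSError (EIO on most systems), and the offset can be recorded.  This is
Python, not what dump does internally.  The name
find_unreadable_blocks, the 512-byte block size, and finding the size
by seeking to the end are all assumptions for illustration; the seek
works for regular files but may not for every device node.

```python
import os


def find_unreadable_blocks(path, block_size=512):
    """Read path block by block; return (blocks_read, bad_offsets)."""
    bad_offsets = []
    blocks_read = 0
    fd = os.open(path, os.O_RDONLY)
    try:
        # Assumption: seeking to the end yields the size in bytes.
        size = os.lseek(fd, 0, os.SEEK_END)
        for offset in range(0, size, block_size):
            try:
                os.pread(fd, block_size, offset)
            except OSError:
                # The kernel gave up on this block; note it and move on.
                bad_offsets.append(offset)
            else:
                blocks_read += 1
    finally:
        os.close(fd)
    return blocks_read, bad_offsets
```

Run against the raw partition, a non-empty bad_offsets list would mean
the errors really did reach userland; an empty list alongside kernel
MEDIUM ERROR messages would suggest the retries succeeded.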

> Before replacing the drive, I decided to run the Adaptec disk verification.
> It found a grand total of one bad sector and remapped it.  The only
> remaining damage was that fsck had deleted my newsgroups file and
> history.pag had one formerly bad sector.  Since the disk didn't appear
> to be hopeless, I replaced the newsgroups file and rebuilt history.pag,
> and things have been working flawlessly ever since.

I wouldn't use that disk for serious work again.  It's certainly good
for storing news articles, but no longer reliable enough for storing
your history database there.

Also, go through SCSI reformatting it.  This will cause the drive to
recreate the bad sector table as necessary.  You can even do this
without using the adapter BIOS, there's always /sbin/scsiformat for
this.  I once recovered another Seagate drive that suffered from
medium errors, and I am still using it (more than one year after
those problems).  However, I demoted it to a scratch drive for
release testing etc., and no longer use it for mission-critical
work.

-- 
cheers, J"org

joerg_wunsch@uriah.heep.sax.de -- http://www.sax.de/~joerg/ -- NIC: JW11-RIPE
Never trust an operating system you don't have sources for. ;-)

From owner-freebsd-fs  Sat Feb  1 06:25:18 1997
Return-Path: <owner-fs>
Received: (from root@localhost)
          by freefall.freebsd.org (8.8.5/8.8.5) id GAA23918
          for fs-outgoing; Sat, 1 Feb 1997 06:25:18 -0800 (PST)
Received: from gatekeeper.tsc.tdk.com (root@gatekeeper.tsc.tdk.com [207.113.159.21])
          by freefall.freebsd.org (8.8.5/8.8.5) with ESMTP id GAA23897;
          Sat, 1 Feb 1997 06:25:13 -0800 (PST)
Received: from sunrise.gv.tsc.tdk.com (root@sunrise.gv.tsc.tdk.com [192.168.241.191])
          by gatekeeper.tsc.tdk.com (8.8.4/8.8.4) with ESMTP
	  id GAA18114; Sat, 1 Feb 1997 06:25:02 -0800 (PST)
Received: from salsa.gv.tsc.tdk.com (salsa.gv.tsc.tdk.com [192.168.241.194])
          by sunrise.gv.tsc.tdk.com (8.8.4/8.8.4) with ESMTP
	  id GAA29183; Sat, 1 Feb 1997 06:25:00 -0800 (PST)
Received: (from gdonl@localhost)
          by salsa.gv.tsc.tdk.com (8.8.4/8.8.4)
	  id GAA28908; Sat, 1 Feb 1997 06:24:59 -0800 (PST)
From: Don Lewis <Don.Lewis@tsc.tdk.com>
Message-Id: <199702011424.GAA28908@salsa.gv.tsc.tdk.com>
Date: Sat, 1 Feb 1997 06:24:59 -0800
In-Reply-To: j@uriah.heep.sax.de (J Wunsch)
       "Re: SCSI disk MEDIUM ERROR with a few twists" (Feb  1,  2:29pm)
X-Mailer: Mail User's Shell (7.2.6 alpha(3) 7/19/95)
To: joerg_wunsch@uriah.heep.sax.de (Joerg Wunsch),
        Don.Lewis@tsc.tdk.com (Don Lewis)
Subject: Re: SCSI disk MEDIUM ERROR with a few twists
Cc: freebsd-fs@freebsd.org, freebsd-scsi@freebsd.org
Sender: owner-fs@freebsd.org
X-Loop: FreeBSD.org
Precedence: bulk

On Feb 1,  2:29pm, J Wunsch wrote:
} Subject: Re: SCSI disk MEDIUM ERROR with a few twists
} As Don Lewis wrote:
} 
} > 	/etc/daily doesn't report this
} 
} (and others don't report this)
} 
} Of course.  That's because buffered writes cannot report media errors
} to their caller.  The caller has already got an OK indication about
} the write operation, when the device driver finally notices the write
} error.  All the driver can do at this point is syslog the problem.

Yes, but this is the "unrecovered read error" so often mentioned in the
freebsd-scsi mail archive.  Also, tar and dump were definitely reading
it.  INN was probably doing both.

} You ought to check your syslog regularly.  The easiest way is to drop
} it onto all your logged in terminals :) (seriously, i do).

A syslog scanner is on my list of things to do.

} > It could be the filesystem, the SCSI driver, or the drive firmware.
} 
} It could be the drive itself.

The MEDIUM ERROR itself and the falling offline a week or so later
are definitely the fault of the drive.  That the error wasn't reported
to userland lies somewhere between the driver and userland, inclusive.

} What MEDIUM ERRORs are these?  You forgot to quote the most important
} thing, the driver message.

Ok, here it is:

Jan 18 04:30:33 news /kernel: sd0(ahc0:0:0): MEDIUM ERROR info:14683a asc:11,0 Unrecovered read error field replaceable unit: ea sks:80,11
Jan 18 04:30:34 news /kernel: , retries:4
Jan 18 04:30:35 news /kernel: sd0(ahc0:0:0): MEDIUM ERROR info:14683a asc:11,0 Unrecovered read error field replaceable unit: ea sks:80,11 
Jan 18 04:30:35 news /kernel: , retries:3
Jan 18 04:30:36 news /kernel: sd0(ahc0:0:0): MEDIUM ERROR info:14683a asc:11,0 Unrecovered read error field replaceable unit: ea sks:80,11
Jan 18 04:30:38 news /kernel: , retries:2
Jan 18 04:30:42 news /kernel: sd0(ahc0:0:0): MEDIUM ERROR info:14683a asc:11,0 Unrecovered read error field replaceable unit: ea sks:80,11
Jan 18 04:30:42 news /kernel: , retries:1
Jan 18 04:30:43 news /kernel: sd0(ahc0:0:0): MEDIUM ERROR info:14683a asc:11,0 Unrecovered read error field replaceable unit: ea sks:80,11
Jan 18 04:30:44 news /kernel: , FAILURE

Always the same info:#.
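
A first cut at the syslog scanner might look something like the Python
sketch below.  It pairs each MEDIUM ERROR line with the continuation
line that follows it (", retries:N" or ", FAILURE"), counts reports per
device and info value, and remembers which ones ended in FAILURE.  The
regular expressions are fitted to the ten lines quoted above and
nothing more; the names ERROR_RE, OUTCOME_RE and summarize are made up,
and other drivers or releases may format these messages differently.

```python
import re
from collections import Counter

# Fitted to the messages quoted above; other formats are not handled.
ERROR_RE = re.compile(
    r"(?P<dev>\w+)\((?P<bus>[^)]*)\): MEDIUM ERROR info:(?P<info>[0-9a-fA-F]+) "
    r"asc:(?P<asc>[0-9a-fA-F]+),(?P<ascq>[0-9a-fA-F]+)"
)
OUTCOME_RE = re.compile(r"/kernel: , (?P<outcome>retries:\d+|FAILURE)\s*$")


def summarize(lines):
    """Return (counts, failed) keyed by (device, info) pairs."""
    counts = Counter()
    failed = set()
    last_key = None
    for line in lines:
        match = ERROR_RE.search(line)
        if match:
            last_key = (match.group("dev"), match.group("info"))
            counts[last_key] += 1
            continue
        match = OUTCOME_RE.search(line)
        # The outcome arrives on its own line, so attach it to the
        # most recent error seen.
        if match and last_key is not None and match.group("outcome") == "FAILURE":
            failed.add(last_key)
    return counts, failed
```

Fed the messages above, this should report five errors for sd0, all
with info 14683a, and mark that pair as failed.  If the info field is a
hexadecimal block address (an assumption on my part), int("14683a", 16)
gives the decimal block number.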

} > I don't know whether the SCSI code isn't reporting this to the filesystem,
} > or the filesystem isn't reporting this to userland code, but dump didn't
} > seem to see a problem, tar didn't seem to see a problem.
} 
} It's interesting to know that dump didn't see the problem, since dump
} operates on the raw device, where error reporting is possible.  Are
} you sure they were _unrecovered_ medium errors, i.e. the kernel didn't
} successfully retry them?  Again, please *quote* the error messages,
} instead of assuming we know them.

Actually I'm not sure if it was recovered or not when I ran dump.  I
was running in single user at the time, so it was not logged.  It was the
same basic message, but I don't remember if it got all the way to FAILURE.
I didn't decide that I should report this until I had seen how badly the
filesystem *appeared* to have been munched by what turned out to be one
bad sector.  By that time, the sector had been remapped and I could no
longer reproduce the problem.

I also can't quote messages from its death throes before it wedged,
because this disk also contains /var and nothing was syslogged until
after I got the machine running multi-user again.  I *think* the message
was: "Logical unit is in process of becoming ready", but if so it was
lying.

} > Before replacing the drive, I decided to run the Adaptec disk verification.
} > It found a grand total of one bad sector and remapped it.  The only
} > remaining damage was that fsck had deleted my newsgroups file and
} > history.pag had one formerly bad sector.  Since the disk didn't appear
} > to be hopeless, I replaced the newsgroups file and rebuilt history.pag,
} > and things have been working flawlessly ever since.
} 
} I wouldn't use that disk for serious work again.  It's certainly good
} for storing news articles, but no longer reliable enough for storing
} your history database there.

If it was more than one sector it would already be gone, but in this
case I'm going to leave it running and keep a very close eye on it.
It gave me at least two weeks warning last time.  If it gets sick again,
then I can at least file a more complete report ;-)  Are there any
experiments you want me to try?

} Also, go through SCSI reformatting it.  This will cause the drive to
} recreate the bad sector table as necessary.  You can even do this
} without using the adapter BIOS, there's always /sbin/scsiformat for
} this.

The painful part is that this is the root disk, and I'm pretty sure the
2.1.x fixit disk doesn't contain scsiformat.  Doesn't remapping the sector
add the original to the drive's grown defect list?

			---  Truck