Date: Sat, 1 Feb 1997 00:11:02 -0800 (PST)
From: Don Lewis
Message-Id: <199702010811.AAA28411@salsa.gv.tsc.tdk.com>
To: freebsd-fs@freebsd.org, freebsd-scsi@freebsd.org
Subject: SCSI disk MEDIUM ERROR with a few twists
Sender: owner-freebsd-scsi@freebsd.org

I was recently bitten by a disk that developed a bad sector and am somewhat disturbed about a few things. First the vital statistics:

    FreeBSD 2.1.6
    Adaptec 2940UW
    Seagate ST31051N (Hawk)
    AWRE and ARRE are both enabled

This machine is our news server. The disk in question holds /, /usr, and the partition where the history file lives. The latter partition is the one that developed the problem. Unknown to me, the problem cropped up a couple weeks ago, which brings us to the first problem:

    /etc/daily doesn't report this

but these lists probably aren't the right place to report that.

This problem was logged, all the way to the point where FAILURE was reported once on January 16. It occurred a bunch of times on January 18. Things were quiet until January 28, when I noticed that the machine wasn't feeding any news.
I had a bunch of rlogin sessions open to the machine from the machine in my office, and when I tried to run any commands it responded with a message indicating some sort of I/O error. When I checked the machine's console, it was complaining about sd0 being not-ready. It decided to try to reboot when I typed on the keyboard, but hung because the disk wasn't ready. I power cycled the machine, and it started to boot but fsck decided that the one partition was hosed.

I ran fsck manually, and things looked pretty grim. Fsck complained about bad blocks, and the kernel complained about MEDIUM ERRORs (but I didn't think to write down the block numbers). Some of the messages from fsck made it pretty obvious that a number of inodes had been overwritten with total garbage (preposterous file sizes, block numbers way out of range), and the block numbers in either the inode or an indirect block for the newsgroups file had been overwritten with similar trash as well. I ran fsck a few times answering "yes" until things were clean. The second problem is:

    During this final failure, something overwrote some number of good blocks with garbage data. It could be the filesystem, the SCSI driver, or the drive firmware.

I then dump'ed everything on the disk in preparation for replacing it because I thought it was toast. During the process of dumping the news partition, I got a kernel complaint about a MEDIUM ERROR, but dump didn't complain. I also saved this partition using tar, and I got a MEDIUM ERROR when it was copying the history.pag file, but tar didn't complain. This brings us to the third problem:

    It appears that these errors aren't reported to userland.

I don't know whether the SCSI code isn't reporting this to the filesystem, or the filesystem isn't reporting this to userland code, but dump didn't seem to see a problem, and tar didn't seem to see a problem. Also innd didn't seem to see a problem even though it appears to do the proper checking.
It just seemed to accept duplicate articles on occasion, which I ended up reporting to inn-bugs. I guess I'll have to retract that bug report.

I looked at the SCSI code in -current, and its error handling seemed to be similar, so I hope y'all are interested.

Before replacing the drive, I decided to run the Adaptec disk verification. It found a grand total of one bad sector and remapped it. The only remaining damage was that fsck had deleted my newsgroups file and history.pag had one formerly bad sector. Since the disk didn't appear to be hopeless, I replaced the newsgroups file and rebuilt history.pag, and things have been working flawlessly ever since.

--- Truck