Date: Sun, 12 Sep 1999 14:30:01 -0700 (PDT)
From: milt@vicor-nb.com (milt)
To: freebsd-bugs@FreeBSD.org
Subject: Re: kern/11226: Invalid files on disk after fsync
Message-ID: <199909122130.OAA89290@freefall.freebsd.org>
The following reply was made to PR kern/11226; it has been noted by GNATS.

From: milt@vicor-nb.com (milt)
To: freebsd-gnats-submit@freebsd.org, milt@moth.vicor-nb.com,
    tedm@toybox.placo.com
Cc:
Subject: Re: kern/11226: Invalid files on disk after fsync
Date: Sun, 12 Sep 1999 14:10:10 -0700 (PDT)

 HIYA

 > I was struck by the similarities between this and what happens with our
 > news server.
 >
 > Once every 2 or 3 months I come in to find the server rebooted itself
 > during the night - and I'm sure it's due to a SCSI error because the news
 > spool is always scrambled. Usually, multiple invocations of fsck will
 > eventually clear the garbage out but I've had to regenerate the spools once
 > because of it.
 >
 > The only one time that I have ever seen anything in the message log
 > regarding this is a rather cryptic log entry:
 >
 > Aug 11 06:12:59 herald /kernel: biodone: buffer already done
 >
 > This was made seconds before the system rebooted itself during one of those
 > times.
 >
 > The server has done up to 2 million articles in a day and I'm sure that it's
 > during the peak times that this has happened.
 >
 > I suspect there is some sequencing bug or other within the Adaptec SCSI
 > driver - of course I'm running with all go-fast options turned on including
 > async mounts on the spools. Output of dmesg follows:

 This problem has been one of the most frustrating I have encountered.
 I think, Ted, that you are the only one in the world who ever really
 believed my description.

 What I don't see in the current problem report is that we eventually
 realized that the problem was always accompanied by an invalid disk
 address in an inode. We found that the disk fragment listed for file
 address 0xA000 for one or more files had one or both of the bits in
 0x18000 cleared. That still got us nowhere. It did explain why fsck
 was going nuts and destroying all sorts of things when this happened.

 Two things make this frustrating:

 1. I tried for weeks to make it happen in a test environment. Got a
    few failures, but never in any controlled manner, so I was never
    able to reproduce the failure reliably.

 2. We managed to get out of the soup without understanding or fixing
    the problem. (I hate when that happens!)

    a. We now mount all file systems with noatime. Since most of our
       accesses are reads, noatime cuts the inode writes down quite a
       bit, and this change reduced the problem from about 6 a week to
       about 1 every two weeks. (I don't have very good numbers on the
       frequency; the problems were spread out over 20 different hosts
       in 3 cities.) (It's rw,noatime in fstab - see man mount.) It's
       one of the speedup things, so you probably already have it.

    b. The problem has disappeared. We haven't had trouble in the past
       coupla months. I can't prove why, but I think it's because we
       are using NFS less these days. (Some perl scripts running on
       the offending hosts are now using non-NFS methods to get data
       from other hosts - NFS traffic is reduced but not yet
       eliminated.)

 You sure have my sympathy. I bet it ain't fun watching files
 disappear as you run fsck multiple times on your 50+ gig system!


To Unsubscribe: send mail to majordomo@FreeBSD.org
with "unsubscribe freebsd-bugs" in the body of the message
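
The symptom described in the message above (a fragment address with one or both of the bits in 0x18000 cleared) can be illustrated with a small sketch. This is not taken from the PR: the function name, the sample addresses, and the assumption that a known-good address is available for comparison (for example from a backup) are all hypothetical, chosen only to show what the mask test looks like.

```python
# Hypothetical illustration only: the addresses below are made up, and the
# message does not say how the expected (known-good) value was obtained.

SUSPECT_MASK = 0x18000  # the two bits the message reports as being cleared


def cleared_suspect_bits(expected_addr: int, observed_addr: int) -> int:
    """Return the bits inside SUSPECT_MASK that are set in expected_addr
    but clear in observed_addr (0 if neither bit was lost)."""
    return expected_addr & ~observed_addr & SUSPECT_MASK


if __name__ == "__main__":
    expected = 0x1A340  # assumed known-good fragment address (hypothetical)
    for observed in (0x1A340, 0x0A340, 0x12340, 0x02340):
        lost = cleared_suspect_bits(expected, observed)
        status = "ok" if lost == 0 else f"lost bits {lost:#x}"
        print(f"observed {observed:#07x}: {status}")
```

A check like this only says whether an address differs from a reference in those two bits; it says nothing about which layer (driver, buffer cache, NFS) cleared them, which is the part the message says was never resolved.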