FreeBSD Mail Archives

Date:      Sun, 12 Sep 1999 14:30:01 -0700 (PDT)
From:      milt@vicor-nb.com (milt)
To:        freebsd-bugs@FreeBSD.org
Subject:   Re: kern/11226: Invalid files on disk after fsync
Message-ID:  <199909122130.OAA89290@freefall.freebsd.org>


The following reply was made to PR kern/11226; it has been noted by GNATS.

From: milt@vicor-nb.com (milt)
To: freebsd-gnats-submit@freebsd.org, milt@moth.vicor-nb.com,
	tedm@toybox.placo.com
Cc:  
Subject: Re: kern/11226: Invalid files on disk after fsync
Date: Sun, 12 Sep 1999 14:10:10 -0700 (PDT)

 HIYA
 
 >  I was struck by the similarities between this and what happens with our
 > news server.
 >  
 >    Once every 2 or 3 months I come in to find the server rebooted itself
 >  during the night - and I'm sure it's due to a SCSI error because the news
 >  spool is always scrambled.  Usually, multiple invocations of fsck will
 >  eventually clear the garbage out but I've had to regenerate the spools once
 >  because of it.
 >  
 >    The only one time that I have ever seen anything in the message log
 >  regarding this is a rather cryptic log entry:
 >  
 >  Aug 11 06:12:59 herald /kernel: biodone: buffer already done
 >  
 >  This was made seconds before the system rebooted itself during one of those
 >  times.
 >  
 >  The server has done up to 2 million articles in a day and I'm sure that it's
 >  during the peak times that this has happened.
 >  
 >  I suspect there is some sequencing bug or other within the Adaptec SCSI
 >  driver - of course I'm running with all go-fast options turned on including
 >  async mounts on the spools.  Output of dmesg follows:
 
 This problem has been one of the most frustrating I have encountered.   I
 think, Ted, that you are the only one in the world who ever really
 believed my description.
 
 What the current problem report does not mention is that we eventually
 realized the problem was always accompanied by an invalid disk address in
 an inode.  In one or more files, the disk fragment listed for file address
 0xA000 had one or both of the bits in 0x18000 cleared.
 
 That still got us nowhere.   It did explain why fsck was going nuts and
 destroying all sorts of things when this happened.
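
For illustration, the bit pattern described above can be tested mechanically. The following is a minimal sketch, assuming a known-good copy of the fragment address is available for comparison. The function names and the sample addresses are hypothetical and do not come from the PR; only the mask 0x18000 is taken from the message.

```python
# Hypothetical helpers; only the mask value comes from the message above.
SUSPECT_MASK = 0x18000  # bits 15 and 16


def cleared_suspect_bits(addr, mask=SUSPECT_MASK):
    """Return the subset of the mask bits that are zero in addr."""
    return mask & ~addr


def looks_damaged(expected, observed, mask=SUSPECT_MASK):
    """True if observed equals expected except that one or both
    of the mask bits have gone from 1 to 0."""
    diff = expected ^ observed
    only_mask_bits = (diff & ~mask) == 0   # nothing outside the mask changed
    bits_were_cleared = (observed & diff) == 0  # changed bits are now 0
    return diff != 0 and only_mask_bits and bits_were_cleared


if __name__ == "__main__":
    good = 0x3A8F0          # made-up fragment address
    bad = good & ~0x8000    # same address with bit 15 cleared
    print(hex(cleared_suspect_bits(bad)), looks_damaged(good, bad))
```

A check like this only confirms the signature of the damage. It says nothing about which layer (driver, buffer cache, or NFS) cleared the bits.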
 
 Two things make this frustrating:
 
 1. I tried for weeks to make it happen in a test environment.   Got a few
    failures, but never in any controlled manner so I was never able to
    reproduce the failure reliably.
 
 2. We managed to get out of the soup without understanding or fixing the
    problem.  (I hate when that happens!)
 
    a. We now mount all file systems with noatime.  Since most of our
       accesses are reads, noatime cuts the inode writes down quite a bit and
       this change reduced the problem from about 6 a week to about 1 every
       two weeks.  (I don't have very good numbers on the frequency; the
       problems were spread out over 20 different hosts in 3 cities.)
 
       (It's rw,noatime in fstab - see man mount)   It's one of the speedup
       things so probably you already have it.
 
    b. The problem has disappeared.  We haven't had trouble in the past
       coupla months.  I can't prove why, but I think it's because we are
       using NFS less these days.  (Some perl scripts running on the
       offensive hosts are now using non-NFS methods to get data from other
       hosts - NFS traffic is reduced but not yet eliminated.)
 
 You sure have my sympathy.  I bet it ain't fun watching files disappear as
 you run fsck multiple times on your 50+ gig system!
 


To Unsubscribe: send mail to majordomo@FreeBSD.org
with "unsubscribe freebsd-bugs" in the body of the message

Want to link to this message? Use this URL: <https://mail-archive.FreeBSD.org/cgi/mid.cgi?199909122130.OAA89290>
