From owner-freebsd-hackers  Tue Feb 25 19:20:56 1997
Return-Path: <owner-hackers>
Received: (from root@localhost)
          by freefall.freebsd.org (8.8.5/8.8.5) id TAA09732
          for hackers-outgoing; Tue, 25 Feb 1997 19:20:56 -0800 (PST)
Received: from dg-rtp.dg.com (dg-rtp.rtp.dg.com [128.222.1.2])
          by freefall.freebsd.org (8.8.5/8.8.5) with SMTP id TAA09727
          for <freebsd-hackers@freefall.FreeBSD.org>; Tue, 25 Feb 1997 19:20:51 -0800 (PST)
Received: by dg-rtp.dg.com (5.4R3.10/dg-rtp-v02)
	id AA09918; Tue, 25 Feb 1997 22:20:08 -0500
Received: from ponds by dg-rtp.dg.com.rtp.dg.com; Tue, 25 Feb 1997 22:20 EST
Received: from lakes.water.net (lakes [10.0.0.3]) by ponds.water.net (8.8.3/8.7.3) with ESMTP id VAA19672 for <freebsd-hackers@freefall.cdrom.com>; Tue, 25 Feb 1997 21:43:32 -0500 (EST)
Received: (from rivers@localhost) by lakes.water.net (8.8.3/8.6.9) id VAA19324 for freebsd-hackers@freefall.cdrom.com; Tue, 25 Feb 1997 21:48:47 -0500 (EST)
Date: Tue, 25 Feb 1997 21:48:47 -0500 (EST)
From: Thomas David Rivers <ponds!rivers@dg-rtp.dg.com>
Message-Id: <199702260248.VAA19324@lakes.water.net>
To: ponds!freefall.cdrom.com!freebsd-hackers
Subject: Re: More on bad dir panics
Content-Type: text
Sender: owner-hackers@FreeBSD.ORG
X-Loop: FreeBSD.org
Precedence: bulk


> I have been trying to look around the crash dumps, as they are plentiful
> these days (twice a day seems to be the current rate).  These always happen
> at the same point and all crashes are similar: the crash occurs on directory
> lookup, stumbling over a block which contains something other than directory
> data.

 This "smells" very similar to my problems... perhaps we can divine
the intersection of these two problems and hit on a solution?

 Things I've determined:

	o) This can happen under a very light load.
	o) It happens on several types of hardware (SCSI, IDE, 386-586.)
	
 The problem appears to be related to inode allocation: an inode is
marked in the free inodes array as "available" (the bit isn't set),
and then some later code reads the inode from the disk, checks a
field (for the "dup alloc" panic, it's the "mode" field), and
discovers that, oops, the inode was in fact being used.

 Does that sound familiar?

 Some other interesting observations:

	o) This can happen with a brand-new file system: if you write
		trash to the device, then do a newfs, newfs believes it
	  	has correctly filled in all the inodes with 0, but some
		(at least one in my tests) aren't correctly zeroed.

	o) The problem "strikes" and gets progressively worse until
		the file system simply falls apart.  I'm up to two
		crashes a day myself on my news server;  also, a find in
		/usr/spool/news now produces a lot of "Bad file descriptor"
		messages, indicating other file system problems that
		fsck didn't correct.

	o) Running fsck once isn't enough to restore a file system to
		a semi-usable state; if you fsck it once, then try again,
		you'll sometimes notice more corrections.

	o) This isn't "new" - it's something I've experienced in all
		2.1 releases (although, until now, I was about the
		sole reporter of the problem.)  I mention this to try
		and narrow the scope of what we're looking for.  It was
		something that happened in the 2.1.0 time-frame.
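
 For the newfs observation above, one way to look for a stale inode
is to read the inode area back and scan it for any non-zero byte.
The sketch below shows only the scan, over a buffer already in
memory.  The 128-byte inode size and the names are assumptions made
for illustration; reading the right blocks off the raw device is
left out.

```c
#include <assert.h>
#include <stddef.h>

#define TOY_INODE_SIZE 128      /* assumed on-disk inode size, illustration only */

/*
 * Scan `count` inode-sized records in `buf` and return the index of
 * the first record containing any non-zero byte, or -1 if every
 * record is fully zeroed (what a fresh newfs is expected to leave).
 */
long first_nonzero_inode(const unsigned char *buf,
                         size_t inode_size, size_t count)
{
    assert(buf != NULL && inode_size > 0);
    for (size_t i = 0; i < count; i++) {
        const unsigned char *ino = buf + i * inode_size;
        for (size_t j = 0; j < inode_size; j++) {
            if (ino[j] != 0)
                return (long)i;
        }
    }
    return -1;
}
```

 A result other than -1 right after newfs would point at an inode
that was not cleared.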


	- Dave Rivers -