From owner-freebsd-fs@FreeBSD.ORG  Mon Nov 24 01:10:03 2003
Return-Path: <owner-freebsd-fs@FreeBSD.ORG>
Delivered-To: freebsd-fs@freebsd.org
Received: from mx1.FreeBSD.org (mx1.freebsd.org [216.136.204.125])
	by hub.freebsd.org (Postfix) with ESMTP id F28E016A4CF
	for <fs@freebsd.org>; Mon, 24 Nov 2003 01:10:02 -0800 (PST)
Received: from vbook.fbsd.ru (asplinux.ru [195.133.213.194])
	by mx1.FreeBSD.org (Postfix) with ESMTP id 912C243FCB
	for <fs@freebsd.org>; Mon, 24 Nov 2003 01:10:00 -0800 (PST)
	(envelope-from vova@vbook.fbsd.ru)
Received: from vova by vbook.fbsd.ru with local (Exim 4.24; FreeBSD)
	id 1AOCkn-0000E6-LO; Mon, 24 Nov 2003 12:11:33 +0300
From: "Vladimir B. Grebenschikov" <vova@fbsd.ru>
To: Erez Zadok <ezk@cs.sunysb.edu>
In-Reply-To: <200311211559.hALFxOLr015232@agora.fsl.cs.sunysb.edu>
References: <200311211559.hALFxOLr015232@agora.fsl.cs.sunysb.edu>
Content-Type: text/plain; charset=koi8-r
Content-Transfer-Encoding: quoted-printable
Organization: SWsoft Inc.
Message-Id: <1069665091.806.2.camel@localhost>
Mime-Version: 1.0
X-Mailer: Ximian Evolution 1.4.5 
Date: Mon, 24 Nov 2003 12:11:32 +0300
Sender: Vladimir Grebenschikov <vova@vbook.fbsd.ru>
cc: fs@freebsd.org
Subject: Re: "Reverse union" mount possible?
X-BeenThere: freebsd-fs@freebsd.org
X-Mailman-Version: 2.1.1
Precedence: list
List-Id: Filesystems <freebsd-fs.freebsd.org>
List-Unsubscribe: <http://lists.freebsd.org/mailman/listinfo/freebsd-fs>,
	<mailto:freebsd-fs-request@freebsd.org?subject=unsubscribe>
List-Archive: <http://lists.freebsd.org/pipermail/freebsd-fs>
List-Post: <mailto:freebsd-fs@freebsd.org>
List-Help: <mailto:freebsd-fs-request@freebsd.org?subject=help>
List-Subscribe: <http://lists.freebsd.org/mailman/listinfo/freebsd-fs>,
	<mailto:freebsd-fs-request@freebsd.org?subject=subscribe>
X-List-Received-Date: Mon, 24 Nov 2003 09:10:03 -0000

On Fri, 21.11.2003, at 18:59, Erez Zadok wrote:
> > BTW, I've heard that nullfs/unionfs doesn't allow code sharing. Does wrapfs do it?
>
> What do you mean by "code sharing"?  Licensing?  All of the freebsd fist
> templates use the BSD license.

I guess he meant that the same binary loaded from different unionfs/nullfs
mountpoints is treated by the kernel as different binaries from a paging/mmap
point of view (they have different vnodes).

> Erez.

--
Vladimir B. Grebenschikov <vova@fbsd.ru>
SWsoft Inc.

From owner-freebsd-fs@FreeBSD.ORG  Mon Nov 24 14:07:24 2003
Return-Path: <owner-freebsd-fs@FreeBSD.ORG>
Delivered-To: freebsd-fs@freebsd.org
Received: from mx1.FreeBSD.org (mx1.freebsd.org [216.136.204.125])
	by hub.freebsd.org (Postfix) with ESMTP
	id 9DF1216A4CE; Mon, 24 Nov 2003 14:07:24 -0800 (PST)
Received: from sploot.vicor-nb.com (sploot.vicor-nb.com [208.206.78.81])
	by mx1.FreeBSD.org (Postfix) with ESMTP
	id 18E0843FE1; Mon, 24 Nov 2003 14:07:23 -0800 (PST)
	(envelope-from kmarx@vicor.com)
Received: from vicor.com (localhost [127.0.0.1])
	by sploot.vicor-nb.com (8.12.8/8.12.8) with ESMTP id hAOM0u3g074364;
	Mon, 24 Nov 2003 14:00:56 -0800 (PST)
	(envelope-from kmarx@vicor.com)
Message-ID: <3FC27F98.8090801@vicor.com>
Date: Mon, 24 Nov 2003 14:00:56 -0800
From: Ken Marx <kmarx@vicor.com>
User-Agent: Mozilla/5.0 (X11; U; FreeBSD i386; en-US; rv:1.6a) Gecko/20031105
X-Accept-Language: en-us, en
MIME-Version: 1.0
To: Don Lewis <truckman@FreeBSD.org>
References: <200311180347.hAI3lmeF089505@gw.catspoiler.org>
In-Reply-To: <200311180347.hAI3lmeF089505@gw.catspoiler.org>
Content-Type: text/plain; charset=us-ascii; format=flowed
Content-Transfer-Encoding: 7bit
cc: freebsd-fs@FreeBSD.org
cc: mckusick@beastie.mckusick.com
Subject: Re: 4.8 ffs_dirpref problem
X-BeenThere: freebsd-fs@freebsd.org
X-Mailman-Version: 2.1.1
Precedence: list
List-Id: Filesystems <freebsd-fs.freebsd.org>
List-Unsubscribe: <http://lists.freebsd.org/mailman/listinfo/freebsd-fs>,
	<mailto:freebsd-fs-request@freebsd.org?subject=unsubscribe>
List-Archive: <http://lists.freebsd.org/pipermail/freebsd-fs>
List-Post: <mailto:freebsd-fs@freebsd.org>
List-Help: <mailto:freebsd-fs-request@freebsd.org?subject=help>
List-Subscribe: <http://lists.freebsd.org/mailman/listinfo/freebsd-fs>,
	<mailto:freebsd-fs-request@freebsd.org?subject=subscribe>
X-List-Received-Date: Mon, 24 Nov 2003 22:07:24 -0000


Don Lewis wrote:
> On 17 Nov, Ken Marx wrote:
> 
>>
>>Don Lewis wrote:
> 
> 
>>>Ok, I'll do the commit as soon as I can do some testing on my -STABLE
>>>box.
>>>
>>
>>Great. Please let us know when this happens. In fact,
>>I kind of got lost which you were planning to commit.
>>Can you point me to it, and I'll do one last overnight run.
> 
> 
> I just committed the version that sets minbfree to:
> 	max(1, avgbfree - avgbfree / 4)
> 
> You may want to continue to use the version that you are already running
> which sets minbfree to avgbfree.  I'm not committing my more complex
> version because it benchmarked worse for me than the version I
> committed.
> 
> I'm pretty sure that we can do better than this, but it will require a
> fair amount of tweaking and benchmarking.  For now, this version
> should work a lot better than the previous version of the code.
> 
> 
>>>>I was able to run a couple more tests here, and *believe* that the
>>>>fix to the hash table in vfs_bio.c will provide some relief
>>>>for cg block searches when things do fall into the linear search case.
>>>
>>>
>>>I'll see about cranking out a patch to use a Fibonacci hash.  It'll
>>>probably be a little while before I can find sufficient time, though.
>>>
>>
>>Ditto the above: thanks/keep us posted. Our clients are
>>anxious to have a 'final' kernel to run with. I think we'll
>>just give them what you commit, and sneak the hash fix in with
>>the security patch or some such. So, no rush, but do let me
>>know if you think it might happen sooner than, say, 2 weeks
>>so I can try and get it all in one release to them.
> 
> 
> I had some time to crank out a patch.  Give this a try and compare it to
> your hash patch.  It hasn't blown up my system, but I don't have any
> benchmark data on it.  You can just do the test where you fill the
> remaining space in the filesystem.  You won't need to do a newfs and
> start from scratch.  It would be great if you could compare the hash
> bucket sizes for the different versions of the hash.
> 
> 
> Index: sys/kern/vfs_bio.c
> ===================================================================
> RCS file: /home/ncvs/src/sys/kern/vfs_bio.c,v
> retrieving revision 1.242.2.21
> diff -u -r1.242.2.21 vfs_bio.c
> --- sys/kern/vfs_bio.c	9 Aug 2003 16:21:19 -0000	1.242.2.21
> +++ sys/kern/vfs_bio.c	18 Nov 2003 02:10:55 -0000
> @@ -140,6 +140,7 @@
>  	&bufreusecnt, 0, "");
>  
>  static int bufhashmask;
> +static int bufhashshift;
>  static LIST_HEAD(bufhashhdr, buf) *bufhashtbl, invalhash;
>  struct bqueues bufqueues[BUFFER_QUEUES] = { { 0 } };
>  char *buf_wmesg = BUF_WMESG;
> @@ -160,7 +161,20 @@
>  struct bufhashhdr *
>  bufhash(struct vnode *vnp, daddr_t bn)
>  {
> -	return(&bufhashtbl[(((uintptr_t)(vnp) >> 7) + (int)bn) & bufhashmask]);
> +	u_int64_t hashkey64;
> +	int hashkey; 
> +	
> +	/*
> +	 * Fibonacci hash, see Knuth's
> +	 * _Art of Computer Programming, Volume 3 / Sorting and Searching_
> +	 *
> +         * We reduce the argument to 32 bits before doing the hash to
> +	 * avoid the need for a slow 64x64 multiply on 32 bit platforms.
> +	 */
> +	hashkey64 = (u_int64_t)(uintptr_t)vnp + (u_int64_t)bn;
> +	hashkey = (((u_int32_t)(hashkey64 + (hashkey64 >> 32)) * 2654435769u) >>
> +	    bufhashshift) & bufhashmask;
> +	return(&bufhashtbl[hashkey]);
>  }
>  
>  /*
> @@ -319,8 +333,9 @@
>  bufhashinit(caddr_t vaddr)
>  {
>  	/* first, make a null hash table */
> +	bufhashshift = 29;
>  	for (bufhashmask = 8; bufhashmask < nbuf / 4; bufhashmask <<= 1)
> -		;
> +		bufhashshift--;
>  	bufhashtbl = (void *)vaddr;
>  	vaddr = vaddr + sizeof(*bufhashtbl) * bufhashmask;
>  	--bufhashmask;
> 
> 

Well, I'm mildly beflummoxed - I tried to compare hashtable performance
between all three known versions of the hashing - legacy power of 2,
the Vicor ^= hash, and Don's Fibonacci hash.

Running with 

       minifree = max( 1, avgifree / 4 );
       minbfree = max( 1, avgbfree );

all perform about the same, with no performance problems all
the way up to 100% disk capacity (didn't test into reserved space).
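
To make the thresholds concrete, here is a minimal user-space sketch in C.
It is not the ffs_dirpref code: the helper names (minbfree_committed,
minbfree_avg, minbfree_quarter, eligible_cgs) are made up for illustration,
and treating a cylinder group as eligible when its free block count is at
least minbfree is an assumption.  It only shows how the three minbfree
choices mentioned in this thread change how many groups pass the check.

```c
#include <assert.h>
#include <stddef.h>

static int
imax(int a, int b)
{
	return (a > b ? a : b);
}

/* Variant Don committed: max(1, avgbfree - avgbfree / 4). */
static int
minbfree_committed(int avgbfree)
{
	return (imax(1, avgbfree - avgbfree / 4));
}

/* Variant used for the runs described above: max(1, avgbfree). */
static int
minbfree_avg(int avgbfree)
{
	return (imax(1, avgbfree));
}

/* Looser variant that comes up later in this message: max(1, avgbfree / 4). */
static int
minbfree_quarter(int avgbfree)
{
	return (imax(1, avgbfree / 4));
}

/*
 * Count the cylinder groups whose free block count meets the threshold.
 * Assumption: "meets" means cg_bfree[i] >= minbfree.
 */
static size_t
eligible_cgs(const int *cg_bfree, size_t ncg, int minbfree)
{
	size_t i, n = 0;

	assert(cg_bfree != NULL && minbfree >= 1);
	for (i = 0; i < ncg; i++)
		if (cg_bfree[i] >= minbfree)
			n++;
	return (n);
}
```

With made-up free counts of 100, 400, 800, 1200 and 2500 blocks (average
1000), the thresholds come out as 750, 1000 and 250, so 3, 2 and 4 groups
qualify respectively.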

Looking at instrumentation to show freq and avg depth of the
hash buckets, everything seems very calm (mainly because
we're not hitting the linear searching very often, I'd presume).

I can't explain why I seemingly got performance problems
with similar (identical) minbfree code previously.

So, out of spite, I went back to 

	minbfree = max( 1, avgbfree/4 );

This does hit the hashtable harder for the legacy version
and not so much for either new flavor. Here are a few
samplings of calling my dump routine from the debugger.
"avgdepth" really means 'search depth' since we use
the depth reached after finding a bp in gbincore.

The line below such as,

	 0: avgdepth[1] cnt=801

means that 801 of the hashtable buckets had an avg search
depth of 1 at the time the debug routine was called.
The 'N:' prefix means the N-th unique non-zero such value.
So large cnt's for small []'d depth values mean an efficient hash.
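
The dump routine itself isn't shown here, so the following is only a guess
at its shape: a user-space C sketch that takes an array of per-bucket
average search depths and tallies how many buckets share each non-zero
value, in order of first appearance.  The names depth_count,
depth_histogram and print_histogram are hypothetical.

```c
#include <assert.h>
#include <stddef.h>
#include <stdio.h>

struct depth_count {
	unsigned depth;	/* average search depth shared by `cnt` buckets */
	unsigned cnt;	/* number of buckets with that average depth */
};

/*
 * Fill `out` with one entry per distinct non-zero depth, in order of
 * first appearance, and return the number of entries used.
 */
static size_t
depth_histogram(const unsigned *avgdepth, size_t nbuckets,
    struct depth_count *out, size_t max_out)
{
	size_t i, j, n = 0;

	assert(avgdepth != NULL && out != NULL);
	for (i = 0; i < nbuckets; i++) {
		if (avgdepth[i] == 0)
			continue;		/* zero depths are not reported */
		for (j = 0; j < n; j++)
			if (out[j].depth == avgdepth[i])
				break;
		if (j == n) {
			if (n == max_out)
				continue;	/* no room for another value */
			out[n].depth = avgdepth[i];
			out[n].cnt = 0;
			n++;
		}
		out[j].cnt++;
	}
	return (n);
}

/* Print the tally in the "N: avgdepth[d] cnt=c" layout used below. */
static void
print_histogram(const struct depth_count *h, size_t n)
{
	size_t i;

	for (i = 0; i < n; i++)
		printf("%zu: avgdepth[%u] cnt=%u\n", i, h[i].depth, h[i].cnt);
}
```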

I've edited out the details as much as possible.

LEGACY:
--------
Nov 24 13:34:54 oos0b /kernel: bh[442/0x1ba]: freq=2706110, avgdepth = 154
...
Nov 24 13:34:54 oos0b /kernel: 0: avgdepth[1] cnt=1015
Nov 24 13:34:54 oos0b /kernel: 1: avgdepth[2] cnt=7
Nov 24 13:34:54 oos0b /kernel: 2: avgdepth[154] cnt=1	<- !!
Nov 24 13:34:54 oos0b /kernel: 3: avgdepth[3] cnt=1
 -----------

Nov 24 13:36:49 oos0b /kernel: bh[442/0x1ba]: freq=3416953, avgdepth = 141
...
Nov 24 13:36:49 oos0b /kernel: 0: avgdepth[1] cnt=1017
Nov 24 13:36:49 oos0b /kernel: 1: avgdepth[141] cnt=1
Nov 24 13:36:49 oos0b /kernel: 2: avgdepth[2] cnt=6

VICOR x-or hashtable:
---------------------
Nov 24 13:07:24 oos0b /kernel: 0: avgdepth[1] cnt=762
Nov 24 13:07:24 oos0b /kernel: 1: avgdepth[2] cnt=259
Nov 24 13:07:24 oos0b /kernel: 2: avgdepth[3] cnt=3
 -----------

Nov 24 13:08:07 oos0b /kernel: 0: avgdepth[1] cnt=744
Nov 24 13:08:07 oos0b /kernel: 1: avgdepth[2] cnt=275
Nov 24 13:08:07 oos0b /kernel: 2: avgdepth[3] cnt=5

FIBONACCI:
----------
Nov 24 11:56:50 oos0b /kernel: 0: avgdepth[1] cnt=811
Nov 24 11:56:50 oos0b /kernel: 1: avgdepth[3] cnt=88
Nov 24 11:56:50 oos0b /kernel: 2: avgdepth[2] cnt=124
Nov 24 11:56:50 oos0b /kernel: 3: avgdepth[0] cnt=1
 -----------

Nov 24 11:57:48 oos0b /kernel: 0: avgdepth[1] cnt=801
Nov 24 11:57:48 oos0b /kernel: 1: avgdepth[3] cnt=93
Nov 24 11:57:48 oos0b /kernel: 2: avgdepth[2] cnt=130

So, while this is far from analytically exhaustive,
it almost appears the Fibonacci hash has more entries
of depth 3, while the Vicor one has more at depth 2.
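
For anyone who wants to poke at bucket distribution outside the kernel,
here is a minimal user-space sketch of the legacy hash and the Fibonacci
hash from the patch quoted above.  The arithmetic follows the diff; the
function names (legacy_hash, fib_hash, hash_params, max_bucket_load), the
fixed 1024-entry scratch table, and the strided block-number pattern are
assumptions made for illustration, not anything measured on the test box.

```c
#include <assert.h>
#include <stdint.h>

/* Pre-patch hash: shift the vnode pointer, add the block number, mask. */
static unsigned
legacy_hash(uintptr_t vnp, int64_t bn, unsigned mask)
{
	return ((unsigned)(((vnp >> 7) + (uintptr_t)(int)bn) & mask));
}

/* Patched hash: fold the key to 32 bits, multiply by 2654435769, keep the top bits. */
static unsigned
fib_hash(uintptr_t vnp, int64_t bn, unsigned shift, unsigned mask)
{
	uint64_t key64 = (uint64_t)vnp + (uint64_t)bn;
	uint32_t key32 = (uint32_t)(key64 + (key64 >> 32));

	return (((uint32_t)(key32 * 2654435769u) >> shift) & mask);
}

/* Same sizing loop as bufhashinit() in the patch: power-of-two table, matching shift. */
static void
hash_params(int nbuf, unsigned *shift, unsigned *mask)
{
	unsigned size;

	*shift = 29;
	for (size = 8; size < (unsigned)nbuf / 4; size <<= 1)
		(*shift)--;
	*mask = size - 1;
}

/*
 * Hash `n` block numbers spaced `stride` apart on one vnode and return
 * the population of the fullest bucket.
 */
static unsigned
max_bucket_load(int use_fib, uintptr_t vnp, int64_t stride, unsigned n,
    unsigned shift, unsigned mask)
{
	unsigned counts[1024] = { 0 };
	unsigned i, h, worst = 0;

	assert(mask < 1024);	/* scratch table is fixed at 1024 entries */
	for (i = 0; i < n; i++) {
		int64_t bn = (int64_t)i * stride;

		h = use_fib ? fib_hash(vnp, bn, shift, mask) :
		    legacy_hash(vnp, bn, mask);
		if (++counts[h] > worst)
			worst = counts[h];
	}
	return (worst);
}
```

With a 1024-bucket table, block numbers spaced exactly 1024 apart all land
in a single bucket under the legacy hash, while the Fibonacci variant
spreads them across many buckets.  That stride is a contrived worst case
chosen to show the effect, not a claim about what the real workload does.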

I'm happy to run more tests if you have ideas. I'm also fine
to cut bait and go with whatever you decide. It *seems* like
putting the Fibonacci hash in is prudent since the current hash
has been observed to be expensive. I had trouble proving this
unequivocally though. So, perhaps Don's minbfree fix is sufficient
after all. I'm tempted at this point to go with the 100% flavor.

Apologies for the delays and any confusion,
k
--
Ken Marx, kmarx@vicor-nb.com
If we form a subcomittee we will reach agreement and stop beating around the 
bush on the bandwith issues.
		- http://www.bigshed.com/cgi-bin/speak.cgi

From owner-freebsd-fs@FreeBSD.ORG  Tue Nov 25 12:03:14 2003
Return-Path: <owner-freebsd-fs@FreeBSD.ORG>
Delivered-To: freebsd-fs@freebsd.org
Received: from mx1.FreeBSD.org (mx1.freebsd.org [216.136.204.125])
	by hub.freebsd.org (Postfix) with ESMTP id AD22216A4CE
	for <fs@freebsd.org>; Tue, 25 Nov 2003 12:03:14 -0800 (PST)
Received: from filer.fsl.cs.sunysb.edu (filer.fsl.cs.sunysb.edu
	[130.245.126.2])
	by mx1.FreeBSD.org (Postfix) with ESMTP id 1AC6443FE1
	for <fs@freebsd.org>; Tue, 25 Nov 2003 12:03:13 -0800 (PST)
	(envelope-from ezk@fsl.cs.sunysb.edu)
Received: from agora.fsl.cs.sunysb.edu
	(IDENT:kKAxUarwB3KkbM9iDWzCUB32hhq31iUR@agora.fsl.cs.sunysb.edu
	[130.245.126.12])hAPK32Hn028366
	for <fs@freebsd.org>; Tue, 25 Nov 2003 15:03:02 -0500
Received: from agora.fsl.cs.sunysb.edu
	(IDENT:vDCGSRndak151TKTLFKR08pIYeYTm+l0@localhost.localdomain [127.0.0.1])
	hAPK3Bg9017040;	Tue, 25 Nov 2003 15:03:11 -0500
Received: (from ezk@localhost)
	by agora.fsl.cs.sunysb.edu (8.12.8/8.12.8/Submit) id hAPK3Bb9017036;
	Tue, 25 Nov 2003 15:03:11 -0500
Date: Tue, 25 Nov 2003 15:03:11 -0500
Message-Id: <200311252003.hAPK3Bb9017036@agora.fsl.cs.sunysb.edu>
From: Erez Zadok <ezk@cs.sunysb.edu>
To: fs@freebsd.org
X-MailKey: Erez_Zadok
Subject: vnode refcnt bug?
X-BeenThere: freebsd-fs@freebsd.org
X-Mailman-Version: 2.1.1
Precedence: list
List-Id: Filesystems <freebsd-fs.freebsd.org>
List-Unsubscribe: <http://lists.freebsd.org/mailman/listinfo/freebsd-fs>,
	<mailto:freebsd-fs-request@freebsd.org?subject=unsubscribe>
List-Archive: <http://lists.freebsd.org/pipermail/freebsd-fs>
List-Post: <mailto:freebsd-fs@freebsd.org>
List-Help: <mailto:freebsd-fs-request@freebsd.org?subject=help>
List-Subscribe: <http://lists.freebsd.org/mailman/listinfo/freebsd-fs>,
	<mailto:freebsd-fs-request@freebsd.org?subject=subscribe>
X-List-Received-Date: Tue, 25 Nov 2003 20:03:14 -0000

Please see this short thread of discussion on amd-dev.  I've included two
messages from this thread.  It suggests that fbsd5 may have a vnode refcount
bug (a vnode isn't held where it should be).

I've not personally investigated this bug.  Has anyone on fs@ come
across such a possible bug?

Thanks
Erez.

------- Forwarded Message

Date:    Tue, 25 Nov 2003 18:41:40 +0100
From:    scholler@fnb.tu-darmstadt.de (Ulrich Scholler)
To:      amd-dev@cs.columbia.edu
Subject: Re: amd unmounting out from under a cwd

Hi,

On Tue Nov 25, 2003 at 12:29:27 -0500, Andrew Siegel wrote:
> What's happening is: I cd into an automounted directory,
> the mount occurs normally, and I leave my shell there.
> 5 minutes later, amd unmounts that mount point out from
> under me.  The shell no longer has a working directory:
> 
>    zayin abs % pwd
>    pwd: .: No such file or directory
> 
> 
> I have several hundred machines here (Redhat 7.3, IRIX 6.5.12,
> FreeBSD 4) running amd (mostly 6.0.9), all with the same
> configuration file and maps, and this is the only one that
> shows this problem, making me think it's something about
> FreeBSD 5.1.
> 
> I'm attaching a debugging log.  The directory that is being
> mounted and then incorrectly unmounted is /u/abs.

Are you sure that your shell's cwd is actually /u/abs?  Some programs
seem to dereference the symlink and set the cwd to the actual mount
point.  amd is perfectly right to unmount it, since it is not accessed
via the amd-provided symlink.

Regards,

uLI
_______________________________________________
amd-dev mailing list: amd-dev@cs.columbia.edu
Am-utils: http://www.am-utils.org


------- End of Forwarded Message

------- Forwarded Message

Date:    Tue, 25 Nov 2003 11:24:42 -0700
From:    John E Hein <jhein@timing.com>
To:      Andrew Siegel <abs@blueskystudios.com>
Cc:      amd-dev@cs.columbia.edu
Subject: amd unmounting out from under a cwd

Andrew Siegel wrote at 12:29 -0500 on Nov 25:
 > I've got a problem that I've never seen before with amd
 > under FreeBSD 5.1.  Versions 6.0.7 (as delivered with the
 > FreeBSD 5.1 distribution) and 6.1b4 (compiled by me) share
 > this problem.

Definitely a FreeBSD 5.* problem.  I've noticed it since using early
versions of 5.  I haven't tracked it down yet since it's been more of
an inconvenience than anything (for instance, 'pushd /tmp ; popd'
"fixes" it).


------- End of Forwarded Message

From owner-freebsd-fs@FreeBSD.ORG  Tue Nov 25 13:07:34 2003
Return-Path: <owner-freebsd-fs@FreeBSD.ORG>
Delivered-To: freebsd-fs@freebsd.org
Received: from mx1.FreeBSD.org (mx1.freebsd.org [216.136.204.125])
	by hub.freebsd.org (Postfix) with ESMTP id 1362316A4CE
	for <fs@freebsd.org>; Tue, 25 Nov 2003 13:07:34 -0800 (PST)
Received: from salmon.maths.tcd.ie (salmon.maths.tcd.ie [134.226.81.11])
	by mx1.FreeBSD.org (Postfix) with SMTP id 9DF7043F93
	for <fs@freebsd.org>; Tue, 25 Nov 2003 13:07:30 -0800 (PST)
	(envelope-from iedowse@maths.tcd.ie)
Received: from walton.maths.tcd.ie by salmon.maths.tcd.ie with SMTP
          id <aa96370@salmon>; 25 Nov 2003 21:07:29 +0000 (GMT)
To: Erez Zadok <ezk@cs.sunysb.edu>
In-Reply-To: Your message of "Tue, 25 Nov 2003 15:03:11 EST."
             <200311252003.hAPK3Bb9017036@agora.fsl.cs.sunysb.edu> 
Date: Tue, 25 Nov 2003 21:07:29 +0000
From: Ian Dowse <iedowse@maths.tcd.ie>
Message-ID: <200311252107.aa96370@salmon.maths.tcd.ie>
cc: fs@freebsd.org
Subject: Re: vnode refcnt bug? 
X-BeenThere: freebsd-fs@freebsd.org
X-Mailman-Version: 2.1.1
Precedence: list
List-Id: Filesystems <freebsd-fs.freebsd.org>
List-Unsubscribe: <http://lists.freebsd.org/mailman/listinfo/freebsd-fs>,
	<mailto:freebsd-fs-request@freebsd.org?subject=unsubscribe>
List-Archive: <http://lists.freebsd.org/pipermail/freebsd-fs>
List-Post: <mailto:freebsd-fs@freebsd.org>
List-Help: <mailto:freebsd-fs-request@freebsd.org?subject=help>
List-Subscribe: <http://lists.freebsd.org/mailman/listinfo/freebsd-fs>,
	<mailto:freebsd-fs-request@freebsd.org?subject=subscribe>
X-List-Received-Date: Tue, 25 Nov 2003 21:07:34 -0000

In message <200311252003.hAPK3Bb9017036@agora.fsl.cs.sunysb.edu>, Erez Zadok writes:
>Please see this short thread of discussion on amd-dev.  I've included two
>messages from this thread.  It suggests that fbsd5 may have a vnode refcount
>bug (a vnode isn't held where it should be).
>
>I've not personally investigated this bug.  Has anyone on fs@ come
>across such a possible bug?

Hmm, I guess it is caused by checkdirs() in vfs_mount.c moving the
process cwd to the underlying vnode before attempting the unmount.
Does this only happen if the cwd is at the mount point itself?

When a file system is first mounted, checkdirs() looks for processes
that had a cwd or chroot set to the vnode that is about to be
covered.  It moves these processes to the new mountpoint vnode.
This behaviour goes back a long time (I'm not sure what the reasons
were), but it had the problem that you would get a "Device busy"
error if you attempted to unmount the file system later, and a
forced unmount would leave the process with a stale cwd or chroot
vnode (i.e.  "mount /mnt; umount /mnt" would fail if any processes
previously had a cwd of /mnt, and "mount /mnt; umount -f /mnt" would
cause such processes to lose their reference to the /mnt directory).

More recently (Feb 2001), I changed unmount to undo the checkdirs()
step so that processes with a cwd or chroot at the mount point get
moved back to the covered vnode before the unmount is attempted.
This fixes the two issues, but it has the side-effect that if the
only vnode references to a file system are processes whose cwd or
chroot directory is on the mountpoint, then the unmount will succeed,
and those processes will be moved to the underlying directory.
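
The mount/unmount behaviour described above can be modelled in a few lines
of user-space C.  This is a toy, not the code in vfs_mount.c: toy_vnode,
toy_proc, toy_mount, toy_checkdirs(), toy_mount_fs() and toy_unmount() are
invented names, a single usecount stands in for every kind of vnode
reference, and putting processes back after a failed unmount is a
simplification made here.  The undo_checkdirs argument selects between the
older behaviour (0) and the behaviour after the Feb 2001 change (1).

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>

struct toy_vnode {
	int usecount;			/* all references, cwd included */
};

struct toy_proc {
	struct toy_vnode *cwd;
};

struct toy_mount {
	struct toy_vnode *covered;	/* directory hidden by the mount */
	struct toy_vnode *root;		/* root vnode of the mounted fs */
};

/* Move every process whose cwd is `from` over to `to`; return how many moved. */
static int
toy_checkdirs(struct toy_proc *procs, size_t nprocs,
    struct toy_vnode *from, struct toy_vnode *to)
{
	size_t i;
	int moved = 0;

	for (i = 0; i < nprocs; i++) {
		if (procs[i].cwd != from)
			continue;
		from->usecount--;
		to->usecount++;
		procs[i].cwd = to;
		moved++;
	}
	return (moved);
}

/* At mount time, processes sitting on the covered vnode move to the new root. */
static void
toy_mount_fs(struct toy_proc *procs, size_t nprocs, struct toy_mount *m)
{
	toy_checkdirs(procs, nprocs, m->covered, m->root);
}

/* Return 0 on success or EBUSY if the fs root is still referenced. */
static int
toy_unmount(struct toy_proc *procs, size_t nprocs, struct toy_mount *m,
    int undo_checkdirs)
{
	assert(m != NULL && m->root != NULL && m->covered != NULL);
	if (undo_checkdirs)
		toy_checkdirs(procs, nprocs, m->root, m->covered);
	if (m->root->usecount > 0) {
		/* Toy choice: a failed unmount leaves every cwd where it was. */
		if (undo_checkdirs)
			toy_checkdirs(procs, nprocs, m->covered, m->root);
		return (EBUSY);
	}
	return (0);
}
```

In this model a shell whose cwd is the mount point makes toy_unmount() fail
with EBUSY when undo_checkdirs is 0, and quietly ends up on the covered
directory when it is 1, which matches the symptom reported on amd-dev.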

The reference count checks could be moved to before checkdirs(),
but I think there are cases where the current behaviour is preferable,
so maybe it needs to be an unmount() flag...  BTW, does amd delete
the mountpoint directory after the unmount? That would explain why
the directory goes away entirely.

Ian

From owner-freebsd-fs@FreeBSD.ORG  Tue Nov 25 13:24:21 2003
Return-Path: <owner-freebsd-fs@FreeBSD.ORG>
Delivered-To: freebsd-fs@freebsd.org
Received: from mx1.FreeBSD.org (mx1.freebsd.org [216.136.204.125])
	by hub.freebsd.org (Postfix) with ESMTP id 02F0516A4CE
	for <fs@freebsd.org>; Tue, 25 Nov 2003 13:24:21 -0800 (PST)
Received: from filer.fsl.cs.sunysb.edu (filer.fsl.cs.sunysb.edu
	[130.245.126.2])
	by mx1.FreeBSD.org (Postfix) with ESMTP id 9E44243FE9
	for <fs@freebsd.org>; Tue, 25 Nov 2003 13:24:16 -0800 (PST)
	(envelope-from ezk@fsl.cs.sunysb.edu)
Received: from agora.fsl.cs.sunysb.edu
	(IDENT:uZmDaDLDMapOqQ++SOtAtXyBgQpmMENs@agora.fsl.cs.sunysb.edu
	[130.245.126.12])hAPLMIHn029059;	Tue, 25 Nov 2003 16:22:18 -0500
Received: from agora.fsl.cs.sunysb.edu
	(IDENT:MbDxfAsrbfpIJzB2czikWlNTDFcDQ1hw@localhost.localdomain [127.0.0.1])
	hAPLMRg9018538;	Tue, 25 Nov 2003 16:22:27 -0500
Received: (from ezk@localhost)
	by agora.fsl.cs.sunysb.edu (8.12.8/8.12.8/Submit) id hAPLMRfE018534;
	Tue, 25 Nov 2003 16:22:27 -0500
Date: Tue, 25 Nov 2003 16:22:27 -0500
Message-Id: <200311252122.hAPLMRfE018534@agora.fsl.cs.sunysb.edu>
From: Erez Zadok <ezk@cs.sunysb.edu>
To: Ian Dowse <iedowse@maths.tcd.ie>
In-reply-to: Your message of "Tue, 25 Nov 2003 21:07:29 GMT."
             <200311252107.aa96370@salmon.maths.tcd.ie> 
X-MailKey: Erez_Zadok
cc: amd-dev@cs.columbia.edu
cc: Erez Zadok <ezk@cs.sunysb.edu>
cc: fs@freebsd.org
Subject: Re: vnode refcnt bug? 
X-BeenThere: freebsd-fs@freebsd.org
X-Mailman-Version: 2.1.1
Precedence: list
List-Id: Filesystems <freebsd-fs.freebsd.org>
List-Unsubscribe: <http://lists.freebsd.org/mailman/listinfo/freebsd-fs>,
	<mailto:freebsd-fs-request@freebsd.org?subject=unsubscribe>
List-Archive: <http://lists.freebsd.org/pipermail/freebsd-fs>
List-Post: <mailto:freebsd-fs@freebsd.org>
List-Help: <mailto:freebsd-fs-request@freebsd.org?subject=help>
List-Subscribe: <http://lists.freebsd.org/mailman/listinfo/freebsd-fs>,
	<mailto:freebsd-fs-request@freebsd.org?subject=subscribe>
X-List-Received-Date: Tue, 25 Nov 2003 21:24:21 -0000

Ian, I'm CC-ing my reply to the am-utils developers mailing list, amd-dev.
Let's keep this thread on both fs@ and amd-dev for a bit.

Can the people on amd-dev who noticed this problem please answer Ian's
questions?

In message <200311252107.aa96370@salmon.maths.tcd.ie>, Ian Dowse writes:
> In message <200311252003.hAPK3Bb9017036@agora.fsl.cs.sunysb.edu>, Erez Zadok writes:
> >Please see this short thread of discussion on amd-dev.  I've included two
> >messages from this thread.  It suggests that fbsd5 may have a vnode refcount
> >bug (a vnode isn't held where it should be).
> >
> >I've not personally investigated this bug.  Has anyone on fs@ come
> >across such a possible bug?
> 
> Hmm, I guess it is caused by checkdirs() in vfs_mount.c moving the
> process cwd to the underlying vnode before attempting the unmount.
> Does this only happen if the cwd is at the mount point itself?
> 
> When a file system is first mounted, checkdirs() looks for processes
> that had a cwd or chroot set to the vnode that is about to be
> covered.  It moves these processes to the new mountpoint vnode.
> This behaviour goes back a long time (I'm not sure what the reasons
> were), but it had the problem that you would get a "Device busy"
> error if you attempted to unmount the file system later, and a
> forced unmount would leave the process with a stale cwd or chroot
> vnode (i.e.  "mount /mnt; umount /mnt" would fail if any processes
> previously had a cwd of /mnt, and "mount /mnt; umount -f /mnt" would
> cause such processes to lose their reference to the /mnt directory).
> 
> More recently (Feb 2001), I changed unmount to undo the checkdirs()
> step so that processes with a cwd or chroot at the mount point get
> moved back to the covered vnode before the unmount is attempted.
> This fixes the two issues, but it has the side-effect that if the
> only vnode references to a file system are processes whose cwd or
> chroot directory is on the mountpoint, then the unmount will succeed,
> and those processes will be moved to the underlying directory.

Hmmm, yes I think that could be a serious problem (esp. since fbsd doesn't
have autofs yet).  And I think it deviates from "norms" where a cwd is
essentially occupying a vnode within the mounted f/s and therefore the f/s
shouldn't be unmounted!  This is rather bad for users who sit on an nfs mnt
point, ls'ing files happily, and then the kernel unmounts the mnt pt, moves
their cwd down to the covered (typically empty) vnode, and the poor user's
next /bin/ls shows nothing.

Personally, having dealt w/ stackable f/s for a while, I found that when the
kernel tries to do all sorts of things from "under the feet" of the application
(or any other upper-layer kernel component), it opens up avenues for trouble.
Yes, maybe an un/mount() flag will solve this issue.  But I'd like to see
the more normal EBUSY-on-cwd behavior restored, and an un/mount flag for
those who really want the new behavior.

I'm a big proponent of backwards compatibility, and new features gradually
introduced through flags/options.  And if I want to force an unmount of a
mnt pt and I get EBUSY, I do lsof and then /bin/kill any process sitting on
the mnt pt; that's expected behavior (what does POSIX say?)

> The reference count checks could be moved to before checkdirs(),
> but I think there are cases where the current behaviour is preferable,
> so maybe it needs to be an unmount() flag...  BTW, does amd delete
> the mountpoint directory after the unmount? That would explain why
> the directory goes away entirely.

If Amd created the mount point when it started (say, the mnt pt didn't
exist), then Amd will also try to rmdir it upon unmount.

> Ian

Cheers,
Erez.

From owner-freebsd-fs@FreeBSD.ORG  Tue Nov 25 15:38:46 2003
Return-Path: <owner-freebsd-fs@FreeBSD.ORG>
Delivered-To: freebsd-fs@freebsd.org
Received: from mx1.FreeBSD.org (mx1.freebsd.org [216.136.204.125])
	by hub.freebsd.org (Postfix) with ESMTP id E4C7016A4CE
	for <fs@freebsd.org>; Tue, 25 Nov 2003 15:38:46 -0800 (PST)
Received: from salmon.maths.tcd.ie (salmon.maths.tcd.ie [134.226.81.11])
	by mx1.FreeBSD.org (Postfix) with SMTP id 999FA43F93
	for <fs@freebsd.org>; Tue, 25 Nov 2003 15:38:45 -0800 (PST)
	(envelope-from iedowse@maths.tcd.ie)
Received: from walton.maths.tcd.ie by salmon.maths.tcd.ie with SMTP
          id <aa05451@salmon>; 25 Nov 2003 23:38:45 +0000 (GMT)
To: Erez Zadok <ezk@cs.sunysb.edu>
In-Reply-To: Your message of "Tue, 25 Nov 2003 16:22:27 EST."
             <200311252122.hAPLMRfE018534@agora.fsl.cs.sunysb.edu> 
Date: Tue, 25 Nov 2003 23:38:44 +0000
From: Ian Dowse <iedowse@maths.tcd.ie>
Message-ID: <200311252338.aa05451@salmon.maths.tcd.ie>
cc: amd-dev@cs.columbia.edu
cc: fs@freebsd.org
Subject: Re: vnode refcnt bug? 
X-BeenThere: freebsd-fs@freebsd.org
X-Mailman-Version: 2.1.1
Precedence: list
List-Id: Filesystems <freebsd-fs.freebsd.org>
List-Unsubscribe: <http://lists.freebsd.org/mailman/listinfo/freebsd-fs>,
	<mailto:freebsd-fs-request@freebsd.org?subject=unsubscribe>
List-Archive: <http://lists.freebsd.org/pipermail/freebsd-fs>
List-Post: <mailto:freebsd-fs@freebsd.org>
List-Help: <mailto:freebsd-fs-request@freebsd.org?subject=help>
List-Subscribe: <http://lists.freebsd.org/mailman/listinfo/freebsd-fs>,
	<mailto:freebsd-fs-request@freebsd.org?subject=subscribe>
X-List-Received-Date: Tue, 25 Nov 2003 23:38:47 -0000

In message <200311252122.hAPLMRfE018534@agora.fsl.cs.sunysb.edu>, Erez Zadok writes:
>Hmmm, yes I think that could be a serious problem (esp. since fbsd doesn't
>have autofs yet).  And I think it deviates from "norms" where a cwd is
>essentially occupying a vnode within the mounted f/s and therefore the f/s
>shouldn't be unmounted!  This is rather bad for users who sit on an nfs mnt
>point, ls'ing files happily, and then the kernel unmounts the mnt pt, moves
>their cwd down to the covered (typically empty) vnode, and the poor user's
>next /bin/ls shows nothing.

Yes, I agree completely - however the question of what to do with
references to about-to-be-covered vnodes at mount time still remains.
I'll have to look in more detail at why the checkdirs() approach
was needed in the first place to see if simply removing it is an
option.

Any other approaches I can think of right now for solving this issue
appear to either extend the original checkdirs() hack, or else just
replace one kind of undesirable behaviour with another.

Ian

From owner-freebsd-fs@FreeBSD.ORG  Tue Nov 25 15:58:00 2003
Return-Path: <owner-freebsd-fs@FreeBSD.ORG>
Delivered-To: freebsd-fs@freebsd.org
Received: from mx1.FreeBSD.org (mx1.freebsd.org [216.136.204.125])
	by hub.freebsd.org (Postfix) with ESMTP id 0B4DB16A4CE
	for <fs@freebsd.org>; Tue, 25 Nov 2003 15:58:00 -0800 (PST)
Received: from filer.fsl.cs.sunysb.edu (filer.fsl.cs.sunysb.edu
	[130.245.126.2])
	by mx1.FreeBSD.org (Postfix) with ESMTP id CB8A743F75
	for <fs@freebsd.org>; Tue, 25 Nov 2003 15:57:58 -0800 (PST)
	(envelope-from ezk@fsl.cs.sunysb.edu)
Received: from agora.fsl.cs.sunysb.edu
	(IDENT:YREcBNQLJcxYLWN5xNGmmOApwO5syPLM@agora.fsl.cs.sunysb.edu
	[130.245.126.12])hAPNvbHn032392;	Tue, 25 Nov 2003 18:57:37 -0500
Received: from agora.fsl.cs.sunysb.edu
	(IDENT:Tzbi/6YTV7V/aCD1Ai3EmmDzjzsaIhqo@localhost.localdomain [127.0.0.1])
	hAPNvlg9021313;	Tue, 25 Nov 2003 18:57:47 -0500
Received: (from ezk@localhost)
	by agora.fsl.cs.sunysb.edu (8.12.8/8.12.8/Submit) id hAPNvlGs021309;
	Tue, 25 Nov 2003 18:57:47 -0500
Date: Tue, 25 Nov 2003 18:57:47 -0500
Message-Id: <200311252357.hAPNvlGs021309@agora.fsl.cs.sunysb.edu>
From: Erez Zadok <ezk@cs.sunysb.edu>
To: Ian Dowse <iedowse@maths.tcd.ie>
In-reply-to: Your message of "Tue, 25 Nov 2003 23:38:44 GMT."
             <200311252338.aa05451@salmon.maths.tcd.ie> 
X-MailKey: Erez_Zadok
cc: amd-dev@cs.columbia.edu
cc: Erez Zadok <ezk@cs.sunysb.edu>
cc: fs@freebsd.org
Subject: Re: vnode refcnt bug? 
X-BeenThere: freebsd-fs@freebsd.org
X-Mailman-Version: 2.1.1
Precedence: list
List-Id: Filesystems <freebsd-fs.freebsd.org>
List-Unsubscribe: <http://lists.freebsd.org/mailman/listinfo/freebsd-fs>,
	<mailto:freebsd-fs-request@freebsd.org?subject=unsubscribe>
List-Archive: <http://lists.freebsd.org/pipermail/freebsd-fs>
List-Post: <mailto:freebsd-fs@freebsd.org>
List-Help: <mailto:freebsd-fs-request@freebsd.org?subject=help>
List-Subscribe: <http://lists.freebsd.org/mailman/listinfo/freebsd-fs>,
	<mailto:freebsd-fs-request@freebsd.org?subject=subscribe>
X-List-Received-Date: Tue, 25 Nov 2003 23:58:00 -0000

In message <200311252338.aa05451@salmon.maths.tcd.ie>, Ian Dowse writes:
> In message <200311252122.hAPLMRfE018534@agora.fsl.cs.sunysb.edu>, Erez Zadok writes:
> >Hmmm, yes I think that could be a serious problem (esp. since fbsd doesn't
> >have autofs yet).  And I think it deviates from "norms" where a cwd is
> >essentially occupying a vnode within the mounted f/s and therefore the f/s
> >shouldn't be unmounted!  This is rather bad for users who sit on an nfs mnt
> >point, ls'ing files happily, and then the kernel unmounts the mnt pt, moves
> >their cwd down to the covered (typically empty) vnode, and the poor user's
> >next /bin/ls shows nothing.
> 
> Yes, I agree completely - however the question of what to do with
> references to about-to-be-covered vnodes at mount time still remains.
> I'll have to look in more detail at why the checkdirs() approach
> was needed in the first place to see if simply removing it is an
> option.

If you have a cwd on a lower mnt pt before the mount, I'd say it makes
_some_ sense to move it "up" to the mnt pt (root vnode) of the newly mounted
fs.  This could be very useful for, say, a login shell.

I say "some" b/c I'm concerned about the possibility that some bad process
(rm -rf) that is just started in an empty mnt point, all of a sudden is moved
up to a vnode full of real files, and that process may happily go on to
delete the files in the newly mounted f/s.

Doing the reverse upon unmount (moving the cwd from upper to lower) sounds
even stranger to me.  Why?  B/c the process used to see some files and now
it sees none.  Where did it all go?  This can break applications in all
sorts of unhappy ways.

> Any other approaches I can think of right now for solving this issue
> appear to either extend the original checkdirs() hack, or else just
> replace one kind of undesirable behaviour with another.

My personal philosophy when it comes to a choice b/t several un/desirable
modes of operation is the following:

1. Offer flags/options/whatever for users to pick their desired behavior.

2. Don't break existing "expected" behavior: make that the default mode of
   operation.

3. In some cases, it's desirable to change the default behavior to one of
   the "new modes".  But at least everyone will have a way to get the
   behavior they want.

4. Disadvantage: poor programmers/maintainers have to keep several modes of
   operation working.

The above won't make everyone happy, but it'd maximize the percentage of
happy users.

> Ian

I guess we first need to find out what were the original reasons for the
change in fbsd.  Maybe we can find a way to accommodate the needs for that
change w/o breaking functionality.

Cheers,
Erez.

From owner-freebsd-fs@FreeBSD.ORG  Tue Nov 25 20:28:18 2003
Return-Path: <owner-freebsd-fs@FreeBSD.ORG>
Delivered-To: freebsd-fs@freebsd.org
Received: from mx1.FreeBSD.org (mx1.freebsd.org [216.136.204.125])
	by hub.freebsd.org (Postfix) with ESMTP id 8EEAB16A4CE
	for <freebsd-fs@freebsd.org>; Tue, 25 Nov 2003 20:28:18 -0800 (PST)
Received: from itree.org (tree.caddev.com [24.153.136.6])
	by mx1.FreeBSD.org (Postfix) with ESMTP id 5619543FE0
	for <freebsd-fs@freebsd.org>; Tue, 25 Nov 2003 20:28:15 -0800 (PST)
	(envelope-from treeml@itree.org)
Received: from laptop (user-0cdfduk.cable.mindspring.com [24.215.183.212])
	by itree.org (8.11.6/8.11.6) with SMTP id hAQ4WRX00821
	for <freebsd-fs@freebsd.org>; Tue, 25 Nov 2003 22:32:28 -0600
From: "treeml" <treeml@itree.org>
To: <freebsd-fs@freebsd.org>
Date: Tue, 25 Nov 2003 23:24:18 -0500
Message-ID: <AOEMLAIJIPAIPHJEAHBOGELMDIAA.treeml@itree.org>
MIME-Version: 1.0
Content-Type: text/plain;
	charset="iso-8859-1"
Content-Transfer-Encoding: 7bit
X-Priority: 3 (Normal)
X-MSMail-Priority: Normal
X-Mailer: Microsoft Outlook IMO, Build 9.0.2416 (9.0.2911.0)
X-MimeOLE: Produced By Microsoft MimeOLE V6.00.2800.1165
Importance: Normal
Subject: SEARCH FOR ALTERNATE SUPER-BLOCK FAILED
X-BeenThere: freebsd-fs@freebsd.org
X-Mailman-Version: 2.1.1
Precedence: list
List-Id: Filesystems <freebsd-fs.freebsd.org>
List-Unsubscribe: <http://lists.freebsd.org/mailman/listinfo/freebsd-fs>,
	<mailto:freebsd-fs-request@freebsd.org?subject=unsubscribe>
List-Archive: <http://lists.freebsd.org/pipermail/freebsd-fs>
List-Post: <mailto:freebsd-fs@freebsd.org>
List-Help: <mailto:freebsd-fs-request@freebsd.org?subject=help>
List-Subscribe: <http://lists.freebsd.org/mailman/listinfo/freebsd-fs>,
	<mailto:freebsd-fs-request@freebsd.org?subject=subscribe>
X-List-Received-Date: Wed, 26 Nov 2003 04:28:18 -0000

My machine runs FreeBSD 5.1-RELEASE, with either a UFS or UFS2 filesystem.


I must have switched off the electricity before the machine finished shutting
down.  (I did "shutdown -h now", and waited at least 5 mins before I turned
off the switch.)  Now the /usr partition won't mount.  In the past 24 hrs, I
have looked all over the Internet and tried all the recommendations.  Nothing
seems to work.  The errors I got follow.

The /usr partition is on "/dev/ad0s1f".

 -----------------------------------
-su-2.05b# mount /dev/ad0s1f /mnt/
-------------------------------------


When I try to fsck the partition I get the following errors,

---------------------------------------
bash-2.05b# fsck dev/ad0s1f
** /dev/ad0s1f

CANNOT READ BLK: 114411168
CONTINUE? [yn] y

THE FOLLOWING DISK SECTORS COULD NOT BE READ: 114411168, 114411169,
114411170, 114411171,

LOOK FOR ALTERNATE SUPERBLOCKS? [yn] y

32 is not a file system superblock
SEARCH FOR ALTERNATE SUPER-BLOCK FAILED. YOU MUST USE THE
-b OPTION TO FSCK TO SPECIFY THE LOCATION OF AN ALTERNATE
SUPER-BLOCK TO SUPPLY NEEDED INFORMATION; SEE fsck(8).

bash-2.05b# fsck dev/ad0s1f
** /dev/ad0s1f

CANNOT READ BLK: 114411168
CONTINUE? [yn] y

THE FOLLOWING DISK SECTORS COULD NOT BE READ: 114411168, 114411169,
114411170, 114411171,

LOOK FOR ALTERNATE SUPERBLOCKS? [yn] y

32 is not a file system superblock
SEARCH FOR ALTERNATE SUPER-BLOCK FAILED. YOU MUST USE THE
-b OPTION TO FSCK TO SPECIFY THE LOCATION OF AN ALTERNATE
SUPER-BLOCK TO SUPPLY NEEDED INFORMATION; SEE fsck(8).
---------------------------------------


I have also tried,
---------------------------------------
dd if=/dev/ad0s1f skip=32 of=/dev/ad0s1f seek=16 bs=512 count=16
---------------------------------------

also no luck.


Does anyone know how I can get the partition mounted, or at least partially
recover the data from that partition?

Thanks in advance

Tree


From owner-freebsd-fs@FreeBSD.ORG  Tue Nov 25 21:00:00 2003
Return-Path: <owner-freebsd-fs@FreeBSD.ORG>
Delivered-To: freebsd-fs@freebsd.org
Received: from mx1.FreeBSD.org (mx1.freebsd.org [216.136.204.125])
	by hub.freebsd.org (Postfix) with ESMTP id 7E51216A4CE
	for <fs@freebsd.org>; Tue, 25 Nov 2003 21:00:00 -0800 (PST)
Received: from Daffy.timing.com (mx2.timing.com [206.168.13.218])
	by mx1.FreeBSD.org (Postfix) with ESMTP id 9AD3B43FBF
	for <fs@freebsd.org>; Tue, 25 Nov 2003 20:59:58 -0800 (PST)
	(envelope-from jhein@timing.com)
Received: from gromit.timing.com (gromit.timing.com [206.168.13.209])
	by Daffy.timing.com (8.12.8p2/8.12.8) with ESMTP id hAQ4xjpB036497;
	Tue, 25 Nov 2003 21:59:45 -0700 (MST)
	(envelope-from jhein@timing.com)
Received: from gromit.timing.com (localhost [127.0.0.1])
	by gromit.timing.com (8.12.6p3/8.12.6) with ESMTP id hAQ4xfjh074733;
	Tue, 25 Nov 2003 21:59:41 -0700 (MST)
	(envelope-from jhein@gromit.timing.com)
Received: (from jhein@localhost)
	by gromit.timing.com (8.12.6p3/8.12.6/Submit) id hAQ4xfRw074730;
	Tue, 25 Nov 2003 21:59:41 -0700 (MST)
MIME-Version: 1.0
Content-Type: text/plain; charset=us-ascii
Content-Transfer-Encoding: 7bit
Message-ID: <16324.13117.190129.769195@gromit.timing.com>
Date: Tue, 25 Nov 2003 21:59:41 -0700
X-Mailer: VM 7.17 under Emacs 21.1.1
From: John E Hein <jhein@timing.com>
To: Erez Zadok <ezk@cs.sunysb.edu>
In-Reply-To: <200311252122.hAPLMRfE018534@agora.fsl.cs.sunysb.edu>
References: <200311252107.aa96370@salmon.maths.tcd.ie>
	<200311252122.hAPLMRfE018534@agora.fsl.cs.sunysb.edu>
X-Spam-Status: No, hits=-15.6 required=5.0
	tests=IN_REP_TO,REFERENCES,USER_AGENT_VM
	version=2.50
X-Spam-Checker-Version: SpamAssassin 2.50 (1.173-2003-02-20-exp)
cc: amd-dev@cs.columbia.edu
cc: Ian Dowse <iedowse@maths.tcd.ie>
cc: fs@freebsd.org
Subject: Re: vnode refcnt bug? 
X-BeenThere: freebsd-fs@freebsd.org
X-Mailman-Version: 2.1.1
Precedence: list
List-Id: Filesystems <freebsd-fs.freebsd.org>
List-Unsubscribe: <http://lists.freebsd.org/mailman/listinfo/freebsd-fs>,
	<mailto:freebsd-fs-request@freebsd.org?subject=unsubscribe>
List-Archive: <http://lists.freebsd.org/pipermail/freebsd-fs>
List-Post: <mailto:freebsd-fs@freebsd.org>
List-Help: <mailto:freebsd-fs-request@freebsd.org?subject=help>
List-Subscribe: <http://lists.freebsd.org/mailman/listinfo/freebsd-fs>,
	<mailto:freebsd-fs-request@freebsd.org?subject=subscribe>
X-List-Received-Date: Wed, 26 Nov 2003 05:00:00 -0000

Erez Zadok wrote at 16:22 -0500 on Nov 25:
 > Ian, I'm CC-ing my reply to the am-utils developers mailing list, amd-dev.
 > Let's keep this thread on both fs@ and amd-dev for a bit.
 > 
 > Can the people on amd-dev who noticed this problem please answer Ian's
 > questions?
 > 
 > In message <200311252107.aa96370@salmon.maths.tcd.ie>, Ian Dowse writes:
 > > In message <200311252003.hAPK3Bb9017036@agora.fsl.cs.sunysb.edu>, Erez Zadok writes:
 > > >Please see this short thread of discussion on amd-dev.  I've included two
 > > >messages from this thread.  It suggests that fbsd5 may have a vnode refcount
 > > >bug (a vnode isn't held where it should).
 > > >
 > > >I've not personally investigated this bug.  Has anyone on fs@ come
 > > >across such a possible bug?
 > > 
 > > Hmm, I guess it is caused by checkdirs() in vfs_mount.c moving the
 > > process cwd to the underlying vnode before attempting the unmount.
 > > Does this only happen if the cwd is at the mount point itself?

Yes.  It appears that's the case.  I can force it to happen with amq -u.


 > > When a file system is first mounted, checkdirs() looks for processes
 > > that had a cwd or chroot set to the vnode that is about to be
 > > covered.  It moves these processes to the new mountpoint vnode.
 > > This behaviour goes back a long time (I'm not sure what the reasons
 > > were), but it had the problem that you would get a "Device busy"
 > > error if you attempted to unmount the file system later, and a
 > > forced unmount would leave the process with a stale cwd or chroot
 > > vnode (i.e.  "mount /mnt; umount /mnt" would fail if any processes
 > > previously had a cwd of /mnt, and "mount /mnt; umount -f /mnt" would
 > > cause such processes to lose their reference to the /mnt directory).

No forced umount is necessary.  It just gets unmounted after the amd
timeout if you just sit at your shell prompt and wait (or amq -u).


 > > The reference count checks could be moved to before checkdirs(),
 > > but I think there are cases where the current behaviour is preferable,
 > > so maybe it needs to be an unmount() flag...  BTW, does amd delete
 > > the mountpoint directory after the unmount? That would explain why
 > > the directory goes away entirely.
 > 
 > If Amd created the mount point when it started (say, the mnt pt didn't
 > exist), then Amd will also try to rmdir it upon unmount.

It gets unmounted first.
Then within a minute, it gets deleted.

ls returns nothing (but exit code is 0).
pwd gives:
pwd: .: No such file or directory

From owner-freebsd-fs@FreeBSD.ORG  Wed Nov 26 02:13:42 2003
Return-Path: <owner-freebsd-fs@FreeBSD.ORG>
Delivered-To: freebsd-fs@freebsd.org
Received: from mx1.FreeBSD.org (mx1.freebsd.org [216.136.204.125])
	by hub.freebsd.org (Postfix) with ESMTP id 9EC3F16A4CE
	for <fs@freebsd.org>; Wed, 26 Nov 2003 02:13:42 -0800 (PST)
Received: from salmon.maths.tcd.ie (salmon.maths.tcd.ie [134.226.81.11])
	by mx1.FreeBSD.org (Postfix) with SMTP id 5B2D843FB1
	for <fs@freebsd.org>; Wed, 26 Nov 2003 02:13:41 -0800 (PST)
	(envelope-from iedowse@maths.tcd.ie)
Received: from walton.maths.tcd.ie by salmon.maths.tcd.ie with SMTP
          id <aa21508@salmon>; 26 Nov 2003 10:13:40 +0000 (GMT)
To: Erez Zadok <ezk@cs.sunysb.edu>
In-Reply-To: Your message of "Tue, 25 Nov 2003 18:57:47 EST."
             <200311252357.hAPNvlGs021309@agora.fsl.cs.sunysb.edu> 
Date: Wed, 26 Nov 2003 10:13:39 +0000
From: Ian Dowse <iedowse@maths.tcd.ie>
Message-ID: <200311261013.aa21508@salmon.maths.tcd.ie>
cc: amd-dev@cs.columbia.edu
cc: fs@freebsd.org
Subject: Re: vnode refcnt bug? 
X-BeenThere: freebsd-fs@freebsd.org
X-Mailman-Version: 2.1.1
Precedence: list
List-Id: Filesystems <freebsd-fs.freebsd.org>
List-Unsubscribe: <http://lists.freebsd.org/mailman/listinfo/freebsd-fs>,
	<mailto:freebsd-fs-request@freebsd.org?subject=unsubscribe>
List-Archive: <http://lists.freebsd.org/pipermail/freebsd-fs>
List-Post: <mailto:freebsd-fs@freebsd.org>
List-Help: <mailto:freebsd-fs-request@freebsd.org?subject=help>
List-Subscribe: <http://lists.freebsd.org/mailman/listinfo/freebsd-fs>,
	<mailto:freebsd-fs-request@freebsd.org?subject=subscribe>
X-List-Received-Date: Wed, 26 Nov 2003 10:13:42 -0000

In message <200311252357.hAPNvlGs021309@agora.fsl.cs.sunysb.edu>, Erez Zadok writes:
>If you have a cwd on a lower mnt pt before the mount, I'd say it makes
>_some_ sense to move it "up" to the mnt pt (root vnode) of the newly mounted
>fs.  This could be very useful for, say, a login shell.
>
>I say "some" b/c I'm concerned about the possibility that some bad process
>(rm -rf) that is just started in an empty mnt point, all of a sudden is moved
>up to a vnode full of real files, and that process may happily go on to
>delete the files in the newly mounted f/s.
>
>Doing the reverse upon unmount (moving the cwd from upper to lower) sounds
>even stranger to me.  Why?  B/c the process used to see some files and now
>it sees none.  Where did it all go?  This can break applications in all
>sorts of unhappy ways.

Whether or not checkdirs() is retained, I think it is just good
practice to undo at unmount time anything that was done when the
filesystem was mounted. An obvious case is if you accidentally mount
a file system in the wrong place or make the common mistake of
typing "mount -a" when there are NFS entries in fstab that are
already mounted. Without the unmount-time checkdirs call, this is
an operation that cannot be undone because any processes that had
a cwd of the covered vnode before the mount will lose their cwd
entirely if you unmount it.

There were also some obscure cases involving booting from CD and
then mounting the real root filesystem directly over /. If you
unmount it later, all processes would lose their fd_rdir references
to /, so they suddenly become chrooted into a dead vnode even though
their original root directory on the CD root still exists.

Anyway, I think the best solution for now is to make the checkdirs()
at unmount time conditional on the MNT_FORCE flag. This should fix
amd's EBUSY detection while still making it possible to fully undo
the effects of a mount operation. The change is fairly trivial, so
I'll see if I can get something committed before 5.2 is released.
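
A toy userland model of the proposal above, for anyone who wants to poke at
the intended behaviour. This is not kernel code: every name in it (toy_proc,
toy_checkdirs, toy_unmount, TOY_MNT_FORCE) is hypothetical and only mirrors
the control flow described in this thread. Without the force flag, a cwd on
the mounted root keeps the unmount busy; with it, the cwd is first moved back
to the covered vnode.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>

#define TOY_MNT_FORCE 0x1	/* hypothetical stand-in for MNT_FORCE */

struct toy_proc {
	int cwd;		/* toy vnode id of the current directory */
};

/* Move every process whose cwd is `from` over to `to`; return how many moved. */
static size_t
toy_checkdirs(struct toy_proc *procs, size_t n, int from, int to)
{
	size_t i, moved = 0;

	for (i = 0; i < n; i++) {
		if (procs[i].cwd == from) {
			procs[i].cwd = to;
			moved++;
		}
	}
	return (moved);
}

/*
 * Unmount the file system whose root is `root_vn`, mounted over `covered_vn`.
 * Returns 0 on success, or EBUSY if a process still sits on the root vnode.
 */
static int
toy_unmount(struct toy_proc *procs, size_t n, int root_vn, int covered_vn,
    int flags)
{
	size_t i;

	assert(procs != NULL || n == 0);
	/* Only a forced unmount is allowed to relocate processes first. */
	if (flags & TOY_MNT_FORCE)
		(void)toy_checkdirs(procs, n, root_vn, covered_vn);
	for (i = 0; i < n; i++) {
		if (procs[i].cwd == root_vn)
			return (EBUSY);
	}
	return (0);
}
```

With two processes whose cwds are vnodes 10 and 20, toy_unmount(ps, 2, 20, 5, 0)
returns EBUSY and leaves both alone, which is the answer amd depends on.
Passing TOY_MNT_FORCE instead moves the second process to vnode 5 and
returns 0.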

Ian

From owner-freebsd-fs@FreeBSD.ORG  Wed Nov 26 10:35:25 2003
Return-Path: <owner-freebsd-fs@FreeBSD.ORG>
Delivered-To: freebsd-fs@freebsd.org
Received: from mx1.FreeBSD.org (mx1.freebsd.org [216.136.204.125])
	by hub.freebsd.org (Postfix) with ESMTP id CDE6D16A4CE
	for <fs@freebsd.org>; Wed, 26 Nov 2003 10:35:25 -0800 (PST)
Received: from filer.fsl.cs.sunysb.edu (filer.fsl.cs.sunysb.edu
	[130.245.126.2])
	by mx1.FreeBSD.org (Postfix) with ESMTP id 85E1043FB1
	for <fs@freebsd.org>; Wed, 26 Nov 2003 10:35:24 -0800 (PST)
	(envelope-from ezk@fsl.cs.sunysb.edu)
Received: from agora.fsl.cs.sunysb.edu
	(IDENT:aU2y3hMo8H4hncdHVTSY7dwMiY1HEzym@agora.fsl.cs.sunysb.edu
	[130.245.126.12])hAQIYuHn017875;	Wed, 26 Nov 2003 13:34:56 -0500
Received: from agora.fsl.cs.sunysb.edu
	(IDENT:aDFs0EZBvK52o3HkWz/NHRiWiX5FjqCp@localhost.localdomain [127.0.0.1])
	hAQIZ8g9002674;	Wed, 26 Nov 2003 13:35:08 -0500
Received: (from ezk@localhost)
	by agora.fsl.cs.sunysb.edu (8.12.8/8.12.8/Submit) id hAQIZ8E0002670;
	Wed, 26 Nov 2003 13:35:08 -0500
Date: Wed, 26 Nov 2003 13:35:08 -0500
Message-Id: <200311261835.hAQIZ8E0002670@agora.fsl.cs.sunysb.edu>
From: Erez Zadok <ezk@cs.sunysb.edu>
To: Ian Dowse <iedowse@maths.tcd.ie>
In-reply-to: Your message of "Wed, 26 Nov 2003 10:13:39 GMT."
             <200311261013.aa21508@salmon.maths.tcd.ie> 
X-MailKey: Erez_Zadok
cc: amd-dev@cs.columbia.edu
cc: Erez Zadok <ezk@cs.sunysb.edu>
cc: fs@freebsd.org
Subject: Re: vnode refcnt bug? 
X-BeenThere: freebsd-fs@freebsd.org
X-Mailman-Version: 2.1.1
Precedence: list
List-Id: Filesystems <freebsd-fs.freebsd.org>
List-Unsubscribe: <http://lists.freebsd.org/mailman/listinfo/freebsd-fs>,
	<mailto:freebsd-fs-request@freebsd.org?subject=unsubscribe>
List-Archive: <http://lists.freebsd.org/pipermail/freebsd-fs>
List-Post: <mailto:freebsd-fs@freebsd.org>
List-Help: <mailto:freebsd-fs-request@freebsd.org?subject=help>
List-Subscribe: <http://lists.freebsd.org/mailman/listinfo/freebsd-fs>,
	<mailto:freebsd-fs-request@freebsd.org?subject=subscribe>
X-List-Received-Date: Wed, 26 Nov 2003 18:35:26 -0000

In message <200311261013.aa21508@salmon.maths.tcd.ie>, Ian Dowse writes:
> In message <200311252357.hAPNvlGs021309@agora.fsl.cs.sunysb.edu>, Erez Zadok writes:
> >If you have a cwd on a lower mnt pt before the mount, I'd say it makes
> >_some_ sense to move it "up" to the mnt pt (root vnode) of the newly mounted
> >fs.  This could be very useful for, say, a login shell.
> >
> >I say "some" b/c I'm concerned about the possibility that some bad process
> >(rm -rf) that is just started in an empty mnt point, all of a sudden is moved
> >up to a vnode full of real files, and that process may happily go on to
> >delete the files in the newly mounted f/s.
> >
> >Doing the reverse upon unmount (moving the cwd from upper to lower) sounds
> >even stranger to me.  Why?  B/c the process used to see some files and now
> >it sees none.  Where did it all go?  This can break applications in all
> >sorts of unhappy ways.
> 
> Whether or not checkdirs() is retained, I think it is just good
> practice to undo at unmount time anything that was done when the
> filesystem was mounted. An obvious case is if you accidentally mount
> a file system in the wrong place or make the common mistake of
> typing "mount -a" when there are NFS entries in fstab that are
> already mounted. Without the unmount-time checkdirs call, this is
> an operation that cannot be undone because any processes that had
> a cwd of the covered vnode before the mount will lose their cwd
> entirely if you unmount it.

If you accidentally mount something in the wrong place, you should be able
to umount it quickly thereafter; the chance that some new process comes
along and "sits" on your cwd is rather rare.  And if it happens, you can
lsof and kill it, then umount just fine.

I don't understand why a "mount -a" would re-mount existing stuff like
already-mounted NFS volumes.  Does it?  It shouldn't IMHO.

I agree w/ you that umount should undo anything that a mount did, but I
think you may be allowing a mount to proceed in cases that it shouldn't have
succeeded; so you first "get yourself in trouble" and then try to find a way
to undo it. :-)

> There were also some obscure cases involving booting from CD and
> then mounting the real root filesystem directly over /. If you
> unmount it later, all processes would lose their fd_rdir references
> to /, so they suddenly become chrooted into a dead vnode even though
> their original root directory on the CD root still exists.

OK, but "obscure cases" shouldn't IMHO change default common behavior.  Make
the default case the more common one, the one that will be used by most
users.  You went ahead and changed important behavior for a minority of
users.

> Anyway, I think the best solution for now is to make the checkdirs()
> at unmount time conditional on the MNT_FORCE flag. This should fix
> amd's EBUSY detection while still making it possible to fully undo
> the effects of a mount operation. The change is fairly trivial, so
> I'll see if I can get something committed before 5.2 is released.

Thanks.  That'd help.  I would also hope that the existing cwd-migrating
behavior will become the one that someone has to trigger using MNT_FORCE;
that is, please make the default behavior be the old behavior (EBUSY and
such).  Anyone who really wants the new behavior should use MNT_FORCE (I
assume there's a flag for it in umount(8) also.)

> Ian

Cheers,
Erez.

From owner-freebsd-fs@FreeBSD.ORG  Thu Nov 27 15:26:48 2003
Return-Path: <owner-freebsd-fs@FreeBSD.ORG>
Delivered-To: freebsd-fs@freebsd.org
Received: from mx1.FreeBSD.org (mx1.freebsd.org [216.136.204.125])
	by hub.freebsd.org (Postfix) with ESMTP id 6BCA716A4CE
	for <freebsd-fs@freebsd.org>; Thu, 27 Nov 2003 15:26:48 -0800 (PST)
Received: from mtl.alis.com (mtl.alis.com [199.84.165.71])
	by mx1.FreeBSD.org (Postfix) with ESMTP id 2289543FBD
	for <freebsd-fs@freebsd.org>; Thu, 27 Nov 2003 15:26:43 -0800 (PST)
	(envelope-from vgoupil@alis.com)
Received: from alis-2k.alis.domain (alis-2k.alis.com [199.84.165.130])
	by mtl.alis.com (8.12.8p2/8.12.8) with ESMTP id hARNQfsv073538
	for <freebsd-fs@freebsd.org>; Thu, 27 Nov 2003 18:26:41 -0500 (EST)
	(envelope-from vgoupil@alis.com)
Received: by alis-2k.alis.domain with Internet Mail Service (5.5.2653.19)
	id <WY5948H7>; Thu, 27 Nov 2003 18:26:41 -0500
Message-ID: <F7D4BDA0E5A1D14B99D32C022AEB7366FE10D4@alis-2k.alis.domain>
From: Vincent Goupil <vgoupil@alis.com>
To: "'freebsd-fs@freebsd.org'" <freebsd-fs@freebsd.org>
Date: Thu, 27 Nov 2003 18:26:32 -0500
MIME-Version: 1.0
X-Mailer: Internet Mail Service (5.5.2653.19)
Content-Type: text/plain;
	charset="ISO-8859-1"
X-Spam-Checker-Version: SpamAssassin 2.53 (1.174.2.15-2003-03-30-exp)
Subject: mfs is getting full (/etc/rc.diskless2)
X-BeenThere: freebsd-fs@freebsd.org
X-Mailman-Version: 2.1.1
Precedence: list
List-Id: Filesystems <freebsd-fs.freebsd.org>
List-Unsubscribe: <http://lists.freebsd.org/mailman/listinfo/freebsd-fs>,
	<mailto:freebsd-fs-request@freebsd.org?subject=unsubscribe>
List-Archive: <http://lists.freebsd.org/pipermail/freebsd-fs>
List-Post: <mailto:freebsd-fs@freebsd.org>
List-Help: <mailto:freebsd-fs-request@freebsd.org?subject=help>
List-Subscribe: <http://lists.freebsd.org/mailman/listinfo/freebsd-fs>,
	<mailto:freebsd-fs-request@freebsd.org?subject=subscribe>
X-List-Received-Date: Thu, 27 Nov 2003 23:26:48 -0000

Hi,

I've set up a firewall with a compact flash instead of a hard drive.  This is
the output of mount:
/dev/ad0s2a on / (ufs, local, read-only)
mfs:17 on /var (mfs, asynchronous, local)
procfs on /proc (procfs, local)
mfs:36 on /dev (mfs, asynchronous, local)

As you see, I mount the compact flash read-only and I set up a memory
filesystem for /var.

In my rc.conf file:
diskless_mount="/etc/rc.diskless2"
varsize="131072"

Output of: df
Filesystem  1K-blocks   Used Avail Capacity  Mounted on
/dev/ad0s2a    229942 197570 13978    93%    /
mfs:17          63471  56584  1810    97%    /var
procfs              4      4     0   100%    /proc
mfs:36           1503     66  1317     5%    /dev

Output of: df -h
Filesystem    Size   Used  Avail Capacity  Mounted on
/dev/ad0s2a   225M   193M    14M    93%    /
mfs:17         62M    55M   1.8M    97%    /var
procfs        4.0K   4.0K     0B   100%    /proc
mfs:36        1.5M    66K   1.3M     5%    /dev

Output of: du -h -d 1 /var
364K    /var/db
1.0K    /var/account
3.0K    /var/at
1.0K    /var/backups
1.0K    /var/crash
2.0K    /var/cron
1.0K    /var/empty
5.0K    /var/games
1.0K    /var/heimdal
3.2M    /var/log
 31K    /var/mail
2.0K    /var/msgs
1.0K    /var/preserve
 47K    /var/run
1.0K    /var/rwho
 16K    /var/spool
3.0K    /var/tmp
1.0K    /var/yp
1.6M    /var/mrtg
2.0K    /var/ucd-snmp
5.3M    /var

Output of: du -d 1 /var
364     /var/db
1       /var/account
3       /var/at
1       /var/backups
1       /var/crash
2       /var/cron
1       /var/empty
5       /var/games
1       /var/heimdal
3299    /var/log
31      /var/mail
2       /var/msgs
1       /var/preserve
47      /var/run
1       /var/rwho
16      /var/spool
3       /var/tmp
1       /var/yp
1670    /var/mrtg
2       /var/ucd-snmp
5456    /var

There seems to be a big difference between the output of df and du (I know, I
read
http://www.freebsd.org/doc/en_US.ISO8859-1/books/faq/disks.html#DU-VS-DF ),
but it's not explaining everything.  It could be the way I set it up.

My problem is that my /var partition is filling up very quickly and I don't
know why.  I don't know what to clean.  I've already deleted some logs, but
that freed only 2% of the space, or 1000 blocks.

I don't know what is taking all this space.  Any ideas?

Vincent Goupil

From owner-freebsd-fs@FreeBSD.ORG  Thu Nov 27 18:32:50 2003
Return-Path: <owner-freebsd-fs@FreeBSD.ORG>
Delivered-To: freebsd-fs@freebsd.org
Received: from mx1.FreeBSD.org (mx1.freebsd.org [216.136.204.125])
	by hub.freebsd.org (Postfix) with ESMTP id 8630C16A4CF
	for <freebsd-fs@freebsd.org>; Thu, 27 Nov 2003 18:32:50 -0800 (PST)
Received: from bilver.wjv.com (user38.net339.fl.sprint-hsd.net [65.40.24.38])
	by mx1.FreeBSD.org (Postfix) with ESMTP id 065DF43FE0
	for <freebsd-fs@freebsd.org>; Thu, 27 Nov 2003 18:32:48 -0800 (PST)
	(envelope-from bv@bilver.wjv.com)
Received: from bilver.wjv.com (localhost.wjv.com [127.0.0.1])
	by bilver.wjv.com (8.12.10/8.12.10) with ESMTP id hAS2Wjm7061257
	for <freebsd-fs@freebsd.org>; Thu, 27 Nov 2003 21:32:45 -0500 (EST)
	(envelope-from bv@bilver.wjv.com)
Received: (from bv@localhost)
	by bilver.wjv.com (8.12.10/8.12.10/Submit) id hAS2Wjjl061256
	for freebsd-fs@freebsd.org; Thu, 27 Nov 2003 21:32:45 -0500 (EST)
	(envelope-from bv)
Date: Thu, 27 Nov 2003 21:32:45 -0500
From: Bill Vermillion <bv@wjv.com>
To: freebsd-fs@freebsd.org
Message-ID: <20031128023245.GA61208@wjv.com>
References: <F7D4BDA0E5A1D14B99D32C022AEB7366FE10D4@alis-2k.alis.domain>
Mime-Version: 1.0
Content-Type: text/plain; charset=us-ascii
Content-Disposition: inline
In-Reply-To: <F7D4BDA0E5A1D14B99D32C022AEB7366FE10D4@alis-2k.alis.domain>
Organization: W.J.Vermillion / Orlando - Winter Park
ReplyTo: bv@wjv.com
User-Agent: Mutt/1.5.4i
X-Spam-Status: No, hits=0.0 required=5.0 tests=none autolearn=no version=2.60
X-Spam-Checker-Version: SpamAssassin 2.60 (1.212-2003-09-23-exp) on 
	bilver.wjv.com
Subject: Re: mfs is getting full (/etc/rc.diskless2)
X-BeenThere: freebsd-fs@freebsd.org
X-Mailman-Version: 2.1.1
Precedence: list
Reply-To: bv@wjv.com
List-Id: Filesystems <freebsd-fs.freebsd.org>
List-Unsubscribe: <http://lists.freebsd.org/mailman/listinfo/freebsd-fs>,
	<mailto:freebsd-fs-request@freebsd.org?subject=unsubscribe>
List-Archive: <http://lists.freebsd.org/pipermail/freebsd-fs>
List-Post: <mailto:freebsd-fs@freebsd.org>
List-Help: <mailto:freebsd-fs-request@freebsd.org?subject=help>
List-Subscribe: <http://lists.freebsd.org/mailman/listinfo/freebsd-fs>,
	<mailto:freebsd-fs-request@freebsd.org?subject=subscribe>
X-List-Received-Date: Fri, 28 Nov 2003 02:32:50 -0000

Earlier in the linear time track, on approximately Thu, Nov 27, 2003 at 18:26 ,
Vincent Goupil divulged this public information:


> I've set up a firewall with a compact flash instead of a hard drive.  This is
> the output of mount:
> /dev/ad0s2a on / (ufs, local, read-only)
> mfs:17 on /var (mfs, asynchronous, local)
> procfs on /proc (procfs, local)
> mfs:36 on /dev (mfs, asynchronous, local)

> As you see, I mount the compact flash read-only and I set up a memory
> filesystem for /var.

> In my rc.conf file:
> diskless_mount="/etc/rc.diskless2"
> varsize="131072"

> Output of: df
> Filesystem  1K-blocks   Used Avail Capacity  Mounted on
> /dev/ad0s2a    229942 197570 13978    93%    /
> mfs:17          63471  56584  1810    97%    /var
> procfs              4      4     0   100%    /proc
> mfs:36           1503     66  1317     5%    /dev
> 
> Output of: df -h
> Filesystem    Size   Used  Avail Capacity  Mounted on
> /dev/ad0s2a   225M   193M    14M    93%    /
> mfs:17         62M    55M   1.8M    97%    /var
> procfs        4.0K   4.0K     0B   100%    /proc
> mfs:36        1.5M    66K   1.3M     5%    /dev
> 
> Output of: du -h -d 1 /var
> 364K    /var/db
> 1.0K    /var/account
> 3.0K    /var/at
> 1.0K    /var/backups
> 1.0K    /var/crash
> 2.0K    /var/cron
> 1.0K    /var/empty
> 5.0K    /var/games
> 1.0K    /var/heimdal
> 3.2M    /var/log
>  31K    /var/mail
> 2.0K    /var/msgs
> 1.0K    /var/preserve
>  47K    /var/run
> 1.0K    /var/rwho
>  16K    /var/spool
> 3.0K    /var/tmp
> 1.0K    /var/yp
> 1.6M    /var/mrtg
> 2.0K    /var/ucd-snmp
> 5.3M    /var
> 
> Output of: du -d 1 /var
> 364     /var/db
> 1       /var/account
> 3       /var/at
> 1       /var/backups
> 1       /var/crash
> 2       /var/cron
> 1       /var/empty
> 5       /var/games
> 1       /var/heimdal
> 3299    /var/log
> 31      /var/mail
> 2       /var/msgs
> 1       /var/preserve
> 47      /var/run
> 1       /var/rwho
> 16      /var/spool
> 3       /var/tmp
> 1       /var/yp
> 1670    /var/mrtg
> 2       /var/ucd-snmp
> 5456    /var

> There seems to be a big difference between
> the output of df and du (I know, I read
> http://www.freebsd.org/doc/en_US.ISO8859-1/books/faq/disks.html#DU-VS-DF ),
> but it's not explaining everything. It could be the
> way I set it up.

> My problem is that my /var partition is filling up very quickly
> and I don't know why. I don't know what to clean. I've already
> deleted some logs, but that freed only 2% of the space, or 1000
> blocks.

> I don't know what is taking all this space.  Any ideas?

It sounds like you deleted some log file that the system keeps
open.  So it will keep using up disk space even though the name is
gone.  A file's blocks are not freed until the last link is gone and
the last open descriptor is closed, so if the file is held open for
logging by a program that never releases it, that is your problem.

At that point the easiest way is to reboot.    Then find out what
you are logging and stop the things you don't need.  When a log
gets full DO NOT remove it.  Null it out. 

Just doing    > <logfilename>     should empty the log, reset
the pointer back to the start of the file, and release all blocks in
use.
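
A small sketch of the effect described above, assuming a POSIX system. The
helper name probe_unlinked_log and the /tmp template are made up for
illustration. It writes to a temporary file, unlinks it while the descriptor
is still open, and records what fstat(2) reports before and after truncating
through that same descriptor.

```c
#define _POSIX_C_SOURCE 200809L
#include <assert.h>
#include <stdlib.h>
#include <string.h>
#include <sys/stat.h>
#include <sys/types.h>
#include <unistd.h>

struct log_probe {
	long links_after_unlink;	/* st_nlink once the name is gone */
	long size_after_unlink;		/* size the open file still has */
	long size_after_truncate;	/* size after ftruncate(fd, 0) */
};

/* Returns 0 on success, -1 on any system call failure. */
static int
probe_unlinked_log(size_t nbytes, struct log_probe *out)
{
	char path[] = "/tmp/logprobeXXXXXX";	/* hypothetical template */
	char chunk[512];
	struct stat st;
	size_t left = nbytes;
	int fd, rc = -1;

	assert(out != NULL);
	memset(chunk, 'x', sizeof(chunk));
	if ((fd = mkstemp(path)) == -1)
		return (-1);
	while (left > 0) {
		size_t want = left < sizeof(chunk) ? left : sizeof(chunk);
		ssize_t got = write(fd, chunk, want);

		if (got <= 0)
			goto done;
		left -= (size_t)got;
	}
	if (unlink(path) == -1)		/* "rm logfile" while it is still open */
		goto done;
	path[0] = '\0';			/* the name is gone, nothing to clean up */
	if (fstat(fd, &st) == -1)
		goto done;
	out->links_after_unlink = (long)st.st_nlink;
	out->size_after_unlink = (long)st.st_size;
	if (ftruncate(fd, 0) == -1)	/* truncate instead of removing */
		goto done;
	if (fstat(fd, &st) == -1)
		goto done;
	out->size_after_truncate = (long)st.st_size;
	rc = 0;
done:
	if (path[0] != '\0')
		(void)unlink(path);
	(void)close(fd);
	return (rc);
}
```

On a typical local file system, a call such as probe_unlinked_log(4096, &p)
should show a link count of 0 with the full 4096 bytes still in place after
the unlink, and a size of 0 only after the truncate.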

Bill
-- 
Bill Vermillion - bv @ wjv . com

From owner-freebsd-fs@FreeBSD.ORG  Fri Nov 28 03:07:30 2003
Return-Path: <owner-freebsd-fs@FreeBSD.ORG>
Delivered-To: freebsd-fs@freebsd.org
Received: from mx1.FreeBSD.org (mx1.freebsd.org [216.136.204.125])
	by hub.freebsd.org (Postfix) with ESMTP id 8DBC216A4CE
	for <freebsd-fs@FreeBSD.org>; Fri, 28 Nov 2003 03:07:30 -0800 (PST)
Received: from gw.catspoiler.org (217-ip-163.nccn.net [209.79.217.163])
	by mx1.FreeBSD.org (Postfix) with ESMTP id 1653B43FE3
	for <freebsd-fs@FreeBSD.org>; Fri, 28 Nov 2003 03:07:29 -0800 (PST)
	(envelope-from truckman@FreeBSD.org)
Received: from FreeBSD.org (mousie.catspoiler.org [192.168.101.2])
	by gw.catspoiler.org (8.12.9p2/8.12.9) with ESMTP id hASB7BeF017371;
	Fri, 28 Nov 2003 03:07:15 -0800 (PST)
	(envelope-from truckman@FreeBSD.org)
Message-Id: <200311281107.hASB7BeF017371@gw.catspoiler.org>
Date: Fri, 28 Nov 2003 03:07:11 -0800 (PST)
From: Don Lewis <truckman@FreeBSD.org>
To: kmarx@vicor.com
In-Reply-To: <3FC27F98.8090801@vicor.com>
MIME-Version: 1.0
Content-Type: TEXT/plain; charset=us-ascii
cc: freebsd-fs@FreeBSD.org
cc: mckusick@beastie.mckusick.com
Subject: Re: 4.8 ffs_dirpref problem
X-BeenThere: freebsd-fs@freebsd.org
X-Mailman-Version: 2.1.1
Precedence: list
List-Id: Filesystems <freebsd-fs.freebsd.org>
List-Unsubscribe: <http://lists.freebsd.org/mailman/listinfo/freebsd-fs>,
	<mailto:freebsd-fs-request@freebsd.org?subject=unsubscribe>
List-Archive: <http://lists.freebsd.org/pipermail/freebsd-fs>
List-Post: <mailto:freebsd-fs@freebsd.org>
List-Help: <mailto:freebsd-fs-request@freebsd.org?subject=help>
List-Subscribe: <http://lists.freebsd.org/mailman/listinfo/freebsd-fs>,
	<mailto:freebsd-fs-request@freebsd.org?subject=subscribe>
X-List-Received-Date: Fri, 28 Nov 2003 11:07:30 -0000

On 24 Nov, Ken Marx wrote:
> 
> 
> Don Lewis wrote:

>> Index: sys/kern/vfs_bio.c
>> ===================================================================
>> RCS file: /home/ncvs/src/sys/kern/vfs_bio.c,v
>> retrieving revision 1.242.2.21
>> diff -u -r1.242.2.21 vfs_bio.c
>> --- sys/kern/vfs_bio.c	9 Aug 2003 16:21:19 -0000	1.242.2.21
>> +++ sys/kern/vfs_bio.c	18 Nov 2003 02:10:55 -0000
>> @@ -140,6 +140,7 @@
>>  	&bufreusecnt, 0, "");
>>  
>>  static int bufhashmask;
>> +static int bufhashshift;
>>  static LIST_HEAD(bufhashhdr, buf) *bufhashtbl, invalhash;
>>  struct bqueues bufqueues[BUFFER_QUEUES] = { { 0 } };
>>  char *buf_wmesg = BUF_WMESG;
>> @@ -160,7 +161,20 @@
>>  struct bufhashhdr *
>>  bufhash(struct vnode *vnp, daddr_t bn)
>>  {
>> -	return(&bufhashtbl[(((uintptr_t)(vnp) >> 7) + (int)bn) & bufhashmask]);
>> +	u_int64_t hashkey64;
>> +	int hashkey; 
>> +	
>> +	/*
>> +	 * Fibonacci hash, see Knuth's
>> +	 * _Art of Computer Programming, Volume 3 / Sorting and Searching_
>> +	 *
>> +         * We reduce the argument to 32 bits before doing the hash to
>> +	 * avoid the need for a slow 64x64 multiply on 32 bit platforms.
>> +	 */
>> +	hashkey64 = (u_int64_t)(uintptr_t)vnp + (u_int64_t)bn;
>> +	hashkey = (((u_int32_t)(hashkey64 + (hashkey64 >> 32)) * 2654435769u) >>
>> +	    bufhashshift) & bufhashmask;
>> +	return(&bufhashtbl[hashkey]);
>>  }
>>  
>>  /*
>> @@ -319,8 +333,9 @@
>>  bufhashinit(caddr_t vaddr)
>>  {
>>  	/* first, make a null hash table */
>> +	bufhashshift = 29;
>>  	for (bufhashmask = 8; bufhashmask < nbuf / 4; bufhashmask <<= 1)
>> -		;
>> +		bufhashshift--;
>>  	bufhashtbl = (void *)vaddr;
>>  	vaddr = vaddr + sizeof(*bufhashtbl) * bufhashmask;
>>  	--bufhashmask;
>> 
>> 
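
For readers who want to try the hash outside the kernel, here is a standalone
sketch of the same Fibonacci hashing idea. It is not the patch itself: fold64
and fib_hash are made-up names, and passing the table size as a bit count
(rather than keeping bufhashshift and bufhashmask) is a simplification.

```c
#include <assert.h>
#include <stdint.h>

/* floor(2^32 / golden ratio), the multiplier used in the diff above */
#define FIB_MULT 2654435769u

/* Fold a 64-bit key to 32 bits by adding its high half to its low half. */
static uint32_t
fold64(uint64_t key)
{
	return ((uint32_t)(key + (key >> 32)));
}

/*
 * Map `key` to a bucket index in a table of 2^bits entries by keeping the
 * top `bits` bits of the 32-bit product.
 */
static uint32_t
fib_hash(uint64_t key, unsigned bits)
{
	assert(bits >= 1 && bits <= 31);
	return ((uint32_t)(fold64(key) * FIB_MULT) >> (32 - bits));
}
```

For example, fib_hash(1, 10) keeps the top 10 bits of 0x9E3779B9, which is
632, and fib_hash(1, 3) keeps the top 3 bits, which is 4. Because the index
comes from the high bits of the product, nearby keys tend to spread across
the table instead of piling into neighbouring buckets.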
> 
> Well, I'm mildly beflummoxed - I tried to compare hashtable performance
> between all three known versions of the hashing - legacy power of 2,
> the Vicor ^= hash, and Don's fibonacci hash.
> 
> Running with 
> 
>        minifree = max( 1, avgifree / 4 );
>        minbfree = max( 1, avgbfree );
> 
> all perform about the same, with no performance problems all
> the way up to 100% disk capacity (didn't test into reserved space).
> 
> Looking at instrumentation to show freq and avg depth of the
> hash buckets, everything seems very calm (mainly because
> we're not hitting the linear searching very often, I'd presume).
> 
> I can't explain why I seemingly got performance problems
> with similar (identical) minbfree code previously.
> 
> So, out of spite, I went back to 
> 
> 	minbfree = max( 1, avgbfree/4 );
> 
> This does hit the hashtable harder for the legacy version
> and not so much for either new flavor. Here are a few
> samplings of calling my dump routine from the debugger.
> "avgdepth" really means 'search depth' since we use
> the depth reached after finding a bp in gbincore.
> 
> The line below such as,
> 
> 	 0: avgdepth[1] cnt=801
> 
> means that 801 of the hashtable buckets had an avg search
> depth of 1 at the time the debug routine was called.
> The 'N:' prefix means the N-th unique non-zero such value.
> So large cnt's for small []'d depth values means an efficient hash.
> 
> I've edited out the details as much as possible.
> 
> LEGACY:
> --------
> Nov 24 13:34:54 oos0b /kernel: bh[442/0x1ba]: freq=2706110, avgdepth = 154
> ...
> Nov 24 13:34:54 oos0b /kernel: 0: avgdepth[1] cnt=1015
> Nov 24 13:34:54 oos0b /kernel: 1: avgdepth[2] cnt=7
> Nov 24 13:34:54 oos0b /kernel: 2: avgdepth[154] cnt=1	<- !!
> Nov 24 13:34:54 oos0b /kernel: 3: avgdepth[3] cnt=1
>  -----------
> 
> Nov 24 13:36:49 oos0b /kernel: bh[442/0x1ba]: freq=3416953, avgdepth = 141
> ...
> Nov 24 13:36:49 oos0b /kernel: 0: avgdepth[1] cnt=1017
> Nov 24 13:36:49 oos0b /kernel: 1: avgdepth[141] cnt=1
> Nov 24 13:36:49 oos0b /kernel: 2: avgdepth[2] cnt=6
> 
> VICOR x-or hashtable:
> ---------------------
> Nov 24 13:07:24 oos0b /kernel: 0: avgdepth[1] cnt=762
> Nov 24 13:07:24 oos0b /kernel: 1: avgdepth[2] cnt=259
> Nov 24 13:07:24 oos0b /kernel: 2: avgdepth[3] cnt=3
>  -----------
> 
> Nov 24 13:08:07 oos0b /kernel: 0: avgdepth[1] cnt=744
> Nov 24 13:08:07 oos0b /kernel: 1: avgdepth[2] cnt=275
> Nov 24 13:08:07 oos0b /kernel: 2: avgdepth[3] cnt=5
> 
> FIBONACCI:
> ----------
> Nov 24 11:56:50 oos0b /kernel: 0: avgdepth[1] cnt=811
> Nov 24 11:56:50 oos0b /kernel: 1: avgdepth[3] cnt=88
> Nov 24 11:56:50 oos0b /kernel: 2: avgdepth[2] cnt=124
> Nov 24 11:56:50 oos0b /kernel: 3: avgdepth[0] cnt=1
>  -----------
> 
> Nov 24 11:57:48 oos0b /kernel: 0: avgdepth[1] cnt=801
> Nov 24 11:57:48 oos0b /kernel: 1: avgdepth[3] cnt=93
> Nov 24 11:57:48 oos0b /kernel: 2: avgdepth[2] cnt=130
> 
> So, while this is far from analytically exhaustive,
> it almost appears the fibonacci hash has more entries
> of depth 3, while the Vicor one has more at depth 2.
> 
> I'm happy to run more tests if you have ideas. I'm also fine
> to cut bait and go with whatever you decide. It *seems* like
> putting in the fibonacci hash is prudent since the current hash
> has been observed to be expensive. I had trouble proving this
> unequivocally though. So, perhaps Don's minbfree fix is sufficient
> after all. I'm tempted at this point to go with the 100% flavor.

I think we're running into one of the weaknesses in the Fibonacci hash.
There are a large number of hash entries for the cylinder group blocks,
which are located at offsets which are multiples of 89 * 2^10 in your
example, or something on the order of 2^16.  The effect of this is for
the cylinder group number to be hashed using the least significant bits
of the hash multiplier, which don't work as well for distributing the
hash values.  I tried some of Knuth's suggestions, and got better
results with the hash multiplier 0x9E376DB1u.  The most significant 16
bits of the multiplier are the same as the original constant, and the
least significant bits act as a fraction in the desirable range of 1/3
to 3/7.  Please give this new hash multiplier a try.
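
The two properties claimed for the new multiplier can be checked with a
few lines of C.  This is only an illustrative sketch, not part of the
patch; the helper names (mult_hi16, mult_lo16, lo16_in_range) are made
up for the example.

```c
#include <assert.h>
#include <stdint.h>

#define OLD_MULT 0x9E3779B9u	/* 2654435769, the constant in the first patch */
#define NEW_MULT 0x9E376DB1u	/* the multiplier suggested above */

/* Upper 16 bits of a 32-bit multiplier (hypothetical helper). */
static uint32_t
mult_hi16(uint32_t m)
{
	return (m >> 16);
}

/* Lower 16 bits of a 32-bit multiplier (hypothetical helper). */
static uint32_t
mult_lo16(uint32_t m)
{
	return (m & 0xFFFFu);
}

/*
 * Return 1 if lo / 2^16 lies strictly between 1/3 and 3/7.
 * Integer arithmetic only:
 *   1/3 < lo/65536  <=>  3 * lo > 65536
 *   lo/65536 < 3/7  <=>  7 * lo < 3 * 65536
 */
static int
lo16_in_range(uint32_t lo)
{
	assert(lo <= 0xFFFFu);
	return (3 * lo > 65536u && 7 * lo < 3 * 65536u);
}
```

With these helpers, mult_hi16() returns 0x9E37 for both constants, and
lo16_in_range() accepts the low half of the new multiplier (0x6DB1) but
rejects the low half of the old one (0x79B9).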

From owner-freebsd-fs@FreeBSD.ORG  Fri Nov 28 07:43:21 2003
Return-Path: <owner-freebsd-fs@FreeBSD.ORG>
Delivered-To: freebsd-fs@freebsd.org
Received: from mx1.FreeBSD.org (mx1.freebsd.org [216.136.204.125])
	by hub.freebsd.org (Postfix) with ESMTP id E825B16A4CE
	for <freebsd-fs@freebsd.org>; Fri, 28 Nov 2003 07:43:21 -0800 (PST)
Received: from bilver.wjv.com (user38.net339.fl.sprint-hsd.net [65.40.24.38])
	by mx1.FreeBSD.org (Postfix) with ESMTP id 265E843F3F
	for <freebsd-fs@freebsd.org>; Fri, 28 Nov 2003 07:43:19 -0800 (PST)
	(envelope-from bv@bilver.wjv.com)
Received: from bilver.wjv.com (localhost.wjv.com [127.0.0.1])
	by bilver.wjv.com (8.12.10/8.12.10) with ESMTP id hASFhAm7049742;
	Fri, 28 Nov 2003 10:43:10 -0500 (EST)
	(envelope-from bv@bilver.wjv.com)
Received: (from bv@localhost)
	by bilver.wjv.com (8.12.10/8.12.10/Submit) id hASFh6jX049656;
	Fri, 28 Nov 2003 10:43:06 -0500 (EST)
	(envelope-from bv)
Date: Fri, 28 Nov 2003 10:43:06 -0500
From: Bill Vermillion <bv@wjv.com>
To: Vincent Goupil <vgoupil@alis.com>
Message-ID: <20031128154306.GC47553@wjv.com>
References: <F7D4BDA0E5A1D14B99D32C022AEB7366FE10D6@alis-2k.alis.domain>
Mime-Version: 1.0
Content-Type: text/plain; charset=us-ascii
Content-Disposition: inline
In-Reply-To: <F7D4BDA0E5A1D14B99D32C022AEB7366FE10D6@alis-2k.alis.domain>
Organization: W.J.Vermillion / Orlando - Winter Park
ReplyTo: bv@wjv.com
User-Agent: Mutt/1.5.4i
X-Spam-Status: No, hits=0.0 required=5.0 tests=none autolearn=no version=2.60
X-Spam-Checker-Version: SpamAssassin 2.60 (1.212-2003-09-23-exp) on 
	bilver.wjv.com
cc: freebsd-fs@freebsd.org
Subject: Re: mfs is getting full (/etc/rc.diskless2)
X-BeenThere: freebsd-fs@freebsd.org
X-Mailman-Version: 2.1.1
Precedence: list
Reply-To: bv@wjv.com
List-Id: Filesystems <freebsd-fs.freebsd.org>
List-Unsubscribe: <http://lists.freebsd.org/mailman/listinfo/freebsd-fs>,
	<mailto:freebsd-fs-request@freebsd.org?subject=unsubscribe>
List-Archive: <http://lists.freebsd.org/pipermail/freebsd-fs>
List-Post: <mailto:freebsd-fs@freebsd.org>
List-Help: <mailto:freebsd-fs-request@freebsd.org?subject=help>
List-Subscribe: <http://lists.freebsd.org/mailman/listinfo/freebsd-fs>,
	<mailto:freebsd-fs-request@freebsd.org?subject=subscribe>
X-List-Received-Date: Fri, 28 Nov 2003 15:43:22 -0000

Shakespeare wrote plays and sonnets which will last an eternity, 
but on Fri, Nov 28, 2003 at 10:16 , Vincent Goupil wrote:

> It wasn't the logfile directly, I deleted logfiles that had been
> rotated (like .0.gz)

But if you did that with the Apache log files, and they were in
/var, you can't use syslog to rotate them.  The files will stay
open.  You have to stop and restart Apache, or use the 'rotatelogs'
that comes with Apache.

If you don't have Apache log files, some other program - one that
keeps a file open and does not close it after each write - can give
you the same results.
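
The effect described here (a file removed by name but still held open
keeps its blocks until it is truncated or closed) can be reproduced
with a short C function.  This is only a sketch under stated
assumptions: the function name unlinked_file_demo, the result struct,
and the /tmp scratch path are invented for the example, and it assumes
/tmp is writable on a local file system.

```c
#include <assert.h>
#include <stdlib.h>
#include <sys/types.h>
#include <sys/stat.h>
#include <unistd.h>

struct demo_result {
	int ok;			/* 1 if every system call succeeded */
	long nlink;		/* link count seen after unlink() */
	long size_unlinked;	/* size still held after unlink() */
	long size_truncated;	/* size after ftruncate(fd, 0) */
};

/* Hypothetical demo: write nbytes, unlink while open, then truncate. */
static struct demo_result
unlinked_file_demo(size_t nbytes)
{
	struct demo_result r = { 0, -1, -1, -1 };
	char path[] = "/tmp/unlinkdemo.XXXXXX";	/* assumed scratch location */
	char block[512] = { 0 };
	struct stat sb;
	size_t left, n;
	ssize_t w;
	int fd, named = 1;

	fd = mkstemp(path);
	if (fd == -1)
		return (r);
	for (left = nbytes; left > 0; left -= (size_t)w) {
		n = left < sizeof(block) ? left : sizeof(block);
		w = write(fd, block, n);
		if (w <= 0)
			goto out;
	}
	assert(left == 0);

	/* Remove the name; the open descriptor keeps the data alive. */
	if (unlink(path) == -1)
		goto out;
	named = 0;
	if (fstat(fd, &sb) == -1)
		goto out;
	r.nlink = (long)sb.st_nlink;
	r.size_unlinked = (long)sb.st_size;

	/* Truncating through the descriptor releases the space. */
	if (ftruncate(fd, 0) == -1 || fstat(fd, &sb) == -1)
		goto out;
	r.size_truncated = (long)sb.st_size;
	r.ok = 1;
out:
	if (named)
		unlink(path);
	close(fd);
	return (r);
}
```

On a typical local file system the link count drops to 0 after the
unlink() while the size is unchanged, which is why du no longer sees
the file but df still counts its blocks.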

Bill
> 
> 
> -----Original Message-----
> From: Bill Vermillion [mailto:bv@wjv.com]
> Sent: 27 novembre, 2003 21:33
> To: freebsd-fs@freebsd.org
> Subject: Re: mfs is getting full (/etc/rc.diskless2)
> 
> 
> Earlier in the linear time track, on approximately Thu, Nov 27, 2003
> at 18:26, Vincent Goupil divulged this public information:
> 
> 
> > I've set up a firewall with a compact flash instead of a hard drive.
> > This is the output of mount:
> > /dev/ad0s2a on / (ufs, local, read-only)
> > mfs:17 on /var (mfs, asynchronous, local)
> > procfs on /proc (procfs, local)
> > mfs:36 on /dev (mfs, asynchronous, local)
> 
> > As you see, I mount the compact flash as read-only and I set up a
> > memory filesystem for /var
> 
> > In my rc.conf file:
> > diskless_mount="/etc/rc.diskless2"
> > varsize="131072"
> 
> > Output of: df
> > Filesystem  1K-blocks   Used Avail Capacity  Mounted on
> > /dev/ad0s2a    229942 197570 13978    93%    /
> > mfs:17          63471  56584  1810    97%    /var
> > procfs              4      4     0   100%    /proc
> > mfs:36           1503     66  1317     5%    /dev
> > 
> > Output of: df -h
> > Filesystem    Size   Used  Avail Capacity  Mounted on
> > /dev/ad0s2a   225M   193M    14M    93%    /
> > mfs:17         62M    55M   1.8M    97%    /var
> > procfs        4.0K   4.0K     0B   100%    /proc
> > mfs:36        1.5M    66K   1.3M     5%    /dev
> > 
> > Output of: du -h -d 1 /var
> > 364K    /var/db
> > 1.0K    /var/account
> > 3.0K    /var/at
> > 1.0K    /var/backups
> > 1.0K    /var/crash
> > 2.0K    /var/cron
> > 1.0K    /var/empty
> > 5.0K    /var/games
> > 1.0K    /var/heimdal
> > 3.2M    /var/log
> >  31K    /var/mail
> > 2.0K    /var/msgs
> > 1.0K    /var/preserve
> >  47K    /var/run
> > 1.0K    /var/rwho
> >  16K    /var/spool
> > 3.0K    /var/tmp
> > 1.0K    /var/yp
> > 1.6M    /var/mrtg
> > 2.0K    /var/ucd-snmp
> > 5.3M    /var
> > 
> > Output of: du -d 1 /var
> > 364     /var/db
> > 1       /var/account
> > 3       /var/at
> > 1       /var/backups
> > 1       /var/crash
> > 2       /var/cron
> > 1       /var/empty
> > 5       /var/games
> > 1       /var/heimdal
> > 3299    /var/log
> > 31      /var/mail
> > 2       /var/msgs
> > 1       /var/preserve
> > 47      /var/run
> > 1       /var/rwho
> > 16      /var/spool
> > 3       /var/tmp
> > 1       /var/yp
> > 1670    /var/mrtg
> > 2       /var/ucd-snmp
> > 5456    /var
> 
> > There seems to be a big difference between the output of df and du
> > (I know, I read
> > http://www.freebsd.org/doc/en_US.ISO8859-1/books/faq/disks.html#DU-VS-DF ),
> > but it's not explaining everything. It could be the way I set it up.
> 
> > My problem is that my /var partition is getting filled very quickly
> > and I don't know why. I don't know what to clean. I've already
> > deleted some logs, but that freed only 2% of the space, or about
> > 1000 blocks.
> 
> > I don't know what is taking all this space.  Any ideas?
> 
> It sounds like you deleted some log file that the system keeps
> open.  So it will keep using up disk space even though the name is
> gone.  A file is not deleted until the last link is gone and the last
> open descriptor is closed, so if the file is opened for logging by a
> program that never releases it, that is your problem.
> 
> At that point the easiest way is to reboot.    Then find out what
> you are logging and stop the things you don't need.  When a log
> gets full DO NOT remove it.  Null it out. 
> 
> Just doing    > <logfilename>     should empty the log, reset the
> pointer back to the start of the file, and release all blocks in
> use.
> 
> Bill
> -- 
> Bill Vermillion - bv @ wjv . com

-- 
Bill Vermillion - bv @ wjv . com

From owner-freebsd-fs@FreeBSD.ORG  Fri Nov 28 13:35:15 2003
Return-Path: <owner-freebsd-fs@FreeBSD.ORG>
Delivered-To: freebsd-fs@freebsd.org
Received: from mx1.FreeBSD.org (mx1.freebsd.org [216.136.204.125])
	by hub.freebsd.org (Postfix) with ESMTP id 643E216A4CE
	for <freebsd-fs@FreeBSD.org>; Fri, 28 Nov 2003 13:35:15 -0800 (PST)
Received: from gw.catspoiler.org (217-ip-163.nccn.net [209.79.217.163])
	by mx1.FreeBSD.org (Postfix) with ESMTP id 36DE643FA3
	for <freebsd-fs@FreeBSD.org>; Fri, 28 Nov 2003 13:35:13 -0800 (PST)
	(envelope-from truckman@FreeBSD.org)
Received: from FreeBSD.org (mousie.catspoiler.org [192.168.101.2])
	by gw.catspoiler.org (8.12.9p2/8.12.9) with ESMTP id hASLYveF018257;
	Fri, 28 Nov 2003 13:35:01 -0800 (PST)
	(envelope-from truckman@FreeBSD.org)
Message-Id: <200311282135.hASLYveF018257@gw.catspoiler.org>
Date: Fri, 28 Nov 2003 13:34:57 -0800 (PST)
From: Don Lewis <truckman@FreeBSD.org>
To: kmarx@vicor.com
In-Reply-To: <200311281107.hASB7BeF017371@gw.catspoiler.org>
MIME-Version: 1.0
Content-Type: TEXT/plain; charset=us-ascii
cc: freebsd-fs@FreeBSD.org
cc: mckusick@beastie.mckusick.com
Subject: Re: 4.8 ffs_dirpref problem
X-BeenThere: freebsd-fs@freebsd.org
X-Mailman-Version: 2.1.1
Precedence: list
List-Id: Filesystems <freebsd-fs.freebsd.org>
List-Unsubscribe: <http://lists.freebsd.org/mailman/listinfo/freebsd-fs>,
	<mailto:freebsd-fs-request@freebsd.org?subject=unsubscribe>
List-Archive: <http://lists.freebsd.org/pipermail/freebsd-fs>
List-Post: <mailto:freebsd-fs@freebsd.org>
List-Help: <mailto:freebsd-fs-request@freebsd.org?subject=help>
List-Subscribe: <http://lists.freebsd.org/mailman/listinfo/freebsd-fs>,
	<mailto:freebsd-fs-request@freebsd.org?subject=subscribe>
X-List-Received-Date: Fri, 28 Nov 2003 21:35:15 -0000

On 28 Nov, To: kmarx@vicor.com wrote:
> I think we're running into one of the weaknesses in the Fibonacci hash.
> There are a large number of hash entries for the cylinder group blocks,
> which are located at offsets which are multiples of 89 * 2^10 in your
> example, or something on the order of 2^16.  The effect of this is for
> the cylinder group number to be hashed using the least significant bits
> of the hash multiplier, which don't work as well for distributing the
> hash values.  I tried some of Knuth's suggestions, and got better
> results with the hash multiplier 0x9E376DB1u.  The most significant 16
> bits of the multiplier are the same as the original constant, and the
> least significant bits act as a fraction in the desirable range of 1/3
> to 3/7.  Please give this new hash multiplier a try.

I went ahead and spun a new version of my patch with the new multiplier,
one other tweak to the formula, and updated comments.

Index: sys/kern/vfs_bio.c
===================================================================
RCS file: /home/ncvs/src/sys/kern/vfs_bio.c,v
retrieving revision 1.242.2.21
diff -u -r1.242.2.21 vfs_bio.c
--- sys/kern/vfs_bio.c	9 Aug 2003 16:21:19 -0000	1.242.2.21
+++ sys/kern/vfs_bio.c	28 Nov 2003 20:02:06 -0000
@@ -140,6 +140,7 @@
 	&bufreusecnt, 0, "");
 
 static int bufhashmask;
+static int bufhashshift;
 static LIST_HEAD(bufhashhdr, buf) *bufhashtbl, invalhash;
 struct bqueues bufqueues[BUFFER_QUEUES] = { { 0 } };
 char *buf_wmesg = BUF_WMESG;
@@ -160,7 +161,40 @@
 struct bufhashhdr *
 bufhash(struct vnode *vnp, daddr_t bn)
 {
-	return(&bufhashtbl[(((uintptr_t)(vnp) >> 7) + (int)bn) & bufhashmask]);
+	u_int64_t hashkey64;
+	int hashkey; 
+	
+	/*
+	 * A variation on the Fibonacci hash that Knuth credits to
+	 * R. W. Floyd, see Knuth's _Art of Computer Programming,
+	 * Volume 3 / Sorting and Searching_
+	 *
+	 * We reduce the argument to 32 bits before doing the hash to
+	 * avoid the need for a slow 64x64 multiply on 32 bit platforms.
+	 *
+	 * sizeof(struct vnode) is 168 on i386, so toss some of the lower
+	 * bits of the vnode address to reduce the key range, which
+	 * improves the distribution of keys across buckets.
+	 *
+	 * The file system cylinder group blocks are very heavily
+	 * used.  They are located at intervals of fpg, which is
+	 * on the order of 89 to 94 * 2^10, depending on other
+	 * filesystem parameters, for a 16k block size.  Smaller block
+	 * sizes will reduce fpg approximately proportionally.  This
+	 * will cause the cylinder group index to be hashed using the
+	 * lower bits of the hash multiplier, which will not distribute
+	 * the keys as uniformly in a classic Fibonacci hash where a
+	 * relatively small number of the upper bits of the result
+	 * are used.  Using 2^16 as a close-enough approximation to
+	 * fpg, split the hash multiplier in half, with the upper 16
+	 * bits being the inverse of the golden ratio, and the lower
+	 * 16 bits being a fraction between 1/3 and 3/7 (closer to
+	 * 3/7 in this case), that gives good experimental results.
+	 */
+	hashkey64 = ((u_int64_t)(uintptr_t)vnp >> 3) + (u_int64_t)bn;
+	hashkey = (((u_int32_t)(hashkey64 + (hashkey64 >> 32)) * 0x9E376DB1u) >>
+	    bufhashshift) & bufhashmask;
+	return(&bufhashtbl[hashkey]);
 }
 
 /*
@@ -319,8 +353,9 @@
 bufhashinit(caddr_t vaddr)
 {
 	/* first, make a null hash table */
+	bufhashshift = 29;
 	for (bufhashmask = 8; bufhashmask < nbuf / 4; bufhashmask <<= 1)
-		;
+		bufhashshift--;
 	bufhashtbl = (void *)vaddr;
 	vaddr = vaddr + sizeof(*bufhashtbl) * bufhashmask;
 	--bufhashmask;
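
For experimenting with multipliers outside the kernel, the calculation
in the patch can be mirrored in user space.  The following is a rough
sketch, not the kernel code: the names hash_params, hash_index and
max_chain, the struct, and the 1024-bucket limit are assumptions made
for the example, and the vnode address is passed in as a plain integer.

```c
#include <assert.h>
#include <stdint.h>

#define HASH_MULT	0x9E376DB1u	/* multiplier from the patch */
#define MAX_BUCKETS	1024		/* assumed limit for this sketch */

struct hashparams {
	uint32_t mask;		/* table size - 1 */
	int shift;		/* 32 - log2(table size) */
};

/* Mirror the sizing loop in bufhashinit(): a power of 2, minimum 8. */
static struct hashparams
hash_params(int nbuf)
{
	struct hashparams hp;
	uint32_t size;

	assert(nbuf >= 0);
	hp.shift = 29;
	for (size = 8; size < (uint32_t)nbuf / 4; size <<= 1)
		hp.shift--;
	hp.mask = size - 1;
	return (hp);
}

/* Mirror bufhash(): fold the key to 32 bits, multiply, keep top bits. */
static uint32_t
hash_index(uint64_t vnp_addr, uint64_t bn, struct hashparams hp)
{
	uint64_t key64;
	uint32_t key32;

	key64 = (vnp_addr >> 3) + bn;
	key32 = (uint32_t)(key64 + (key64 >> 32));
	return (((key32 * HASH_MULT) >> hp.shift) & hp.mask);
}

/* Longest chain when ncg block numbers spaced fpg apart are hashed. */
static int
max_chain(uint64_t vnp_addr, uint64_t fpg, int ncg, struct hashparams hp)
{
	int counts[MAX_BUCKETS] = { 0 };
	int i, max = 0;
	uint32_t idx;

	assert(hp.mask < MAX_BUCKETS);
	for (i = 0; i < ncg; i++) {
		idx = hash_index(vnp_addr, (uint64_t)i * fpg, hp);
		if (++counts[idx] > max)
			max = counts[idx];
	}
	return (max);
}
```

Calling hash_params(4096) gives a 1024-bucket table (shift 22, mask
1023), and max_chain() with fpg set to 89 * 1024 gives a quick way to
compare how different multipliers spread cylinder-group-style keys.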

From owner-freebsd-fs@FreeBSD.ORG  Sat Nov 29 13:29:50 2003
Return-Path: <owner-freebsd-fs@FreeBSD.ORG>
Delivered-To: freebsd-fs@freebsd.org
Received: from mx1.FreeBSD.org (mx1.freebsd.org [216.136.204.125])
	by hub.freebsd.org (Postfix) with ESMTP
	id C534816A4CE; Sat, 29 Nov 2003 13:29:49 -0800 (PST)
Received: from obsecurity.dyndns.org
	(adsl-63-207-60-234.dsl.lsan03.pacbell.net [63.207.60.234])
	by mx1.FreeBSD.org (Postfix) with ESMTP
	id C494F43F85; Sat, 29 Nov 2003 13:29:47 -0800 (PST)
	(envelope-from kris@obsecurity.org)
Received: by obsecurity.dyndns.org (Postfix, from userid 1000)
	id 5D84566D26; Sat, 29 Nov 2003 13:29:47 -0800 (PST)
Date: Sat, 29 Nov 2003 13:29:46 -0800
From: Kris Kennaway <kris@obsecurity.org>
To: Kris Kennaway <kris@obsecurity.org>
Message-ID: <20031129212946.GA8894@xor.obsecurity.org>
References: <20031124205800.GA20935@xor.obsecurity.org>
Mime-Version: 1.0
Content-Type: multipart/signed; micalg=pgp-sha1;
	protocol="application/pgp-signature"; boundary="W/nzBZO5zC0uMSeA"
Content-Disposition: inline
In-Reply-To: <20031124205800.GA20935@xor.obsecurity.org>
User-Agent: Mutt/1.4.1i
cc: re@FreeBSD.org
cc: current@FreeBSD.org
cc: fs@FreeBSD.org
Subject: Re: recursed on non-recursive lock (sleep mutex) vnode interlock @
	/var/portbuild/sparc64/src-client/sys/ufs/ufs/ufs_ihash.c:128
X-BeenThere: freebsd-fs@freebsd.org
X-Mailman-Version: 2.1.1
Precedence: list
List-Id: Filesystems <freebsd-fs.freebsd.org>
List-Unsubscribe: <http://lists.freebsd.org/mailman/listinfo/freebsd-fs>,
	<mailto:freebsd-fs-request@freebsd.org?subject=unsubscribe>
List-Archive: <http://lists.freebsd.org/pipermail/freebsd-fs>
List-Post: <mailto:freebsd-fs@freebsd.org>
List-Help: <mailto:freebsd-fs-request@freebsd.org?subject=help>
List-Subscribe: <http://lists.freebsd.org/mailman/listinfo/freebsd-fs>,
	<mailto:freebsd-fs-request@freebsd.org?subject=subscribe>
X-List-Received-Date: Sat, 29 Nov 2003 21:29:50 -0000


--W/nzBZO5zC0uMSeA
Content-Type: text/plain; charset=us-ascii
Content-Disposition: inline
Content-Transfer-Encoding: quoted-printable

I got this on an alpha machine as well.  Can someone track it down?

msgbufp = 0xfffffc0023f85fe0
magic = 63062, size = 32736, r= 59046, w = 59565, ptr = 0xfffffc0023f7e000, cksum= 2511626
lock order reversal
 1st 0xfffffc001a793d80 vnode interlock (vnode interlock) @ /a/asami/portbuild/alpha/src-client/sys/ufs/ufs/ufs_ihash.c:128
 2nd 0xfffffc00006feda0 ufs ihash (ufs ihash) @ /a/asami/portbuild/alpha/src-client/sys/ufs/ufs/ufs_ihash.c:124
Stack backtrace:
recursed on non-recursive lock (sleep mutex) vnode interlock @ /a/asami/portbuild/alpha/src-client/sys/ufs/ufs/ufs_ihash.c:128
first acquired @ /a/asami/portbuild/alpha/src-client/sys/ufs/ufs/ufs_ihash.c:128
Debugger() at Debugger+0x38
panic() at panic+0x168
witness_lock() at witness_lock+0x408
_mtx_lock_flags() at _mtx_lock_flags+0xc8
ufs_ihashget() at ufs_ihashget+0xec
ffs_vget() at ffs_vget+0x54
ufs_lookup() at ufs_lookup+0xc9c
ufs_vnoperate() at ufs_vnoperate+0x2c
vfs_cache_lookup() at vfs_cache_lookup+0x37c
ufs_vnoperate() at ufs_vnoperate+0x2c
lookup() at lookup+0x4dc
namei() at namei+0x310
stat() at stat+0x4c
syscall() at syscall+0x39c
XentSys() at XentSys+0x64
--- syscall (188, FreeBSD ELF64, stat) ---
--- user mode ---
db>

Kris

On Mon, Nov 24, 2003 at 12:58:01PM -0800, Kris Kennaway wrote:
> One of my sparc64 package machines (running -current from Nov 21) died
> overnight with the following:
>
> recursed on non-recursive lock (sleep mutex) vnode interlock @ /var/portbuild/sparc64/src-client/sys/ufs/ufs/ufs_ihash.c:128
> first acquired @ /var/portbuild/sparc64/src-client/sys/ufs/ufs/ufs_ihash.c:128
> panic: recurse
> cpuid = 0;
> Debugger("panic")
> Stopped at      Debugger+0x1c:  ta              %xcc, 1
> db> trace
> panic() at panic+0x174
> witness_lock() at witness_lock+0x3b4
> _mtx_lock_flags() at _mtx_lock_flags+0x9c
> ufs_ihashget() at ufs_ihashget+0x94
> ffs_vget() at ffs_vget+0x20
> ufs_lookup() at ufs_lookup+0xb2c
> ufs_vnoperate() at ufs_vnoperate+0x1c
> vfs_cache_lookup() at vfs_cache_lookup+0x330
> ufs_vnoperate() at ufs_vnoperate+0x1c
> lookup() at lookup+0x408
> namei() at namei+0x254
> vn_open_cred() at vn_open_cred+0x208
> vn_open() at vn_open+0x18
> kern_open() at kern_open+0x84
> open() at open+0x14
> syscall() at syscall+0x308
> -- syscall (5, FreeBSD ELF64, open) %o7=0x4038c2b0 --
> userland() at 0x40395948
> user trace: trap %o7=0x4038c2b0
> pc 0x40395948, sp 0x7fdffffdaf1
> pc 0x4038b47c, sp 0x7fdffffdc31
> pc 0x101778, sp 0x7fdffffdcf1
> pc 0x101378, sp 0x7fdffffddb1
> pc 0x100f80, sp 0x7fdffffde71
> pc 0x4020a234, sp 0x7fdffffdf31
> done


--W/nzBZO5zC0uMSeA
Content-Type: application/pgp-signature
Content-Disposition: inline

-----BEGIN PGP SIGNATURE-----
Version: GnuPG v1.2.3 (FreeBSD)

iD8DBQE/yQ/KWry0BWjoQKURAppsAKCEE93XMKCRNO6qyOD046BVWKM8NACgyhDL
CHFrv87wA0gG5JnXURXqZIQ=
=mPQe
-----END PGP SIGNATURE-----

--W/nzBZO5zC0uMSeA--