From owner-freebsd-fs@FreeBSD.ORG  Sun Oct 15 20:59:28 2006
Return-Path: <owner-freebsd-fs@FreeBSD.ORG>
X-Original-To: fs@freebsd.org
Delivered-To: freebsd-fs@FreeBSD.ORG
Received: from mx1.FreeBSD.org (mx1.freebsd.org [216.136.204.125])
	by hub.freebsd.org (Postfix) with ESMTP id DF93516A403
	for <fs@freebsd.org>; Sun, 15 Oct 2006 20:59:28 +0000 (UTC)
	(envelope-from mohan_srinivasan@yahoo.com)
Received: from web30813.mail.mud.yahoo.com (web30813.mail.mud.yahoo.com
	[68.142.201.139])
	by mx1.FreeBSD.org (Postfix) with SMTP id 48B6143D77
	for <fs@freebsd.org>; Sun, 15 Oct 2006 20:59:24 +0000 (GMT)
	(envelope-from mohan_srinivasan@yahoo.com)
Received: (qmail 49806 invoked by uid 60001); 15 Oct 2006 20:59:23 -0000
DomainKey-Signature: a=rsa-sha1; q=dns; c=nofws; s=s1024; d=yahoo.com;
	h=Message-ID:Received:Date:From:Subject:To:Cc:In-Reply-To:MIME-Version:Content-Type:Content-Transfer-Encoding;
	b=R0/MP9JLtGZo49ZRVB5fPrdVjLR7tTpRvsarQphAO68+8I1ZqB8/IDHr0heP0xELYmGfsqBbxirE8hwnevS2fGSzxzeTOWsRiS4fYW3uUwfIK7CDxDBb4E0DvDa33zHkpzTKUH1D/S9PWkT2smZVhrMZ3G0PUHNPqKL9pgkYEHM=
	; 
Message-ID: <20061015205923.49804.qmail@web30813.mail.mud.yahoo.com>
Received: from [71.139.1.197] by web30813.mail.mud.yahoo.com via HTTP;
	Sun, 15 Oct 2006 13:59:23 PDT
Date: Sun, 15 Oct 2006 13:59:23 -0700 (PDT)
From: Mohan Srinivasan <mohan_srinivasan@yahoo.com>
To: Bruce Evans <bde@zeta.org.au>, fs@freebsd.org
In-Reply-To: <20061014143825.F1264@epsplex.bde.org>
MIME-Version: 1.0
Content-Type: text/plain; charset=iso-8859-1
Content-Transfer-Encoding: 8bit
Cc: mohans@freebsd.org
Subject: Re: lost dotdot caching pessimizes nfs especially
X-BeenThere: freebsd-fs@freebsd.org
X-Mailman-Version: 2.1.5
Precedence: list
List-Id: Filesystems <freebsd-fs.freebsd.org>
List-Unsubscribe: <http://lists.freebsd.org/mailman/listinfo/freebsd-fs>,
	<mailto:freebsd-fs-request@freebsd.org?subject=unsubscribe>
List-Archive: <http://lists.freebsd.org/pipermail/freebsd-fs>
List-Post: <mailto:freebsd-fs@freebsd.org>
List-Help: <mailto:freebsd-fs-request@freebsd.org?subject=help>
List-Subscribe: <http://lists.freebsd.org/mailman/listinfo/freebsd-fs>,
	<mailto:freebsd-fs-request@freebsd.org?subject=subscribe>
X-List-Received-Date: Sun, 15 Oct 2006 20:59:29 -0000

Bruce

Defending the "silliness" of the first of the 2 changes you cite as I am the author of that 
change. Just got back from a short break and am still catching up on this thread.

The clearing of the attrcache on nfs_open() is a requirement for close-to-open
consistency, and this change fixed bugs that we saw internally relating to 
close-to-open consistency.

> and associated changes give silly behaviour that almost doubles the
> number of Access RPCs.  One of the associated changes clears n_attrstamp
> on close().  Then on open(), since lookup() is called before the above
> is reached, nfs_access_otw() has always just been called, and the above
                     ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
> forces another call.

That is not true with NFSv2 which doesn't have an access call, in which case
nfsspec_access() calls VOP_GETATTR, which may or may not go over the wire.

Also, what would happen with NFSv3 if we get an access cache hit ?

If lookup() can be made to pass a flag into nfs_open() that an otw getattr was
done, then we can eliminate the clearing of the attrcache in nfs_open(). But
absent that flag, I don't see how you can eliminate the fetch of fresh attributes
in nfs_open().
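
A minimal sketch of the flag idea, written as userland C rather than kernel code. Everything here is an assumption made for illustration: `struct node_sketch`, the `otw_fresh` flag and the `*_sketch()` functions are hypothetical and are not the real nfsnode or VOP interfaces; only the name `n_attrstamp` comes from this thread.

```c
#include <stdbool.h>
#include <time.h>

/* Hypothetical stand-in for the per-file NFS node; not the real struct nfsnode. */
struct node_sketch {
	time_t n_attrstamp;   /* 0 means "attribute cache entry invalid" */
	bool   otw_fresh;     /* assumed flag: lookup just fetched attrs over the wire */
	int    rpcs;          /* number of simulated over-the-wire attribute fetches */
};

/* Simulated over-the-wire attribute fetch. */
void
fetch_attrs_otw(struct node_sketch *np, time_t now)
{
	np->n_attrstamp = now;
	np->rpcs++;
}

/* Lookup refreshes attributes only when the cache entry is invalid. */
void
lookup_sketch(struct node_sketch *np, time_t now)
{
	np->otw_fresh = false;
	if (np->n_attrstamp == 0) {
		fetch_attrs_otw(np, now);
		np->otw_fresh = true;
	}
}

/* Open skips the forced refetch when lookup has just gone over the wire. */
void
open_sketch(struct node_sketch *np, time_t now)
{
	if (!np->otw_fresh) {
		np->n_attrstamp = 0;      /* clear the attrcache entry */
		fetch_attrs_otw(np, now);
	}
	np->otw_fresh = false;        /* the flag is good for one open only */
}

/* Close invalidates the cached attributes. */
void
close_sketch(struct node_sketch *np)
{
	np->n_attrstamp = 0;
}
```

In this model a close/lookup/open sequence costs one fetch instead of two, while an open that follows a lookup served from cache still forces a fresh fetch.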

mohan

From owner-freebsd-fs@FreeBSD.ORG  Sun Oct 15 21:08:46 2006
Return-Path: <owner-freebsd-fs@FreeBSD.ORG>
X-Original-To: fs@freebsd.org
Delivered-To: freebsd-fs@FreeBSD.ORG
Received: from mx1.FreeBSD.org (mx1.freebsd.org [216.136.204.125])
	by hub.freebsd.org (Postfix) with ESMTP id CC99A16A494
	for <fs@freebsd.org>; Sun, 15 Oct 2006 21:08:46 +0000 (UTC)
	(envelope-from mohan_srinivasan@yahoo.com)
Received: from web30812.mail.mud.yahoo.com (web30812.mail.mud.yahoo.com
	[68.142.201.138])
	by mx1.FreeBSD.org (Postfix) with SMTP id 4F7CF43D76
	for <fs@freebsd.org>; Sun, 15 Oct 2006 21:08:43 +0000 (GMT)
	(envelope-from mohan_srinivasan@yahoo.com)
Received: (qmail 89309 invoked by uid 60001); 15 Oct 2006 21:08:42 -0000
DomainKey-Signature: a=rsa-sha1; q=dns; c=nofws; s=s1024; d=yahoo.com;
	h=Message-ID:Received:Date:From:Subject:To:Cc:In-Reply-To:MIME-Version:Content-Type:Content-Transfer-Encoding;
	b=zSDNjdpoGTccssaxhSCyIqnfgwU20SbOVRg86Ta1qBqcW5+NFMhVbjq9lfuTg9Q+hFAXQtvU2NEg+5S4Mk3/HB0rgGLtjSytfRfpAnfGa7vkK8HK1LHmxEGINXSqNraAxAUb99fydt5uEYLZIbb7YJhd2mrj9vZpLYt3fFNSaks=
	; 
Message-ID: <20061015210842.89307.qmail@web30812.mail.mud.yahoo.com>
Received: from [71.139.1.197] by web30812.mail.mud.yahoo.com via HTTP;
	Sun, 15 Oct 2006 14:08:42 PDT
Date: Sun, 15 Oct 2006 14:08:42 -0700 (PDT)
From: Mohan Srinivasan <mohan_srinivasan@yahoo.com>
To: Mohan Srinivasan <mohan_srinivasan@yahoo.com>,
	Bruce Evans <bde@zeta.org.au>, fs@freebsd.org
In-Reply-To: <20061015205923.49804.qmail@web30813.mail.mud.yahoo.com>
MIME-Version: 1.0
Content-Type: text/plain; charset=iso-8859-1
Content-Transfer-Encoding: 8bit
Cc: mohans@freebsd.org
Subject: Re: lost dotdot caching pessimizes nfs especially
X-BeenThere: freebsd-fs@freebsd.org
X-Mailman-Version: 2.1.5
Precedence: list
List-Id: Filesystems <freebsd-fs.freebsd.org>
List-Unsubscribe: <http://lists.freebsd.org/mailman/listinfo/freebsd-fs>,
	<mailto:freebsd-fs-request@freebsd.org?subject=unsubscribe>
List-Archive: <http://lists.freebsd.org/pipermail/freebsd-fs>
List-Post: <mailto:freebsd-fs@freebsd.org>
List-Help: <mailto:freebsd-fs-request@freebsd.org?subject=help>
List-Subscribe: <http://lists.freebsd.org/mailman/listinfo/freebsd-fs>,
	<mailto:freebsd-fs-request@freebsd.org?subject=subscribe>
X-List-Received-Date: Sun, 15 Oct 2006 21:08:46 -0000

Bruce, 

Not sure if you are committing a change eliminating that line in nfs_open() 
that clears the attrcache. But if you're doing so, please test to make sure
you don't break close-to-open consistency.

If you're going to optimize, please give priority to correctness first.

If you are convinced that a lookup() will fetch fresh attrs in *all* cases, then
by all means go ahead and remove that line.

mohan

--- Mohan Srinivasan <mohan_srinivasan@yahoo.com> wrote:

> Bruce
> 
> Defending the "silliness" of the first of the 2 changes you cite as I am the author of that 
> change. Just got back from a short break and am still catching up on this thread.
> 
> The clearing of the attrcache on nfs_open() is a requirement for close-to-open
> consistency, and this change fixed bugs that we saw internally relating to 
> close-to-open consistency.
> 
> > and associated changes give silly behaviour that almost doubles the
> > number of Access RPCs.  One of the associated changes clears n_attrstamp
> > on close().  Then on open(), since lookup() is called before the above
> > is reached, nfs_access_otw() has always just been called, and the above
>                      ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
> > forces another call.
> 
> That is not true with NFSv2 which doesn't have an access call, in which case
> nfsspec_access() calls VOP_GETATTR, which may or may not go over the wire.
> 
> Also, what would happen with NFSv3 if we get an access cache hit ?
> 
> If lookup() can be made to pass a flag into nfs_open() that an otw getattr was
> done, then we can eliminate the clearing of the attrcache in nfs_open(). But
> absent that flag, I don't see how you can eliminate the fetch of fresh attributes
> in nfs_open().
> 
> mohan
> 


From owner-freebsd-fs@FreeBSD.ORG  Sun Oct 15 21:26:22 2006
Return-Path: <owner-freebsd-fs@FreeBSD.ORG>
X-Original-To: fs@freebsd.org
Delivered-To: freebsd-fs@FreeBSD.ORG
Received: from mx1.FreeBSD.org (mx1.freebsd.org [216.136.204.125])
	by hub.freebsd.org (Postfix) with ESMTP id F20A216A407
	for <fs@freebsd.org>; Sun, 15 Oct 2006 21:26:22 +0000 (UTC)
	(envelope-from rick@snowhite.cis.uoguelph.ca)
Received: from dargo.cs.uoguelph.ca (dargo.cs.uoguelph.ca [131.104.94.197])
	by mx1.FreeBSD.org (Postfix) with ESMTP id DD14C43D73
	for <fs@freebsd.org>; Sun, 15 Oct 2006 21:26:21 +0000 (GMT)
	(envelope-from rick@snowhite.cis.uoguelph.ca)
Received: from snowhite.cis.uoguelph.ca (snowhite.cis.uoguelph.ca
	[131.104.48.1])
	by dargo.cs.uoguelph.ca (8.13.1/8.13.1) with ESMTP id k9FLQFnp020996;
	Sun, 15 Oct 2006 17:26:15 -0400
Received: (from rick@localhost)
	by snowhite.cis.uoguelph.ca (8.9.3/8.9.3) id RAA49191;
	Sun, 15 Oct 2006 17:26:20 -0400 (EDT)
Date: Sun, 15 Oct 2006 17:26:20 -0400 (EDT)
From: rick@snowhite.cis.uoguelph.ca
Message-Id: <200610152126.RAA49191@snowhite.cis.uoguelph.ca>
To: fs@freebsd.org
X-Scanned-By: MIMEDefang 2.52 on 131.104.94.197
Cc: mohan_srinivasan@yahoo.com
Subject: Re: lost dotdot caching pessimizes nfs especially
X-BeenThere: freebsd-fs@freebsd.org
X-Mailman-Version: 2.1.5
Precedence: list
List-Id: Filesystems <freebsd-fs.freebsd.org>
List-Unsubscribe: <http://lists.freebsd.org/mailman/listinfo/freebsd-fs>,
	<mailto:freebsd-fs-request@freebsd.org?subject=unsubscribe>
List-Archive: <http://lists.freebsd.org/pipermail/freebsd-fs>
List-Post: <mailto:freebsd-fs@freebsd.org>
List-Help: <mailto:freebsd-fs-request@freebsd.org?subject=help>
List-Subscribe: <http://lists.freebsd.org/mailman/listinfo/freebsd-fs>,
	<mailto:freebsd-fs-request@freebsd.org?subject=subscribe>
X-List-Received-Date: Sun, 15 Oct 2006 21:26:23 -0000

> The clearing of the attrcache on nfs_open() is a requirement for close-to-open
> consistency, and this change fixed bugs that we saw internally relating to 
> close-to-open consistency.

I thought I'd just throw out some comments w.r.t. close-to-open consistency.
The concept comes from the Andrew File System (before Transarc's AFS), where
the client read the entire file upon Open and wrote the entire file to the
server upon Close, if it was modified. Therefore, other clients that opened
the file after the Close were guaranteed to see the changes.

To the best of my knowledge, no NFS RFC has ever required this behaviour.
It became common practice to flush writes to a server upon Close, so that
errors like ENOSPC could be returned by close(2) and a process could be
confident that the file was successfully saved if it didn't get an error
return from any write(2) syscall nor the subsequent close(2).

As a side effect of the above behaviour (not required by RFC, but common
practice), NFS clients provided "approximate close-to-open consistency".
The "approximate" came from the fact that another client wouldn't notice
that the file had been modified until its attribute cache had timed out,
a few seconds after the writing client had flushed its writes upon close.
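
The "approximate" window can be pictured with a small sketch of an attribute cache validity check. This is a simplified model under stated assumptions, not the FreeBSD client code: `struct attr_cache_sketch`, its fields and `attr_cache_valid()` are hypothetical names.

```c
#include <stdbool.h>
#include <time.h>

/* Hypothetical cached-attribute record; field names are illustrative only. */
struct attr_cache_sketch {
	time_t stamp;     /* when the attributes were fetched; 0 = invalid */
	time_t timeout;   /* seconds the entry may be trusted */
};

/* True if the cached attributes may be used without asking the server. */
bool
attr_cache_valid(const struct attr_cache_sketch *ac, time_t now)
{
	if (ac->stamp == 0)
		return false;
	return (now - ac->stamp) < ac->timeout;
}
```

While this check returns true, a second client keeps using its cached attributes and cannot notice that the writer has already flushed and closed the file.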

Somewhere along the way, some people seem to have decided that
close-to-open consistency is required of NFS clients. I think the Linux crowd
is in that camp?

Since NFS doesn't have a cache coherency protocol (even for NFSv4, although
the caching rules are somewhat more explicit in RFC3530), it is always
a performance<->consistency tradeoff.

So, I guess you guys will have to decide, rick
ps: I do believe software that expects strict close-to-open consistency
    over NFS is not correct, because that is not a requirement of the RFCs.

From owner-freebsd-fs@FreeBSD.ORG  Sun Oct 15 21:35:14 2006
Return-Path: <owner-freebsd-fs@FreeBSD.ORG>
X-Original-To: fs@freebsd.org
Delivered-To: freebsd-fs@FreeBSD.ORG
Received: from mx1.FreeBSD.org (mx1.freebsd.org [216.136.204.125])
	by hub.freebsd.org (Postfix) with ESMTP id 36B5C16A403
	for <fs@freebsd.org>; Sun, 15 Oct 2006 21:35:14 +0000 (UTC)
	(envelope-from mohan_srinivasan@yahoo.com)
Received: from web30813.mail.mud.yahoo.com (web30813.mail.mud.yahoo.com
	[68.142.201.139])
	by mx1.FreeBSD.org (Postfix) with SMTP id DEEB143D4C
	for <fs@freebsd.org>; Sun, 15 Oct 2006 21:35:09 +0000 (GMT)
	(envelope-from mohan_srinivasan@yahoo.com)
Received: (qmail 60931 invoked by uid 60001); 15 Oct 2006 21:35:09 -0000
DomainKey-Signature: a=rsa-sha1; q=dns; c=nofws; s=s1024; d=yahoo.com;
	h=Message-ID:Received:Date:From:Subject:To:Cc:In-Reply-To:MIME-Version:Content-Type:Content-Transfer-Encoding;
	b=Esd+M0gwv7AP+XDiSdAiLx/QKt9R1xptkpnXb4+zmqvDVZBMp8lUgpGqt4z2qitkpSA9HiQTUvh8eCW6GqEZ7KLRgvrH5HcXNPVoFyUjDfa42LjgtdZufEfD7Zt5dm8wYQxJfngdANynQ3MeSupw6FPPx+/vZH21dZ/VWOA3Jss=
	; 
Message-ID: <20061015213509.60929.qmail@web30813.mail.mud.yahoo.com>
Received: from [71.139.1.197] by web30813.mail.mud.yahoo.com via HTTP;
	Sun, 15 Oct 2006 14:35:09 PDT
Date: Sun, 15 Oct 2006 14:35:09 -0700 (PDT)
From: Mohan Srinivasan <mohan_srinivasan@yahoo.com>
To: rick@snowhite.cis.uoguelph.ca, fs@freebsd.org
In-Reply-To: <200610152126.RAA49191@snowhite.cis.uoguelph.ca>
MIME-Version: 1.0
Content-Type: text/plain; charset=iso-8859-1
Content-Transfer-Encoding: 8bit
Cc: mohan_srinivasan@yahoo.com
Subject: Re: lost dotdot caching pessimizes nfs especially
X-BeenThere: freebsd-fs@freebsd.org
X-Mailman-Version: 2.1.5
Precedence: list
List-Id: Filesystems <freebsd-fs.freebsd.org>
List-Unsubscribe: <http://lists.freebsd.org/mailman/listinfo/freebsd-fs>,
	<mailto:freebsd-fs-request@freebsd.org?subject=unsubscribe>
List-Archive: <http://lists.freebsd.org/pipermail/freebsd-fs>
List-Post: <mailto:freebsd-fs@freebsd.org>
List-Help: <mailto:freebsd-fs-request@freebsd.org?subject=help>
List-Subscribe: <http://lists.freebsd.org/mailman/listinfo/freebsd-fs>,
	<mailto:freebsd-fs-request@freebsd.org?subject=subscribe>
X-List-Received-Date: Sun, 15 Oct 2006 21:35:14 -0000

SunOS and Solaris have provided close-to-open consistency for a very long time (for
at least 15 years now). 

Not having the NFS client enforce close-to-open consistency will break a heck of a lot
of applications. Since the other NFS clients (that matter) Solaris and Linux support it, 
I would argue that not supporting cto consistency is not really an option. 

We can however provide a mount option "nocto"  (like those clients do) that overrides
the default for specific cases (read only mounts, single client mounts etc).
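
As a rough sketch of how such an option could gate the revalidation on open: the flag bit `NOCTO_SKETCH` and the function `open_must_revalidate()` are hypothetical names invented for this example and are not existing FreeBSD mount flags or functions.

```c
#include <stdbool.h>

/* Hypothetical mount flag bit; not an existing FreeBSD mount flag. */
#define NOCTO_SKETCH 0x1u

/* Decide whether open must revalidate attributes with the server. */
bool
open_must_revalidate(unsigned int mount_flags)
{
	/* Close-to-open consistency is the default; "nocto" opts out. */
	return (mount_flags & NOCTO_SKETCH) == 0;
}
```

The default stays safe: only a mount that explicitly asks for "nocto" skips the fetch of fresh attributes.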

mohan

--- rick@snowhite.cis.uoguelph.ca wrote:

> > The clearing of the attrcache on nfs_open() is a requirement for close-to-open
> > consistency, and this change fixed bugs that we saw internally relating to 
> > close-to-open consistency.
> 
> I thought I'd just throw out some comments w.r.t. close-to-open consistency.
> The concept comes from the Andrew File System (before Transarc's AFS), where
> the client read the entire file upon Open and wrote the entire file to the
> server upon Close, if it was modified. Therefore, other clients that opened
> the file after the Close were guaranteed to see the changes.
> 
> To the best of my knowledge, no NFS RFC has ever required this behaviour.
> It became common practice to flush writes to a server upon Close, so that
> errors like ENOSPC could be returned by close(2) and a process could be
> confident that the file was successfully saved if it didn't get an error
> return from any write(2) syscall nor the subsequent close(2).
> 
> As a side effect of the above behaviour (not required by RFC, but common
> practice), NFS clients provided "approximate close-to-open consistency".
> The "approximate" came from the fact that another client wouldn't notice
> that the file had been modified until its attribute cache had timed out,
> a few seconds after the writing client had flushed its writes upon close.
> 
> Somewhere along the way, some people seem to have decided that
> close-to-open consistency is required of NFS clients. I think the Linux crowd
> is in that camp?
> 
> Since NFS doesn't have a cache coherency protocol (even for NFSv4, although
> the caching rules are somewhat more explicit in RFC3530), it is always
> a performance<->consistency tradeoff.
> 
> So, I guess you guys will have to decide, rick
> ps: I do believe software that expects strict close-to-open consistency
>     over NFS is not correct, because that is not a requirement of the RFCs.
> 


From owner-freebsd-fs@FreeBSD.ORG  Mon Oct 16 06:30:02 2006
Return-Path: <owner-freebsd-fs@FreeBSD.ORG>
X-Original-To: fs@freebsd.org
Delivered-To: freebsd-fs@FreeBSD.ORG
Received: from mx1.FreeBSD.org (mx1.freebsd.org [216.136.204.125])
	by hub.freebsd.org (Postfix) with ESMTP id 48A0116A47C;
	Mon, 16 Oct 2006 06:30:02 +0000 (UTC) (envelope-from bde@zeta.org.au)
Received: from mailout1.pacific.net.au (mailout1-3.pacific.net.au [61.8.2.210])
	by mx1.FreeBSD.org (Postfix) with ESMTP id 5BAE743D5E;
	Mon, 16 Oct 2006 06:30:01 +0000 (GMT) (envelope-from bde@zeta.org.au)
Received: from mailproxy2.pacific.net.au (mailproxy2.pacific.net.au
	[61.8.2.163])
	by mailout1.pacific.net.au (Postfix) with ESMTP id D3A99328DA5;
	Mon, 16 Oct 2006 16:29:59 +1000 (EST)
Received: from katana.zip.com.au (katana.zip.com.au [61.8.7.246])
	by mailproxy2.pacific.net.au (Postfix) with ESMTP id 90A642741E;
	Mon, 16 Oct 2006 16:29:58 +1000 (EST)
Date: Mon, 16 Oct 2006 16:29:57 +1000 (EST)
From: Bruce Evans <bde@zeta.org.au>
X-X-Sender: bde@delplex.bde.org
To: Mohan Srinivasan <mohan_srinivasan@yahoo.com>
In-Reply-To: <20061015205923.49804.qmail@web30813.mail.mud.yahoo.com>
Message-ID: <20061016130540.C63585@delplex.bde.org>
References: <20061015205923.49804.qmail@web30813.mail.mud.yahoo.com>
MIME-Version: 1.0
Content-Type: TEXT/PLAIN; charset=US-ASCII; format=flowed
Cc: fs@freebsd.org, mohans@freebsd.org
Subject: Re: lost dotdot caching pessimizes nfs especially
X-BeenThere: freebsd-fs@freebsd.org
X-Mailman-Version: 2.1.5
Precedence: list
List-Id: Filesystems <freebsd-fs.freebsd.org>
List-Unsubscribe: <http://lists.freebsd.org/mailman/listinfo/freebsd-fs>,
	<mailto:freebsd-fs-request@freebsd.org?subject=unsubscribe>
List-Archive: <http://lists.freebsd.org/pipermail/freebsd-fs>
List-Post: <mailto:freebsd-fs@freebsd.org>
List-Help: <mailto:freebsd-fs-request@freebsd.org?subject=help>
List-Subscribe: <http://lists.freebsd.org/mailman/listinfo/freebsd-fs>,
	<mailto:freebsd-fs-request@freebsd.org?subject=subscribe>
X-List-Received-Date: Mon, 16 Oct 2006 06:30:02 -0000

On Sun, 15 Oct 2006, Mohan Srinivasan wrote:

> The clearing of the attrcache on nfs_open() is a requirement for close-to-open
> consistency, and this change fixed bugs that we saw internally relating to
> close-to-open consistency.
>
>> and associated changes give silly behaviour that almost doubles the
>> number of Access RPCs.  One of the associated changes clears n_attrstamp
>> on close().  Then on open(), since lookup() is called before the above
>> is reached, nfs_access_otw() has always just been called, and the above
>                     ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
>> forces another call.
>
> That is not true with NFSv2 which doesn't have an access call, in which case
> nfsspec_access() calls VOP_GETATTR, which may or may not go over the wire.
>
> Also, what would happen with NFSv3 if we get an access cache hit ?

I didn't think about NFSv2 or check the details for NFSv3 until now.
It is nfs_lookup() that always calls VOP_GETATTR(), and VOP_GETATTR()
must go on the wire in the case being described (lookup after close)
since we flushed the attribute cache entry in nfs_close().  The
difference for v2 is that nfs_getattr() normally uses a Getattr request
for v2 and an Access request for v3.

For NFSv3, nfs_lookup()'s behaviour is correct, but it is not as good as
it could easily be for the attribute cache.  In
nfs_lookup() after a recent close(), in the usual cases all caches are
hit except we just cleared the attribute cache, so nfs_lookup() does
the following:

     VOP_ACCESS();        # Cache hit.  Access granted.
     cache_lookup();      # Positive cache hit.
     VOP_GETATTR();       # Cache miss.  Succeeds.
     # Now we have fresh attributes in the v3 case, but we granted access
     # based on the old attributes, so we unnecessarily lost full
     # open/close consistency.

In unusual cases, there is an access cache miss.  Then for v3,
VOP_ACCESS() refreshes the attribute cache too, VOP_GETATTR() is a
cache hit, and there is full open/close consistency.
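
The two cases can be modelled with a small userland sketch. It only counts simulated over-the-wire requests; `struct caches_sketch` and the `*_sketch()` functions are hypothetical stand-ins, not the real VOP_ACCESS()/VOP_GETATTR() implementations.

```c
#include <stdbool.h>

/* Hypothetical model of the two client caches discussed above. */
struct caches_sketch {
	bool access_valid;   /* access cache entry usable */
	bool attr_valid;     /* attribute cache entry usable */
	int  access_rpcs;    /* simulated requests sent by the access check */
	int  getattr_rpcs;   /* simulated attribute fetches sent by getattr */
};

/* Model of the v3 access check: a miss goes over the wire and also
 * brings back fresh attributes. */
void
access_sketch(struct caches_sketch *c)
{
	if (!c->access_valid) {
		c->access_rpcs++;
		c->access_valid = true;
		c->attr_valid = true;
	}
}

/* Model of getattr: a miss goes over the wire. */
void
getattr_sketch(struct caches_sketch *c)
{
	if (!c->attr_valid) {
		c->getattr_rpcs++;
		c->attr_valid = true;
	}
}

/* Order of operations in the lookup path as described above. */
void
lookup_path_sketch(struct caches_sketch *c)
{
	access_sketch(c);    /* decided first, possibly from stale cached state */
	getattr_sketch(c);   /* fresh attributes arrive only after that decision */
}
```

Starting from a valid access cache and a cleared attribute cache, the model sends one attribute fetch after access was already granted from cache; starting from both caches invalid, the access check alone goes over the wire and getattr is a hit.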

> If lookup() can be made to pass a flag into nfs_open() that an otw getattr was
> done, then we can eliminate the clearing of the attrcache in nfs_open(). But
> absent that flag, I don't see how you can eliminate the fetch of fresh attributes
> in nfs_open().

Of course something like such a flag is needed.  See my previous mail for
more details (there should be another flag for nfs_lookup() so that the
entire open() is consistent).  For nfs_open(), I was thinking more of
a generation count.  Now I wonder about exclusive locking and blockages.
VOP_OPEN() is now exclusively locked, but I don't know if the same lock
covers the lookup.  With exclusive locking, not even a flag is needed.
Without exclusive locking, blocking might be a problem.

Bruce

From owner-freebsd-fs@FreeBSD.ORG  Mon Oct 16 06:41:11 2006
Return-Path: <owner-freebsd-fs@FreeBSD.ORG>
X-Original-To: fs@freebsd.org
Delivered-To: freebsd-fs@FreeBSD.ORG
Received: from mx1.FreeBSD.org (mx1.freebsd.org [216.136.204.125])
	by hub.freebsd.org (Postfix) with ESMTP id 5EAAC16A40F
	for <fs@freebsd.org>; Mon, 16 Oct 2006 06:41:11 +0000 (UTC)
	(envelope-from bde@zeta.org.au)
Received: from mailout1.pacific.net.au (mailout1-3.pacific.net.au [61.8.2.210])
	by mx1.FreeBSD.org (Postfix) with ESMTP id F107A43D53
	for <fs@freebsd.org>; Mon, 16 Oct 2006 06:41:10 +0000 (GMT)
	(envelope-from bde@zeta.org.au)
Received: from mailproxy1.pacific.net.au (mailproxy1.pacific.net.au
	[61.8.2.162])
	by mailout1.pacific.net.au (Postfix) with ESMTP id 3CCBE69FDAF;
	Mon, 16 Oct 2006 16:41:10 +1000 (EST)
Received: from katana.zip.com.au (katana.zip.com.au [61.8.7.246])
	by mailproxy1.pacific.net.au (Postfix) with ESMTP id 57F858C3A;
	Mon, 16 Oct 2006 16:41:09 +1000 (EST)
Date: Mon, 16 Oct 2006 16:41:08 +1000 (EST)
From: Bruce Evans <bde@zeta.org.au>
X-X-Sender: bde@delplex.bde.org
To: Mohan Srinivasan <mohan_srinivasan@yahoo.com>
In-Reply-To: <20061015213509.60929.qmail@web30813.mail.mud.yahoo.com>
Message-ID: <20061016163015.C63585@delplex.bde.org>
References: <20061015213509.60929.qmail@web30813.mail.mud.yahoo.com>
MIME-Version: 1.0
Content-Type: TEXT/PLAIN; charset=US-ASCII; format=flowed
Cc: fs@freebsd.org, rick@snowhite.cis.uoguelph.ca
Subject: Re: lost dotdot caching pessimizes nfs especially
X-BeenThere: freebsd-fs@freebsd.org
X-Mailman-Version: 2.1.5
Precedence: list
List-Id: Filesystems <freebsd-fs.freebsd.org>
List-Unsubscribe: <http://lists.freebsd.org/mailman/listinfo/freebsd-fs>,
	<mailto:freebsd-fs-request@freebsd.org?subject=unsubscribe>
List-Archive: <http://lists.freebsd.org/pipermail/freebsd-fs>
List-Post: <mailto:freebsd-fs@freebsd.org>
List-Help: <mailto:freebsd-fs-request@freebsd.org?subject=help>
List-Subscribe: <http://lists.freebsd.org/mailman/listinfo/freebsd-fs>,
	<mailto:freebsd-fs-request@freebsd.org?subject=subscribe>
X-List-Received-Date: Mon, 16 Oct 2006 06:41:11 -0000

On Sun, 15 Oct 2006, Mohan Srinivasan wrote:

> of applications. Since the other NFS clients (that matter) Solaris and Linux support it,
> I would argue that not supporting cto consistency is not really an option.

I agree.

> We can however provide a mount option "nocto"  (like those clients do) that overrides
> the default for specific cases (read only mounts, single client mounts etc).

PR 78673 has a patch to break consistency unconditionally for r/o
mounts.  I use this, but it doesn't help for my most active file system
(/usr/obj) since that needs to be r/w.  It is obviously wrong to do
this unconditionally on the client.  It is the server's read-onlyness
that matters.  I don't know how to track the server's read-onlyness
short of asking it on every open() or Access.

Bruce

From owner-freebsd-fs@FreeBSD.ORG  Mon Oct 16 15:32:26 2006
Return-Path: <owner-freebsd-fs@FreeBSD.ORG>
X-Original-To: fs@freebsd.org
Delivered-To: freebsd-fs@FreeBSD.ORG
Received: from mx1.FreeBSD.org (mx1.freebsd.org [216.136.204.125])
	by hub.freebsd.org (Postfix) with ESMTP id 76A0F16A403
	for <fs@freebsd.org>; Mon, 16 Oct 2006 15:32:26 +0000 (UTC)
	(envelope-from rick@snowhite.cis.uoguelph.ca)
Received: from dargo.cs.uoguelph.ca (dargo.cs.uoguelph.ca [131.104.94.197])
	by mx1.FreeBSD.org (Postfix) with ESMTP id C58B443D76
	for <fs@freebsd.org>; Mon, 16 Oct 2006 15:32:24 +0000 (GMT)
	(envelope-from rick@snowhite.cis.uoguelph.ca)
Received: from snowhite.cis.uoguelph.ca (snowhite.cis.uoguelph.ca
	[131.104.48.1])
	by dargo.cs.uoguelph.ca (8.13.1/8.13.1) with ESMTP id k9GFWLtJ013264;
	Mon, 16 Oct 2006 11:32:21 -0400
Received: (from rick@localhost)
	by snowhite.cis.uoguelph.ca (8.9.3/8.9.3) id LAA59652;
	Mon, 16 Oct 2006 11:32:36 -0400 (EDT)
Date: Mon, 16 Oct 2006 11:32:36 -0400 (EDT)
From: rick@snowhite.cis.uoguelph.ca
Message-Id: <200610161532.LAA59652@snowhite.cis.uoguelph.ca>
To: fs@freebsd.org
X-Scanned-By: MIMEDefang 2.52 on 131.104.94.197
Cc: mohan_srinivasan@yahoo.com
Subject: Re: lost dotdot caching pessimizes nfs especially
X-BeenThere: freebsd-fs@freebsd.org
X-Mailman-Version: 2.1.5
Precedence: list
List-Id: Filesystems <freebsd-fs.freebsd.org>
List-Unsubscribe: <http://lists.freebsd.org/mailman/listinfo/freebsd-fs>,
	<mailto:freebsd-fs-request@freebsd.org?subject=unsubscribe>
List-Archive: <http://lists.freebsd.org/pipermail/freebsd-fs>
List-Post: <mailto:freebsd-fs@freebsd.org>
List-Help: <mailto:freebsd-fs-request@freebsd.org?subject=help>
List-Subscribe: <http://lists.freebsd.org/mailman/listinfo/freebsd-fs>,
	<mailto:freebsd-fs-request@freebsd.org?subject=subscribe>
X-List-Received-Date: Mon, 16 Oct 2006 15:32:26 -0000

> SunOS and Solaris have provided close-to-open consistency for a very long time (for
> at least 15 years now). 

I believe you (and I understand the principle "if Solaris does X, that's the
way it needs to be"). Until very recently, Solaris sources weren't open and
I didn't have access to them. That was important in the "bad old days", since
Sun's Legal Beagles did once send a threatening letter w.r.t. my NFS
violating their proprietary...

I'd argue that these days it's "if Linux does X, that's the way it needs
to be".:-)

> Not having the NFS client enforce close-to-open consistency will break a heck of a lot
> of applications. Since the other NFS clients (that matter) Solaris and Linux support it, 
> I would argue that not supporting cto consistency is not really an option. 

I am a bit surprised that a lot of applications break. To do so, they must
be running on multiple clients, read/write sharing the same NFS mounted file(s)
and use some back-end protocol that says something like "I've just closed it,
so you can now open it", so that inconsistencies of under a minute cause
problems.  I wish those applications had been common a decade ago because the
NFS community might have cared about cache consistency and something along
the lines of my experimental NQNFS might have happened. NFSv4 simply
says that clients that care about cache consistency should use byte
range locking. Here's what RFC3530 (the NFSv4 RFC) says:

at top of Page 14:
   If an application wants to serialize access to file data, file
   locking of the file data ranges in question should be used.

Admittedly the paragraph that precedes this almost says what Solaris is doing
and seems to contradict Sec 9:
9.  Client-Side Caching

   Client-side caching of data, of file attributes, and of file names is
   essential to providing good performance with the NFS protocol.
   Providing distributed cache coherence is a difficult problem and
   previous versions of the NFS protocol have not attempted it.
   Instead, several NFS client implementation techniques have been used
   to reduce the problems that a lack of coherence poses for users.
   These techniques have not been clearly defined by earlier protocol
   specifications and it is often unclear what is valid or invalid
   client behavior.

   The NFS version 4 protocol uses many techniques similar to those that
   have been used in previous protocol versions.  The NFS version 4
   protocol does not provide distributed cache coherence.  However, it
   defines a more limited set of caching guarantees to allow locks and
   share reservations to be used without destructive interference from
   client side caching.

   In addition, the NFS version 4 protocol introduces a delegation
   mechanism which allows many decisions normally made by the server to
   be made locally by clients.  This mechanism provides efficient
   support of the common cases where sharing is infrequent or where
   sharing is read-only.

Once clients figure out how to effectively use Delegations, I think there
will be significant performance improvements. Unfortunately, we haven't
gotten there yet.

For NFSv2/3, maintaining close-to-open consistency may be appropriate
(and necessary), but it can result in a big performance hit. For example,
try an experiment where you turn off "push writes on close" in the code
and see what effect that has on performance for the common, non-read/write
shared files case. (nb: I don't think you can get away with doing this
without some other cache consistency guarantees, but it would sure be
nice from a performance point of view. That was what nqnfs was all about.
If you are really bored, you can go to http://www.cis.uoguelph.ca/~nfsv4 and
read my ancient nqnfs paper.)
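
For applications that take the RFC3530 advice quoted above and serialize access with byte-range locks, a minimal sketch using the POSIX `fcntl()` advisory locking interface might look like the following. The helper names `lock_range()` and `unlock_range()` are made up for this example, and whether such locks are honoured across clients depends on the NFS version and on lock support being enabled at both ends.

```c
#include <fcntl.h>
#include <unistd.h>

/*
 * Take an exclusive POSIX advisory lock on len bytes of fd starting at
 * start, blocking until it is granted.  Returns fcntl()'s result, which
 * is -1 on error.
 */
int
lock_range(int fd, off_t start, off_t len)
{
	struct flock fl = {0};

	fl.l_type = F_WRLCK;
	fl.l_whence = SEEK_SET;
	fl.l_start = start;
	fl.l_len = len;
	return fcntl(fd, F_SETLKW, &fl);
}

/* Release a lock previously taken on the same range. */
int
unlock_range(int fd, off_t start, off_t len)
{
	struct flock fl = {0};

	fl.l_type = F_UNLCK;
	fl.l_whence = SEEK_SET;
	fl.l_start = start;
	fl.l_len = len;
	return fcntl(fd, F_SETLK, &fl);
}
```

A writer would call `lock_range()` before modifying the range and `unlock_range()` afterwards, and a reader on another client would take the same lock before reading, instead of relying on close-to-open behaviour.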

rick

From owner-freebsd-fs@FreeBSD.ORG  Mon Oct 16 21:42:53 2006
Return-Path: <owner-freebsd-fs@FreeBSD.ORG>
X-Original-To: freebsd-fs@freebsd.org
Delivered-To: freebsd-fs@freebsd.org
Received: from mx1.FreeBSD.org (mx1.freebsd.org [216.136.204.125])
	by hub.freebsd.org (Postfix) with ESMTP id 5E88516A417
	for <freebsd-fs@freebsd.org>; Mon, 16 Oct 2006 21:42:53 +0000 (UTC)
	(envelope-from lgusenet@be-well.ilk.org)
Received: from mail6.sea5.speakeasy.net (mail6.sea5.speakeasy.net
	[69.17.117.8]) by mx1.FreeBSD.org (Postfix) with ESMTP id 3629843D67
	for <freebsd-fs@freebsd.org>; Mon, 16 Oct 2006 21:42:35 +0000 (GMT)
	(envelope-from lgusenet@be-well.ilk.org)
Received: (qmail 10451 invoked from network); 16 Oct 2006 21:42:35 -0000
Received: from dsl092-078-145.bos1.dsl.speakeasy.net (HELO be-well.ilk.org)
	([66.92.78.145]) (envelope-sender <lgusenet@be-well.ilk.org>)
	by mail6.sea5.speakeasy.net (qmail-ldap-1.03) with SMTP
	for <freebsd-fs@freebsd.org>; 16 Oct 2006 21:42:35 -0000
Received: by be-well.ilk.org (Postfix, from userid 1147)
	id CC7632842E; Mon, 16 Oct 2006 17:42:33 -0400 (EDT)
To: freebsd-fs@freebsd.org
To: absorbb@gmail.com (=?utf-8?B?0JjQu9GM0LTQsNGAINCd0YPRgNC40YHQu9Cw0Lw=?=
	=?utf-8?B?0L7Qsg==?=)
References: <200609292159.18282.absorbb@gmail.com>
From: Lowell Gilbert <lgfbsd@be-well.ilk.org>
Date: Mon, 16 Oct 2006 17:42:33 -0400
In-Reply-To: <200609292159.18282.absorbb@gmail.com> (=?utf-8?B?0JjQu9GM?=
	=?utf-8?B?0LTQsNGAINCd0YPRgNC40YHQu9Cw0LzQvtCyJ3M=?= message of "Fri, 29
	Sep 2006 21:59:17 +0400")
Message-ID: <44d58sos3a.fsf@be-well.ilk.org>
User-Agent: Gnus/5.11 (Gnus v5.11) Emacs/22.0.50 (berkeley-unix)
MIME-Version: 1.0
Content-Type: text/plain; charset=utf-8
Content-Transfer-Encoding: quoted-printable
Cc: 
Subject: Re: ntfs broken when share through samba3
X-BeenThere: freebsd-fs@freebsd.org
X-Mailman-Version: 2.1.5
Precedence: list
Reply-To: freebsd-fs@freebsd.org
List-Id: Filesystems <freebsd-fs.freebsd.org>
List-Unsubscribe: <http://lists.freebsd.org/mailman/listinfo/freebsd-fs>,
	<mailto:freebsd-fs-request@freebsd.org?subject=unsubscribe>
List-Archive: <http://lists.freebsd.org/pipermail/freebsd-fs>
List-Post: <mailto:freebsd-fs@freebsd.org>
List-Help: <mailto:freebsd-fs-request@freebsd.org?subject=help>
List-Subscribe: <http://lists.freebsd.org/mailman/listinfo/freebsd-fs>,
	<mailto:freebsd-fs-request@freebsd.org?subject=subscribe>
X-List-Received-Date: Mon, 16 Oct 2006 21:42:53 -0000

absorbb@gmail.com (Ильдар Нурисламов) writes:

> This is an old, already reported bug.
> But the situation hasn't changed.

For example, kern/86965.

> There is very simple patch that fix this bug:
>
> --- usr/src/sys/fs/ntfs/ntfs_vnops.c	Mon Mar 13 00:50:01 2006
> +++ home/voxel/stuff/ntfs_vnops.c	Thu Aug 31 09:22:08 2006
> @@ -187,7 +187,8 @@
>  	vap->va_fsid = dev2udev(ip->i_dev);
>  	vap->va_fileid = ip->i_number;
>  	vap->va_mode = ip->i_mp->ntm_mode;
> -	vap->va_nlink = ip->i_nlink;
> +	vap->va_nlink = (ip->i_nlink ? ip->i_nlink : 1);
> +	//vap->va_nlink = ip->i_nlink;
>  	vap->va_uid = ip->i_mp->ntm_uid;
>  	vap->va_gid = ip->i_mp->ntm_gid;
>  	vap->va_rdev = 0;				/* XXX UNODEV ? */
>
> but it seems to be not beaty solution

Not beautiful, indeed.

I was playing around with this, and although that change would work
around the problem in (at least) most cases, I am not sure that it is
truly correct.

I am not an expert at filesystems, and certainly have little knowledge
of NTFS.  However, my observations confuse me considerably.  The main
issue is that if you read from a file (on NTFS, with a link count of
zero according to ls(1)), the link count becomes populated.  I cannot
see how that would happen, because the ntnode structure link count is
not modified except when reading the whole structure from the disk,
and the on-disk node is not being changed.  To confuse things further,
the link count is changed to 2, not 1, on ordinary files that have
only a single directory entry.  I do not believe that streams are at
issue, because the file has no open file descriptors remaining
according to fstat(1).

Any thoughts from the experts?

Might Darwin have any useful hints?

From owner-freebsd-fs@FreeBSD.ORG  Mon Oct 16 22:05:48 2006
Return-Path: <owner-freebsd-fs@FreeBSD.ORG>
X-Original-To: fs@freebsd.org
Delivered-To: freebsd-fs@FreeBSD.ORG
Received: from mx1.FreeBSD.org (mx1.freebsd.org [216.136.204.125])
	by hub.freebsd.org (Postfix) with ESMTP id 143DB16A51A
	for <fs@freebsd.org>; Mon, 16 Oct 2006 22:05:48 +0000 (UTC)
	(envelope-from mohan_srinivasan@yahoo.com)
Received: from web30804.mail.mud.yahoo.com (web30804.mail.mud.yahoo.com
	[68.142.200.147])
	by mx1.FreeBSD.org (Postfix) with SMTP id EAC0D43D95
	for <fs@freebsd.org>; Mon, 16 Oct 2006 22:05:31 +0000 (GMT)
	(envelope-from mohan_srinivasan@yahoo.com)
Received: (qmail 49387 invoked by uid 60001); 16 Oct 2006 22:05:31 -0000
DomainKey-Signature: a=rsa-sha1; q=dns; c=nofws; s=s1024; d=yahoo.com;
	h=Message-ID:Received:Date:From:Subject:To:Cc:In-Reply-To:MIME-Version:Content-Type:Content-Transfer-Encoding;
	b=xc6p1XFyVmMpPrOVhh7bgGqZpUDiOt26igmfMqAQ1AlPWxGJEL7+2dVTwDKCZwew9pJKSKFl72s16vGPwH44DY9+YpipSarrbd71OiwJ31ib3yqYyNpuZrWEQ2cDeOYyNNpKcJ9GlY3q0CZ82kwOBUpgSoyfCu9PXiDGVvBfMcw=
	; 
Message-ID: <20061016220531.49385.qmail@web30804.mail.mud.yahoo.com>
Received: from [207.126.239.39] by web30804.mail.mud.yahoo.com via HTTP;
	Mon, 16 Oct 2006 15:05:31 PDT
Date: Mon, 16 Oct 2006 15:05:31 -0700 (PDT)
From: Mohan Srinivasan <mohan_srinivasan@yahoo.com>
To: rick@snowhite.cis.uoguelph.ca, fs@freebsd.org
In-Reply-To: <200610161532.LAA59652@snowhite.cis.uoguelph.ca>
MIME-Version: 1.0
Content-Type: text/plain; charset=iso-8859-1
Content-Transfer-Encoding: 8bit
Cc: mohan_srinivasan@yahoo.com
Subject: Re: lost dotdot caching pessimizes nfs especially
X-BeenThere: freebsd-fs@freebsd.org
X-Mailman-Version: 2.1.5
Precedence: list
List-Id: Filesystems <freebsd-fs.freebsd.org>
List-Unsubscribe: <http://lists.freebsd.org/mailman/listinfo/freebsd-fs>,
	<mailto:freebsd-fs-request@freebsd.org?subject=unsubscribe>
List-Archive: <http://lists.freebsd.org/pipermail/freebsd-fs>
List-Post: <mailto:freebsd-fs@freebsd.org>
List-Help: <mailto:freebsd-fs-request@freebsd.org?subject=help>
List-Subscribe: <http://lists.freebsd.org/mailman/listinfo/freebsd-fs>,
	<mailto:freebsd-fs-request@freebsd.org?subject=subscribe>
X-List-Received-Date: Mon, 16 Oct 2006 22:05:48 -0000

Hi Rick

--- rick@snowhite.cis.uoguelph.ca wrote:

> > SunOS and Solaris have provided close-to-open consistency for a very long time (for
> > at least 15 years now). 
> 
> I believe you (and I understand the principle "if Solaris does X, that's the
> way it needs to be"). Until very recently, Solaris sources weren't open and
> I didn't have access to them. That was important in the "bad old days", since
> Sun's Legal Beagles did once send a threatening letter w.r.t. my NFS
> violating their proprietary...
> 
> I'd argue that these days it's "if Linux does X, that's the way it needs
> to be".:-)

Probably so. But to my mind, Solaris still has the most robust NFS client implementation
out there (I have no affiliation with Sun whatsoever), which is why I mentioned Solaris.
I have not looked at Linux wrt cto consistency.

> > Not having the NFS client enforce close-to-open consistency will break a heck of a lot
> > of applications. Since the other NFS clients (that matter) Solaris and Linux support it, 
> > I would argue that not supporting cto consistency is not really an option. 
> 
> I am a bit surprised that a lot of applications break. To do so, they must
> be running on multiple clients, read/write sharing the same NFS mounted file(s)
> and use some back-end protocol that says something like "I've just closed it,
> so you can now open it" so inconsistencies < 1 minute, causes problems.

Such applications are very common. 

1) An application where multiple clients can do something like this :

   Acquire a file lock;
   open();
   Do I/O
   close();
   Drop the file lock;

   Without cto consistency, there's no way this is going to work.

   That is what a large application where I am employed does, and I would
   think that is a very common case elsewhere too.

(You can replace the file lock with a byte range lock post-open(), but that
won't change the result).

2) Without cto consistency, something as simple as editing a file on one
   client and compiling it on another won't work anymore. Breaking this is
   sure to send people with pitchforks running after the perpetrator :)
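
A minimal user-space sketch of the pattern in 1), assuming POSIX fcntl()
advisory locks taken on a separate lock file. The helper names set_lock()
and locked_append() and the two-file layout are hypothetical, picked only
for illustration; they are not taken from any application mentioned here.
The comments mark where a client with cto consistency is expected to
revalidate its cache and flush dirty data.

```c
#include <assert.h>
#include <fcntl.h>
#include <string.h>
#include <unistd.h>

/* Take (F_WRLCK) or drop (F_UNLCK) a whole-file advisory lock on lockfd. */
static int
set_lock(int lockfd, short type)
{
	struct flock fl;

	memset(&fl, 0, sizeof(fl));
	fl.l_type = type;
	fl.l_whence = SEEK_SET;
	fl.l_start = 0;
	fl.l_len = 0;			/* 0 means "through end of file" */
	return (fcntl(lockfd, F_SETLKW, &fl));
}

/*
 * Append one record to datapath while holding the lock on lockpath.
 * Returns 0 on success and -1 on any failure.
 */
int
locked_append(const char *lockpath, const char *datapath, const char *rec)
{
	size_t len;
	int error, fd, lockfd;

	assert(rec != NULL);
	len = strlen(rec);
	error = -1;

	lockfd = open(lockpath, O_RDWR | O_CREAT, 0644);
	if (lockfd < 0)
		return (-1);
	if (set_lock(lockfd, F_WRLCK) == 0) {		/* acquire the lock */
		/* open(): a cto client revalidates cached attributes here */
		fd = open(datapath, O_WRONLY | O_CREAT | O_APPEND, 0644);
		if (fd >= 0) {
			if (write(fd, rec, len) == (ssize_t)len)
				error = 0;
			/* close(): a cto client flushes dirty data here */
			if (close(fd) != 0)
				error = -1;
		}
		(void)set_lock(lockfd, F_UNLCK);	/* drop the lock */
	}
	(void)close(lockfd);
	return (error);
}
```

Each client would call something like locked_append("shared.lock",
"shared.dat", "record\n"). Whether the next lock holder actually sees
that record is exactly what cto consistency decides.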

> NFSv4 simply
> says that clients that care about cache consistency should use byte
> range locking. Here's what RFC3530 (the NFSv4 RFC) says:

I don't know anything about NFSv4 (or Delegations). I figured that by the time
NFSv4 gains wide acceptance, I'd be happily retired :), so I haven't bothered.
I am mostly interested in a robust NFSv3 implementation :)

At least in the context of NFSv2/3, byte range locking is a necessary but 
not sufficient condition for correctness. For correctness, you require
Byte Range Locking + Direct IO (or some equivalent that bypasses client 
caching). 

> For NFSv2/3, maintaining close-to-open consistency may be appropriate
> (and necessary), but it can result in a big performance hit. 

Cool. So we agree on that :)

I agree there's a performance hit, and it is unavoidable. The best we can do is
mitigate it with things like a nocto mount option, or namei flags that NFS can
set when the lookup has already done an over-the-wire getattr.

mohan

From owner-freebsd-fs@FreeBSD.ORG  Mon Oct 16 23:59:01 2006
Return-Path: <owner-freebsd-fs@FreeBSD.ORG>
X-Original-To: fs@freebsd.org
Delivered-To: freebsd-fs@FreeBSD.ORG
Received: from mx1.FreeBSD.org (mx1.freebsd.org [216.136.204.125])
	by hub.freebsd.org (Postfix) with ESMTP id 90BC416A412
	for <fs@freebsd.org>; Mon, 16 Oct 2006 23:59:01 +0000 (UTC)
	(envelope-from andrew@areilly.bpa.nu)
Received: from omta03sl.mx.bigpond.com (omta03sl.mx.bigpond.com
	[144.140.92.155])
	by mx1.FreeBSD.org (Postfix) with ESMTP id 72BA143D5C
	for <fs@freebsd.org>; Mon, 16 Oct 2006 23:58:59 +0000 (GMT)
	(envelope-from andrew@areilly.bpa.nu)
Received: from areilly.bpa.nu ([141.168.2.3]) by omta03sl.mx.bigpond.com
	with ESMTP
	id <20061016235857.VHUV8785.omta03sl.mx.bigpond.com@areilly.bpa.nu>
	for <fs@freebsd.org>; Mon, 16 Oct 2006 23:58:57 +0000
Received: (qmail 38750 invoked by uid 501); 16 Oct 2006 23:56:58 -0000
Date: Tue, 17 Oct 2006 09:56:58 +1000
From: Andrew Reilly <andrew-freebsd@areilly.bpc-users.org>
To: Mohan Srinivasan <mohan_srinivasan@yahoo.com>
Message-ID: <20061016235658.GA38613@duncan.reilly.home>
References: <200610161532.LAA59652@snowhite.cis.uoguelph.ca>
	<20061016220531.49385.qmail@web30804.mail.mud.yahoo.com>
Mime-Version: 1.0
Content-Type: text/plain; charset=us-ascii
Content-Disposition: inline
In-Reply-To: <20061016220531.49385.qmail@web30804.mail.mud.yahoo.com>
User-Agent: Mutt/1.4.2.2i
Cc: fs@freebsd.org, rick@snowhite.cis.uoguelph.ca
Subject: Re: lost dotdot caching pessimizes nfs especially
X-BeenThere: freebsd-fs@freebsd.org
X-Mailman-Version: 2.1.5
Precedence: list
List-Id: Filesystems <freebsd-fs.freebsd.org>
List-Unsubscribe: <http://lists.freebsd.org/mailman/listinfo/freebsd-fs>,
	<mailto:freebsd-fs-request@freebsd.org?subject=unsubscribe>
List-Archive: <http://lists.freebsd.org/pipermail/freebsd-fs>
List-Post: <mailto:freebsd-fs@freebsd.org>
List-Help: <mailto:freebsd-fs-request@freebsd.org?subject=help>
List-Subscribe: <http://lists.freebsd.org/mailman/listinfo/freebsd-fs>,
	<mailto:freebsd-fs-request@freebsd.org?subject=subscribe>
X-List-Received-Date: Mon, 16 Oct 2006 23:59:01 -0000

On Mon, Oct 16, 2006 at 03:05:31PM -0700, Mohan Srinivasan wrote:
> 2) Without cto consistency, something as simple as editing a file on one
>    client and compiling it on another won't work anymore. Breaking this is
>    sure to send people with pitchforks running after the perpetrator :)

When I was doing lots of NFS-hosted development, years ago, I
quickly learned to edit on the same machine as I was building
on.  X makes that easy.  That was in the late-80s-early-90s time
frame: I haven't had much use for NFS since then.  That's
changing now, though.

That's not to say that this is the way that it should be.  Just
that it's the way that it always used to be.

Cheers,

-- 
Andrew

From owner-freebsd-fs@FreeBSD.ORG  Tue Oct 17 18:13:41 2006
Return-Path: <owner-freebsd-fs@FreeBSD.ORG>
X-Original-To: freebsd-fs@freebsd.org
Delivered-To: freebsd-fs@freebsd.org
Received: from mx1.FreeBSD.org (mx1.freebsd.org [216.136.204.125])
	by hub.freebsd.org (Postfix) with ESMTP id 4ACFE16A403
	for <freebsd-fs@freebsd.org>; Tue, 17 Oct 2006 18:13:41 +0000 (UTC)
	(envelope-from mday@apple.com)
Received: from mail-out3.apple.com (mail-out3.apple.com [17.254.13.22])
	by mx1.FreeBSD.org (Postfix) with ESMTP id 54A2F43D5A
	for <freebsd-fs@freebsd.org>; Tue, 17 Oct 2006 18:13:32 +0000 (GMT)
	(envelope-from mday@apple.com)
Received: from relay8.apple.com (a17-128-113-38.apple.com [17.128.113.38])
	by mail-out3.apple.com (8.12.11/8.12.11) with ESMTP id k9HIDWA2021194
	for <freebsd-fs@freebsd.org>; Tue, 17 Oct 2006 11:13:32 -0700 (PDT)
Received: from [17.202.43.217] (unknown [17.202.43.217])
	by relay8.apple.com (Apple SCV relay) with ESMTP id EAB69638
	for <freebsd-fs@freebsd.org>; Tue, 17 Oct 2006 11:13:31 -0700 (PDT)
Message-Id: <AEDCDDFB-CAD9-4F33-92FE-FA3D3E30F485@apple.com>
From: Mark Day <mday@apple.com>
To: freebsd-fs@freebsd.org
In-Reply-To: <44d58sos3a.fsf@be-well.ilk.org>
Content-Type: text/plain; charset=UTF-8; format=flowed; delsp=yes
X-Smtp-Server: relay.apple.com
Mime-Version: 1.0 (Apple Message framework v851)
Content-Transfer-Encoding: quoted-printable
Date: Tue, 17 Oct 2006 11:13:31 -0700
References: <200609292159.18282.absorbb@gmail.com>
	<44d58sos3a.fsf@be-well.ilk.org>
X-Mailer: Apple Mail (2.851)
X-Brightmail-Tracker: AAAAAA==
Subject: Re: ntfs broken when share through samba3
X-BeenThere: freebsd-fs@freebsd.org
X-Mailman-Version: 2.1.5
Precedence: list
List-Id: Filesystems <freebsd-fs.freebsd.org>
List-Unsubscribe: <http://lists.freebsd.org/mailman/listinfo/freebsd-fs>,
	<mailto:freebsd-fs-request@freebsd.org?subject=unsubscribe>
List-Archive: <http://lists.freebsd.org/pipermail/freebsd-fs>
List-Post: <mailto:freebsd-fs@freebsd.org>
List-Help: <mailto:freebsd-fs-request@freebsd.org?subject=help>
List-Subscribe: <http://lists.freebsd.org/mailman/listinfo/freebsd-fs>,
	<mailto:freebsd-fs-request@freebsd.org?subject=subscribe>
X-List-Received-Date: Tue, 17 Oct 2006 18:13:41 -0000

On Oct 16, 2006, at 2:42 PM, Lowell Gilbert wrote:

> absorbb@gmail.com (Ильдар Нурисламов) writes:
>
>> This old already reported bug.
>> But situation have'nt changed.
>
> For example, kern/86965.
>
>> There is very simple patch that fix this bug:
>>
>> --- usr/src/sys/fs/ntfs/ntfs_vnops.c	Mon Mar 13 00:50:01 2006
>> +++ home/voxel/stuff/ntfs_vnops.c	Thu Aug 31 09:22:08 2006
>> @@ -187,7 +187,8 @@
>>  	vap->va_fsid = dev2udev(ip->i_dev);
>>  	vap->va_fileid = ip->i_number;
>>  	vap->va_mode = ip->i_mp->ntm_mode;
>> -	vap->va_nlink = ip->i_nlink;
>> +	vap->va_nlink = (ip->i_nlink ? ip->i_nlink : 1);
>> +	//vap->va_nlink = ip->i_nlink;
>>  	vap->va_uid = ip->i_mp->ntm_uid;
>>  	vap->va_gid = ip->i_mp->ntm_gid;
>>  	vap->va_rdev = 0;				/* XXX UNODEV ? */
>>
>> but it seems to be not beaty solution
>
> Not beautiful, indeed.
>
> I was playing around with this, and although that change would work
> around the problem in (at least) most cases, I am not sure that it is
> truly correct.
>
> I am not an expert at filesystems, and certainly have little knowledge
> of NTFS.  However, my observations confuse me considerably.  The main
> issue is that if you read from a file (on NTFS, with a link count of
> zero according to ls(1)), the link count becomes populated.  I cannot
> see how that would happen, because the ntnode structure link count is
> not modified except when reading the whole structure from the disk,
> and the on-disk node is not being changed.  To confuse things further,
> the link count is changed to 2, not 1, on ordinary files that have
> only a single directory entry.  I do not believe that streams are at
> issue, because the file has no open file descriptors remaining
> according to fstat(1).

IIRC, the NTFS code tries to populate a vnode based on the limited
information present in the directory entries it sees.  It's trying to
avoid having to go read the Master File Table record (the i-node
equivalent) until it actually needs that information (such as the link
count).  The ntfs_loadntnode() routine will read in the MFT record and
populate the rest of the vnode's fields.  There's a flag
(VG_DONTLOADIN) to pass to ntfs_vgetex to control whether the MFT-based
fields get filled in when you get the vnode.

Hope this helps,
-Mark


From owner-freebsd-fs@FreeBSD.ORG  Tue Oct 17 21:09:50 2006
Return-Path: <owner-freebsd-fs@FreeBSD.ORG>
X-Original-To: freebsd-fs@freebsd.org
Delivered-To: freebsd-fs@freebsd.org
Received: from mx1.FreeBSD.org (mx1.freebsd.org [216.136.204.125])
	by hub.freebsd.org (Postfix) with ESMTP id 7A08916A40F
	for <freebsd-fs@freebsd.org>; Tue, 17 Oct 2006 21:09:50 +0000 (UTC)
	(envelope-from lgusenet@be-well.ilk.org)
Received: from mail8.sea5.speakeasy.net (mail8.sea5.speakeasy.net
	[69.17.117.10]) by mx1.FreeBSD.org (Postfix) with ESMTP id 7559143D76
	for <freebsd-fs@freebsd.org>; Tue, 17 Oct 2006 21:09:43 +0000 (GMT)
	(envelope-from lgusenet@be-well.ilk.org)
Received: (qmail 21923 invoked from network); 17 Oct 2006 21:09:43 -0000
Received: from dsl092-078-145.bos1.dsl.speakeasy.net (HELO be-well.ilk.org)
	([66.92.78.145]) (envelope-sender <lgusenet@be-well.ilk.org>)
	by mail8.sea5.speakeasy.net (qmail-ldap-1.03) with SMTP
	for <freebsd-fs@freebsd.org>; 17 Oct 2006 21:09:43 -0000
Received: by be-well.ilk.org (Postfix, from userid 1147)
	id 813BC28433; Tue, 17 Oct 2006 17:09:42 -0400 (EDT)
To: freebsd-fs@freebsd.org
References: <200609292159.18282.absorbb@gmail.com>
	<44d58sos3a.fsf@be-well.ilk.org>
	<AEDCDDFB-CAD9-4F33-92FE-FA3D3E30F485@apple.com>
From: Lowell Gilbert <lgfbsd@be-well.ilk.org>
Date: Tue, 17 Oct 2006 17:09:42 -0400
In-Reply-To: <AEDCDDFB-CAD9-4F33-92FE-FA3D3E30F485@apple.com> (Mark Day's
	message of "Tue, 17 Oct 2006 11:13:31 -0700")
Message-ID: <44d58qmyy1.fsf@be-well.ilk.org>
User-Agent: Gnus/5.11 (Gnus v5.11) Emacs/22.0.50 (berkeley-unix)
MIME-Version: 1.0
Content-Type: text/plain; charset=utf-8
Content-Transfer-Encoding: quoted-printable
Subject: Re: ntfs broken when share through samba3
X-BeenThere: freebsd-fs@freebsd.org
X-Mailman-Version: 2.1.5
Precedence: list
List-Id: Filesystems <freebsd-fs.freebsd.org>
List-Unsubscribe: <http://lists.freebsd.org/mailman/listinfo/freebsd-fs>,
	<mailto:freebsd-fs-request@freebsd.org?subject=unsubscribe>
List-Archive: <http://lists.freebsd.org/pipermail/freebsd-fs>
List-Post: <mailto:freebsd-fs@freebsd.org>
List-Help: <mailto:freebsd-fs-request@freebsd.org?subject=help>
List-Subscribe: <http://lists.freebsd.org/mailman/listinfo/freebsd-fs>,
	<mailto:freebsd-fs-request@freebsd.org?subject=subscribe>
X-List-Received-Date: Tue, 17 Oct 2006 21:09:50 -0000

mday@apple.com (Mark Day) writes:

> On Oct 16, 2006, at 2:42 PM, Lowell Gilbert wrote:
>
>> absorbb@gmail.com (Ильдар Нурисламов) writes:
>>
>>> This old already reported bug.
>>> But situation have'nt changed.
>>
>> For example, kern/86965.
>>
>>> There is very simple patch that fix this bug:
>>>
>>> --- usr/src/sys/fs/ntfs/ntfs_vnops.c	Mon Mar 13 00:50:01 2006
>>> +++ home/voxel/stuff/ntfs_vnops.c	Thu Aug 31 09:22:08 2006
>>> @@ -187,7 +187,8 @@
>>>  	vap->va_fsid = dev2udev(ip->i_dev);
>>>  	vap->va_fileid = ip->i_number;
>>>  	vap->va_mode = ip->i_mp->ntm_mode;
>>> -	vap->va_nlink = ip->i_nlink;
>>> +	vap->va_nlink = (ip->i_nlink ? ip->i_nlink : 1);
>>> +	//vap->va_nlink = ip->i_nlink;
>>>  	vap->va_uid = ip->i_mp->ntm_uid;
>>>  	vap->va_gid = ip->i_mp->ntm_gid;
>>>  	vap->va_rdev = 0;				/* XXX UNODEV ? */
>>>
>>> but it seems to be not beaty solution
>>
>> Not beautiful, indeed.
>>
>> I was playing around with this, and although that change would work
>> around the problem in (at least) most cases, I am not sure that it is
>> truly correct.
>>
>> I am not an expert at filesystems, and certainly have little knowledge
>> of NTFS.  However, my observations confuse me considerably.  The main
>> issue is that if you read from a file (on NTFS, with a link count of
>> zero according to ls(1)), the link count becomes populated.  I cannot
>> see how that would happen, because the ntnode structure link count is
>> not modified except when reading the whole structure from the disk,
>> and the on-disk node is not being changed.  To confuse things further,
>> the link count is changed to 2, not 1, on ordinary files that have
>> only a single directory entry.  I do not believe that streams are at
>> issue, because the file has no open file descriptors remaining
>> according to fstat(1).
>
> IIRC, the NTFS code tries to populate a vnode based on the limited
> information present in the directory entries it sees.  It's trying to
> avoid having to go read the Master File Table record (the i-node
> equivalent) until it actually needs that information (such as the link
> count).  The ntfs_loadntnode() routine will read in the MFT record and
> populate the rest of the vnode's fields.  There's a flag
> (VG_DONTLOADIN) to pass to ntfs_vgetex to control whether the MFT-based
> fields get filled in when you get the vnode.
>
> Hope this helps,

Yes, that clears things up for me considerably.  Thank you, Mark.

One thing I can see from that is that the proper loading of the ntnode
from the MFT will not be affected by faking a link count; the
IN_LOADED flag will take care of that.  Furthermore, testing that same
flag in this patch, so that a genuine zero count read from the MFT is
not overridden, would address the rest of my major concerns.  The patch
would end up more like:

 --- usr/src/sys/fs/ntfs/ntfs_vnops.c	Mon Mar 13 00:50:01 2006
 +++ home/voxel/stuff/ntfs_vnops.c	Thu Aug 31 09:22:08 2006
 @@ -187,7 +187,7 @@
  	vap->va_fsid = dev2udev(ip->i_dev);
  	vap->va_fileid = ip->i_number;
  	vap->va_mode = ip->i_mp->ntm_mode;
 -	vap->va_nlink = ip->i_nlink;
 +	vap->va_nlink = (ip->i_nlink || (ip->i_flag & IN_LOADED)) ? ip->i_nlink : 1;
  	vap->va_uid = ip->i_mp->ntm_uid;
  	vap->va_gid = ip->i_mp->ntm_gid;
  	vap->va_rdev = 0;				/* XXX UNODEV ? */

This still doesn't meet POSIX requirements, but to do that would
require reading the whole MFT entry every time, instead of just the
directory entries.  That optimization speeds things up a lot in large
filename searches, so this seems like a good compromise to me.

Or am I missing something?

From owner-freebsd-fs@FreeBSD.ORG  Wed Oct 18 06:40:51 2006
Return-Path: <owner-freebsd-fs@FreeBSD.ORG>
X-Original-To: fs@FreeBSD.org
Delivered-To: freebsd-fs@FreeBSD.ORG
Received: from mx1.FreeBSD.org (mx1.freebsd.org [216.136.204.125])
	by hub.freebsd.org (Postfix) with ESMTP id 0184E16A403;
	Wed, 18 Oct 2006 06:40:51 +0000 (UTC) (envelope-from bde@zeta.org.au)
Received: from mailout2.pacific.net.au (mailout2-3.pacific.net.au [61.8.2.226])
	by mx1.FreeBSD.org (Postfix) with ESMTP id 4143043D55;
	Wed, 18 Oct 2006 06:40:50 +0000 (GMT) (envelope-from bde@zeta.org.au)
Received: from mailproxy2.pacific.net.au (mailproxy2.pacific.net.au
	[61.8.2.163])
	by mailout2.pacific.net.au (Postfix) with ESMTP id 6FFE06E826;
	Wed, 18 Oct 2006 16:40:46 +1000 (EST)
Received: from katana.zip.com.au (katana.zip.com.au [61.8.7.246])
	by mailproxy2.pacific.net.au (Postfix) with ESMTP id 73A132744E;
	Wed, 18 Oct 2006 16:40:44 +1000 (EST)
Date: Wed, 18 Oct 2006 16:40:43 +1000 (EST)
From: Bruce Evans <bde@zeta.org.au>
X-X-Sender: bde@delplex.bde.org
To: Chuck Lever <chucklever@gmail.com>
In-Reply-To: <20061017113943.C67620@delplex.bde.org>
Message-ID: <20061018153336.E72684@delplex.bde.org>
References: <200610140725.k9E7PC37008454@repoman.freebsd.org> 
	<20061014231502.GA38708@rink.nu>
	<20061015105809.M59123@delplex.bde.org> 
	<20061015051044.GA42764@xor.obsecurity.org>
	<20061014222221.H97880@ns1.feral.com>
	<20061014222437.N4701@ns1.feral.com>
	<20061015153454.G59979@delplex.bde.org>
	<76bd70e30610150837w61689cf6ya2499d100a15c3e8@mail.gmail.com> 
	<20061016164122.S63585@delplex.bde.org>
	<76bd70e30610160620x67e5d3a5j938c26744d0b9759@mail.gmail.com>
	<20061017113943.C67620@delplex.bde.org>
MIME-Version: 1.0
Content-Type: TEXT/PLAIN; charset=US-ASCII; format=flowed
Cc: mjacob@FreeBSD.org, fs@FreeBSD.org, Kris Kennaway <kris@obsecurity.org>
Subject: negative cache hits for nfs (was: cvs commit: ...)
X-BeenThere: freebsd-fs@freebsd.org
X-Mailman-Version: 2.1.5
Precedence: list
List-Id: Filesystems <freebsd-fs.freebsd.org>
List-Unsubscribe: <http://lists.freebsd.org/mailman/listinfo/freebsd-fs>,
	<mailto:freebsd-fs-request@freebsd.org?subject=unsubscribe>
List-Archive: <http://lists.freebsd.org/pipermail/freebsd-fs>
List-Post: <mailto:freebsd-fs@freebsd.org>
List-Help: <mailto:freebsd-fs-request@freebsd.org?subject=help>
List-Subscribe: <http://lists.freebsd.org/mailman/listinfo/freebsd-fs>,
	<mailto:freebsd-fs-request@freebsd.org?subject=subscribe>
X-List-Received-Date: Wed, 18 Oct 2006 06:40:51 -0000

[I changed the Cc from cvs* to fs]

On Tue, 17 Oct 2006, Bruce Evans wrote:

> On Mon, 16 Oct 2006, Chuck Lever wrote:
>
>> On 10/16/06, Bruce Evans <bde@zeta.org.au> wrote:
>>> On Sun, 15 Oct 2006, Chuck Lever wrote:
>>>> [An independent timeout for the access cache isn't useful.]
>>> 
>>> I'll try removing the special support for the access cache timeout in
>>> rc.conf first.
>> 
>> OK.  I can review patches if you think that would help, but I can't
>> contribute code at the moment because of IP issues at my current
>> employer.  Hopefully that will change soon.
>
> Thanks.  Removing it in rc.conf won't require review :-).

>> ...
>> Another thing to consider is that a LOOKUP is usually more expensive
>> for servers than a GETATTR.  If your client has already cached lookup
>> results for the file to be opened, you can get away with a GETATTR on
>> the parent directory to verify that it has not changed, and that will
>> almost always be faster than doing a full LOOKUP.
>
> FreeBSD's client is doing not very good things for Lookup too.  It is
> missing caching of negative lookups.  make(1) likes to do a lot of
> negative lookups...  NetBSD fixed this in 1997, sigh.

Here is a merge of some bits from NetBSD for review.  It is mostly the
1997 version, with updates to use timespecs instead of time_t's, but
not updates to use changes that don't seem to be related to correctness,
or ones less than 18 months old (if any).

% Index: nfs_vnops.c
% ===================================================================
% RCS file: /home/ncvs/src/sys/nfsclient/nfs_vnops.c,v
% retrieving revision 1.270
% diff -u -2 -r1.270 nfs_vnops.c
% --- nfs_vnops.c	14 Oct 2006 07:25:11 -0000	1.270
% +++ nfs_vnops.c	18 Oct 2006 01:41:14 -0000
% @@ -852,7 +869,17 @@
%  		return (error);
%  	}
% -	if ((error = cache_lookup(dvp, vpp, cnp)) && error != ENOENT) {
% +	if ((error = cache_lookup(dvp, vpp, cnp)) != 0) {
%  		struct vattr vattr;
% 
% +		if (error == ENOENT) {
% +			/* Negative cache hit.  Use it unless stale. */
% +			if (VOP_GETATTR(dvp, &vattr, cnp->cn_cred, td) == 0 &&
% +			    timespeccmp(&vattr.va_mtime, &np->n_nctime, ==))
% +				return (ENOENT);
% +
% +			cache_purge(dvp);
% +			timespecclear(&np->n_nctime);
% +			goto dorpc;
% +		}
%  		newvp = *vpp;
%  		if (!VOP_GETATTR(newvp, &vattr, cnp->cn_cred, td)
% @@ -871,4 +898,5 @@
%  		*vpp = NULLVP;
%  	}
% +dorpc:
%  	error = 0;
%  	newvp = NULLVP;
% @@ -951,4 +979,11 @@
%  nfsmout:
%  	if (error) {
% +		if (error == ENOENT && (cnp->cn_flags & MAKEENTRY) &&
% +		    cnp->cn_nameiop != CREATE) {
% +			/* Negative cache entry. */
% +			if (!timespecisset(&np->n_nctime))
% +				np->n_nctime = np->n_vattr.va_mtime;
% +			cache_enter(dvp, NULL, cnp);
% +		}
%  		if (newvp != NULLVP) {
%  			vput(newvp);
% @@ -1931,6 +1966,9 @@
%  		if (newvp)
%  			vput(newvp);
% -	} else
% +	} else {
% +		if (cnp->cn_flags & MAKEENTRY)
% +			cache_enter(dvp, newvp, cnp);
%  		*ap->a_vpp = newvp;
% +	}
%  	return (error);
%  }
% Index: nfsnode.h
% ===================================================================
% RCS file: /home/ncvs/src/sys/nfsclient/nfsnode.h,v
% retrieving revision 1.59
% diff -u -2 -r1.59 nfsnode.h
% --- nfsnode.h	13 Sep 2006 18:39:09 -0000	1.59
% +++ nfsnode.h	18 Oct 2006 00:48:44 -0000
% @@ -100,4 +100,5 @@
%  	struct timespec		n_mtime;	/* Prev modify time. */
%  	time_t			n_ctime;	/* Prev create time. */
% +	struct timespec		n_nctime;	/* Last neg cache entry (dir) */
%  	time_t			n_expiry;	/* Lease expiry time */
%  	nfsfh_t			*n_fhp;		/* NFS File Handle */

For building kernels, this gives a larger speedup than everything that
I tried short of completely dropping cto consistency, provided dotdot
caching in vfs_cache.c isn't lost.  The following benchmarks are also
with zapping of the attribute cache turned off in nfs_close() to avoid
doubled Getattr's in open() without breaking cto consistency.
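
To make the negative-cache logic in the nfs_lookup() hunks above easier
to review, here is a stripped-down user-space model of it. Everything in
it is a simplification: struct dir_sketch, neg_cache_check() and
neg_cache_enter() are hypothetical names, and a single boolean stands in
for the name cache entries that cache_enter() and cache_purge() manage
in the kernel. It only models the rule that a negative hit is trusted
while the directory's mtime still equals the recorded n_nctime.

```c
#include <assert.h>
#include <errno.h>
#include <stdbool.h>
#include <time.h>

/* Hypothetical stand-in for the per-directory state in struct nfsnode. */
struct dir_sketch {
	struct timespec	mtime;		/* directory mtime as last fetched */
	struct timespec	nctime;		/* mtime when negative entries began */
	bool		has_negative;	/* a negative cache entry exists */
};

static bool
ts_equal(const struct timespec *a, const struct timespec *b)
{
	return (a->tv_sec == b->tv_sec && a->tv_nsec == b->tv_nsec);
}

static bool
ts_isset(const struct timespec *ts)
{
	return (ts->tv_sec != 0 || ts->tv_nsec != 0);
}

/*
 * Lookup side: return ENOENT if the negative entry can be used, or 0 if
 * the caller has to go to the server (the "goto dorpc" case).
 */
int
neg_cache_check(struct dir_sketch *dp)
{
	assert(dp != NULL);
	if (!dp->has_negative)
		return (0);
	if (ts_equal(&dp->mtime, &dp->nctime))
		return (ENOENT);	/* directory unchanged: still absent */
	/* Directory changed: purge the entry and clear the timestamp. */
	dp->has_negative = false;
	dp->nctime.tv_sec = 0;
	dp->nctime.tv_nsec = 0;
	return (0);
}

/* RPC side: the server said ENOENT, so remember a negative entry. */
void
neg_cache_enter(struct dir_sketch *dp)
{
	assert(dp != NULL);
	if (!ts_isset(&dp->nctime))
		dp->nctime = dp->mtime;
	dp->has_negative = true;
}
```

The model keeps the patch's choice of recording the timestamp only when
it is not already set, so repeated negative entries in one directory all
share the mtime seen at the first of them.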

Times and nfsstats are for the second run of "make depend; sync" and
"make; sync" after "make clean cleandepend; sync; sleep 1" in each run,
with RELENG_4 kernel sources, ~RELENG_5 userland and -current+ kernel,
sources and obj and /usr on nfs, unloaded network latency 100uS, ...

Before:
        12.75 real         5.19 user         1.58 sys
  Lookup Read Write Access Getattr Other   Total
   14203  548   599  21561     454    97   37462
        78.80 real        62.01 user         4.45 sys
  Lookup Read Write Create Access Fsstat Other   Total
   19543 2410  5353    442  24241   1742    14   53745

After:
        10.68 real         5.20 user         1.42 sys
  Lookup Read Write Access Getattr Other   Total
    1268  548   599  21575     454   112   24556
        76.38 real        62.00 user         4.28 sys
  Lookup Read Write Create Access Fsstat Other   Total
    4122 2410  5353    442  24222   1750    14   38313

The number of Lookups has been reduced by a factor of 11+ for make depend
and just under 5 for make.
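
As a quick check of those factors, the Lookup counts from the Before and
After tables can simply be divided. The helper name reduction_factor()
is made up for this sketch; the inputs are the counts quoted above.

```c
#include <assert.h>

/* Ratio of an RPC count before a change to the count after it. */
double
reduction_factor(unsigned long before, unsigned long after)
{
	assert(after > 0);	/* a zero "after" count has no finite ratio */
	return ((double)before / (double)after);
}
```

With the numbers above, reduction_factor(14203, 1268) is about 11.2 for
the first run and reduction_factor(19543, 4122) is about 4.7 for the
second.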

With lost dotdot caching, After:
        11.02 real         5.25 user         1.40 sys
  Lookup Read Write Access Getattr Other   Total
    3031  548   599  21574     453   112   26317
        84.19 real        62.20 user         4.71 sys
  Lookup Read Write Create Access Fsstat Other   Total
   45063 2410  5353    442  24290   1750    14   79322

This does another 40000+ Lookups, much the same as Before, but ending
up with 45000+ instead of 59000+.

Of course things are slower with cold caches, but they aren't much
slower (less than what losing dotdot caching costs).  According to
vfs.cache.numcache, there are fewer than 2000 files to look up, so
even 4122 Lookups is a lot.  (vfs.cache.numcache was 1482 and
vfs.cache.numneg was 103 after the run that produced the above
statistics.  This includes a few other files looked up since
rebooting a few minutes earlier.  I wasn't careful about starting
from scratch without dotdot caching.)

With cto consistency completely turned off, without lost dotdot caching,
After:
         7.54 real         5.17 user         1.13 sys
  Lookup Read Write Access Getattr Other   Total
    1284  548   599   2161     454   112    5158
        72.67 real        61.95 user         3.82 sys
  Lookup Read Write Create Access Fsstat Other   Total
    4799 2410  5353    442   1529   1750    14   16297

This reduces the Access count by a factor of almost 18.

Now another pessimization is more obvious -- why should building kernels
be doing all those Fsstat's?  I haven't located them for sure, but I
suspect opendir, which always calls statfs(2) for the silly purpose of
determining whether it needs to do extra work to support unionfs.

For comparison:

>From now on, kernel source and obj are on a local file system.  nfs
is still on /usr, so there are a few RPCs for execing things.

Before (without lost dotdot caching, with cto consistency):
         6.39 real         5.04 user         1.08 sys
  Lookup Access Other   Total
     865    651     3    1519
        66.86 real        61.42 user         4.45 sys
  Lookup Access Other   Total
    3115   1350     1    4466

Before (without lost dotdot caching, and _without_ cto consistency):
         6.17 real         5.25 user         0.88 sys
  Other   Total
     86      86
        66.61 real        61.54 user         4.74 sys
  Other   Total
     94      94

The differences that the nfs changes make with just /usr on nfs are
measurable but tiny, at least in my configuration.  All my executables
are statically linked; otherwise the extra opens for cto consistency of
the shared libraries would give many more than 1350 Access RPCs.

Bruce

From owner-freebsd-fs@FreeBSD.ORG  Wed Oct 18 16:20:23 2006
Return-Path: <owner-freebsd-fs@FreeBSD.ORG>
X-Original-To: fs@freebsd.org
Delivered-To: freebsd-fs@FreeBSD.ORG
Received: from mx1.FreeBSD.org (mx1.freebsd.org [216.136.204.125])
	by hub.freebsd.org (Postfix) with ESMTP id 8AF4616A40F
	for <fs@freebsd.org>; Wed, 18 Oct 2006 16:20:23 +0000 (UTC)
	(envelope-from chucklever@gmail.com)
Received: from ug-out-1314.google.com (ug-out-1314.google.com [66.249.92.173])
	by mx1.FreeBSD.org (Postfix) with ESMTP id 0228843D55
	for <fs@freebsd.org>; Wed, 18 Oct 2006 16:20:12 +0000 (GMT)
	(envelope-from chucklever@gmail.com)
Received: by ug-out-1314.google.com with SMTP id m2so189706uge
	for <fs@freebsd.org>; Wed, 18 Oct 2006 09:20:12 -0700 (PDT)
DomainKey-Signature: a=rsa-sha1; q=dns; c=nofws; s=beta; d=gmail.com;
	h=received:message-id:date:from:to:subject:cc:in-reply-to:mime-version:content-type:content-transfer-encoding:content-disposition:references;
	b=EtG/n6vuRcapG4MZJ7GklWoR7PACska6J3SG2RY8lGRZd0UzCp/SBXJ+dmHEKxrTF6MQBmb0FiEKiVcQiuVj9xjFAwnLctfFTLFPz+CbtZLhnyaaBApGvcXO9FXWb7Nf35IhfraGxkL1tl/Vdga1qibI5786xFhUBKEtugT3rRY=
Received: by 10.82.109.19 with SMTP id h19mr2350949buc;
	Wed, 18 Oct 2006 09:20:11 -0700 (PDT)
Received: by 10.78.202.20 with HTTP; Wed, 18 Oct 2006 09:20:11 -0700 (PDT)
Message-ID: <76bd70e30610180920m1918e84s8e1c3b0f02de712e@mail.gmail.com>
Date: Wed, 18 Oct 2006 12:20:11 -0400
From: "Chuck Lever" <chucklever@gmail.com>
To: "Bruce Evans" <bde@zeta.org.au>
In-Reply-To: <20061018153336.E72684@delplex.bde.org>
MIME-Version: 1.0
Content-Type: text/plain; charset=ISO-8859-1; format=flowed
Content-Transfer-Encoding: 7bit
Content-Disposition: inline
References: <200610140725.k9E7PC37008454@repoman.freebsd.org>
	<20061015051044.GA42764@xor.obsecurity.org>
	<20061014222221.H97880@ns1.feral.com>
	<20061014222437.N4701@ns1.feral.com>
	<20061015153454.G59979@delplex.bde.org>
	<76bd70e30610150837w61689cf6ya2499d100a15c3e8@mail.gmail.com>
	<20061016164122.S63585@delplex.bde.org>
	<76bd70e30610160620x67e5d3a5j938c26744d0b9759@mail.gmail.com>
	<20061017113943.C67620@delplex.bde.org>
	<20061018153336.E72684@delplex.bde.org>
Cc: mjacob@freebsd.org, fs@freebsd.org, Kris Kennaway <kris@obsecurity.org>
Subject: Re: negative cache hits for nfs (was: cvs commit: ...)
X-BeenThere: freebsd-fs@freebsd.org
X-Mailman-Version: 2.1.5
Precedence: list
List-Id: Filesystems <freebsd-fs.freebsd.org>
List-Unsubscribe: <http://lists.freebsd.org/mailman/listinfo/freebsd-fs>,
	<mailto:freebsd-fs-request@freebsd.org?subject=unsubscribe>
List-Archive: <http://lists.freebsd.org/pipermail/freebsd-fs>
List-Post: <mailto:freebsd-fs@freebsd.org>
List-Help: <mailto:freebsd-fs-request@freebsd.org?subject=help>
List-Subscribe: <http://lists.freebsd.org/mailman/listinfo/freebsd-fs>,
	<mailto:freebsd-fs-request@freebsd.org?subject=subscribe>
X-List-Received-Date: Wed, 18 Oct 2006 16:20:23 -0000

Hi Bruce-

On 10/18/06, Bruce Evans <bde@zeta.org.au> wrote:
> Here is a merge of some bits from NetBSD for review.  It is mostly the
> 1997 version, with updates to use timespecs instead of time_t's, but
> not updates to use changes that don't seem to be related to correctness,
> or ones less than 18 months old (if any).
>
> % Index: nfs_vnops.c
> % ===================================================================
> % RCS file: /home/ncvs/src/sys/nfsclient/nfs_vnops.c,v
> % retrieving revision 1.270
> % diff -u -2 -r1.270 nfs_vnops.c
> % --- nfs_vnops.c       14 Oct 2006 07:25:11 -0000      1.270
> % +++ nfs_vnops.c       18 Oct 2006 01:41:14 -0000
> % @@ -852,7 +869,17 @@
> %               return (error);
> %       }
> % -     if ((error = cache_lookup(dvp, vpp, cnp)) && error != ENOENT) {
> % +     if ((error = cache_lookup(dvp, vpp, cnp)) != 0) {
> %               struct vattr vattr;
> %
> % +             if (error == ENOENT) {
> % +                     /* Negative cache hit.  Use it unless stale. */
> % +                     if (VOP_GETATTR(dvp, &vattr, cnp->cn_cred, td) == 0 &&
> % +                         timespeccmp(&vattr.va_mtime, &np->n_nctime, ==))
> % +                             return (ENOENT);
> % +
> % +                     cache_purge(dvp);
> % +                     timespecclear(&np->n_nctime);
> % +                     goto dorpc;
> % +             }
> %               newvp = *vpp;
> %               if (!VOP_GETATTR(newvp, &vattr, cnp->cn_cred, td)
> % @@ -871,4 +898,5 @@
> %               *vpp = NULLVP;
> %       }
> % +dorpc:
> %       error = 0;
> %       newvp = NULLVP;
> % @@ -951,4 +979,11 @@
> %  nfsmout:
> %       if (error) {
> % +             if (error == ENOENT && (cnp->cn_flags & MAKEENTRY) &&
> % +                 cnp->cn_nameiop != CREATE) {
> % +                     /* Negative cache entry. */
> % +                     if (!timespecisset(&np->n_nctime))
> % +                             np->n_nctime = np->n_vattr.va_mtime;
> % +                     cache_enter(dvp, NULL, cnp);
> % +             }
> %               if (newvp != NULLVP) {
> %                       vput(newvp);
> % @@ -1931,6 +1966,9 @@
> %               if (newvp)
> %                       vput(newvp);
> % -     } else
> % +     } else {
> % +             if (cnp->cn_flags & MAKEENTRY)
> % +                     cache_enter(dvp, newvp, cnp);
> %               *ap->a_vpp = newvp;
> % +     }
> %       return (error);
> %  }
> % Index: nfsnode.h
> % ===================================================================
> % RCS file: /home/ncvs/src/sys/nfsclient/nfsnode.h,v
> % retrieving revision 1.59
> % diff -u -2 -r1.59 nfsnode.h
> % --- nfsnode.h 13 Sep 2006 18:39:09 -0000      1.59
> % +++ nfsnode.h 18 Oct 2006 00:48:44 -0000
> % @@ -100,4 +100,5 @@
> %       struct timespec         n_mtime;        /* Prev modify time. */
> %       time_t                  n_ctime;        /* Prev create time. */
> % +     struct timespec         n_nctime;       /* Last neg cache entry (dir) */
> %       time_t                  n_expiry;       /* Lease expiry time */
> %       nfsfh_t                 *n_fhp;         /* NFS File Handle */

This looks reasonable, but there are always tricky corner cases for
this kind of change.  The reason many of these changes haven't been
made sooner is because they usually break something else.

Now that you've verified the positive performance impact of this
change, you should start by testing this with the Connectathon test
suite (both NFSv2 and NFSv3).  Another useful test in this case is to
observe how the client behaves in the face of replaced file objects.
Use another client to remove and recreate a file that your test client
already has cached, or in this case create first after your client has
already cached the negative lookup.  Another interesting test case is
to enable READDIRPLUS on NFSv3 mounts to see if there is any strange
interaction with your patch.
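
A minimal sketch of the negative-lookup half of that test, in C.  The
helper name lookup_is_negative() is made up for illustration; it only
wraps stat(2) and reports whether the name currently resolves.  The idea
would be to call it on the test client so that the miss gets cached,
create the file from a second client, then call it again and see whether
the test client still answers ENOENT.

```c
#include <sys/stat.h>

#include <assert.h>
#include <errno.h>
#include <stddef.h>

/*
 * Hypothetical helper (not part of Connectathon or any other suite):
 * returns 1 if looking up path fails with ENOENT, 0 if the name
 * exists, and -1 on any other error.
 */
static int
lookup_is_negative(const char *path)
{
	struct stat sb;

	assert(path != NULL);
	if (stat(path, &sb) == 0)
		return (0);
	return (errno == ENOENT ? 1 : -1);
}
```

A result of 1 on the second call, after the other client has created
the file, would indicate that a stale negative entry is being used.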

Additionally, I would hold off on putting this in 6.2.  Try it in
CURRENT for a while, then MFC it after 6.3 branches.

Have you reviewed the NetBSD change logs for any fixes in this area since 1997?

> Now another pessimization is more obvious -- why should building kernels
> be doing all those Fsstat's?  I haven't located them for sure, but
> that opendir always calls statfs(2) for the silly purpose of determining
> whether it needs to do extra work to support unionfs.

I've also noticed that behavior.  Certainly the FreeBSD client can do
a better job of caching the results of Fsstat.

-- 
"We who cut mere stones must always be envisioning cathedrals"
   -- Quarry worker's creed

From owner-freebsd-fs@FreeBSD.ORG  Thu Oct 19 03:10:40 2006
Return-Path: <owner-freebsd-fs@FreeBSD.ORG>
X-Original-To: fs@freebsd.org
Delivered-To: freebsd-fs@FreeBSD.ORG
Received: from mx1.FreeBSD.org (mx1.freebsd.org [216.136.204.125])
	by hub.freebsd.org (Postfix) with ESMTP id B0D8E16A40F
	for <fs@freebsd.org>; Thu, 19 Oct 2006 03:10:40 +0000 (UTC)
	(envelope-from email2mre-bsdfs@yahoo.com)
Received: from web31807.mail.mud.yahoo.com (web31807.mail.mud.yahoo.com
	[68.142.207.70]) by mx1.FreeBSD.org (Postfix) with SMTP id 4C0DB43D55
	for <fs@freebsd.org>; Thu, 19 Oct 2006 03:10:40 +0000 (GMT)
	(envelope-from email2mre-bsdfs@yahoo.com)
Received: (qmail 84530 invoked by uid 60001); 19 Oct 2006 03:10:39 -0000
DomainKey-Signature: a=rsa-sha1; q=dns; c=nofws; s=s1024; d=yahoo.com;
	h=Message-ID:Received:Date:From:Reply-To:Subject:To:MIME-Version:Content-Type:Content-Transfer-Encoding;
	b=4DyWHvWQnJ1w6RB4GLOR/IoFBzySnWuZt0L0DbbkWiKg6cGaRD9JqWMkBgUMm9bUxUHvswFUJvr0zyZRP0vDmAHcdksq/E3fMhjUMWx/cv41iAQ/ISBiXZBTyCjHBIs0rms3CFFvsgMGhqnIJY2Z9SGjqr+DXglLwx1JTuNPja4=
	; 
Message-ID: <20061019031039.84528.qmail@web31807.mail.mud.yahoo.com>
Received: from [216.240.30.14] by web31807.mail.mud.yahoo.com via HTTP;
	Wed, 18 Oct 2006 20:10:39 PDT
Date: Wed, 18 Oct 2006 20:10:39 -0700 (PDT)
From: Mike Eisler <email2mre-bsdfs@yahoo.com>
To: fs@freebsd.org
MIME-Version: 1.0
Content-Type: text/plain; charset=iso-8859-1
Content-Transfer-Encoding: 8bit
Cc: 
Subject: Re: lost dotdot caching pessimizes nfs especially
X-BeenThere: freebsd-fs@freebsd.org
X-Mailman-Version: 2.1.5
Precedence: list
Reply-To: email2mre-bsdfs@yahoo.com
List-Id: Filesystems <freebsd-fs.freebsd.org>
List-Unsubscribe: <http://lists.freebsd.org/mailman/listinfo/freebsd-fs>,
	<mailto:freebsd-fs-request@freebsd.org?subject=unsubscribe>
List-Archive: <http://lists.freebsd.org/pipermail/freebsd-fs>
List-Post: <mailto:freebsd-fs@freebsd.org>
List-Help: <mailto:freebsd-fs-request@freebsd.org?subject=help>
List-Subscribe: <http://lists.freebsd.org/mailman/listinfo/freebsd-fs>,
	<mailto:freebsd-fs-request@freebsd.org?subject=subscribe>
X-List-Received-Date: Thu, 19 Oct 2006 03:10:40 -0000


> Date: Mon, 16 Oct 2006 11:32:36 -0400 (EDT)
> From: rick@snowhite.cis.uoguelph.ca
> To: fs@freebsd.org
> Subject: Re: lost dotdot caching pessimizes nfs especially
> CC: mohan_srinivasan@yahoo.com

> the lines of my experimental NQNFS might have happenned. NFSv4 simply
> says that clients that care about cache consistency should use byte

While RFC 3530 does not use the term "close-to-open consistency"
(something that I'll address in the NFSv4.1 protocol), it does say:

   Furthermore, in the absence of open delegation (see the section
   "Open Delegation") two additional rules apply.  Note that these
   rules are obeyed in practice by many NFS version 2 and version 3
   clients.

   o  First, cached data present on a client must be revalidated after
      doing an OPEN.  Revalidating means that the client fetches the
      change attribute from the server, compares it with the cached
      change attribute, and if different, declares the cached data (as
      well as the cached attributes) as invalid.  This is to ensure that
      the data for the OPENed file is still correctly reflected in the
      client's cache.  This validation must be done at least when the
      client's OPEN operation includes DENY=WRITE or BOTH thus
      terminating a period in which other clients may have had the
      opportunity to open the file with WRITE access.  Clients may
      choose to do the revalidation more often (i.e., at OPENs
      specifying DENY=NONE) to parallel the NFS version 3 protocol's
      practice for the benefit of users assuming this degree of cache
      revalidation.

      Since the change attribute is updated for data and metadata
      modifications, some client implementors may be tempted to use the
      time_modify attribute and not change to validate cached data, so
      that metadata changes do not spuriously invalidate clean data.
      The implementor is cautioned in this approach.  The change
      attribute is guaranteed to change for each update to the file,
      whereas time_modify is guaranteed to change only at the
      granularity of the time_delta attribute.  Use by the client's data
      cache validation logic of time_modify and not change runs the risk
      of the client incorrectly marking stale data as valid.

   o  Second, modified data must be flushed to the server before closing
      a file OPENed for write.  This is complementary to the first rule.
      If the data is not flushed at CLOSE, the revalidation done after
      client OPENs as file is unable to achieve its purpose.
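
The caution about time_modify in the first rule can be illustrated with
a toy model in C.  This is only a sketch: the struct and function names
are invented here, and time_delta is assumed to be one second, so two
updates within the same second leave time_modify unchanged while the
change attribute still advances.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical client-side cache state for one file. */
struct cached_file {
	uint64_t change;	/* cached change attribute */
	int64_t	 mtime_sec;	/* cached time_modify, 1 s granularity */
	bool	 data_valid;	/* cached data believed current */
};

/* Hypothetical attributes fetched from the server at OPEN. */
struct server_attrs {
	uint64_t change;
	int64_t	 mtime_sec;
};

/* Revalidate by comparing the change attribute. */
static void
revalidate_by_change(struct cached_file *cf, const struct server_attrs *sa)
{
	assert(cf != NULL && sa != NULL);
	if (cf->change != sa->change)
		cf->data_valid = false;
	cf->change = sa->change;
	cf->mtime_sec = sa->mtime_sec;
}

/* Revalidate by comparing time_modify only (the approach warned about). */
static void
revalidate_by_mtime(struct cached_file *cf, const struct server_attrs *sa)
{
	assert(cf != NULL && sa != NULL);
	if (cf->mtime_sec != sa->mtime_sec)
		cf->data_valid = false;
	cf->change = sa->change;
	cf->mtime_sec = sa->mtime_sec;
}
```

If the server's change attribute has moved from 7 to 8 while time_modify
still reads the same second, revalidate_by_change() drops the cached
data but revalidate_by_mtime() keeps it, which is the stale-data-marked-
valid case the RFC text describes.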


From owner-freebsd-fs@FreeBSD.ORG  Thu Oct 19 11:10:10 2006
Return-Path: <owner-freebsd-fs@FreeBSD.ORG>
X-Original-To: fs@freebsd.org
Delivered-To: freebsd-fs@FreeBSD.ORG
Received: from mx1.FreeBSD.org (mx1.freebsd.org [216.136.204.125])
	by hub.freebsd.org (Postfix) with ESMTP id 1397F16A40F;
	Thu, 19 Oct 2006 11:10:10 +0000 (UTC) (envelope-from bde@zeta.org.au)
Received: from mailout1.pacific.net.au (mailout1-3.pacific.net.au [61.8.2.210])
	by mx1.FreeBSD.org (Postfix) with ESMTP id 3798843D58;
	Thu, 19 Oct 2006 11:10:09 +0000 (GMT) (envelope-from bde@zeta.org.au)
Received: from mailproxy1.pacific.net.au (mailproxy1.pacific.net.au
	[61.8.2.162])
	by mailout1.pacific.net.au (Postfix) with ESMTP id 7516D5AFFB2;
	Thu, 19 Oct 2006 21:10:07 +1000 (EST)
Received: from katana.zip.com.au (katana.zip.com.au [61.8.7.246])
	by mailproxy1.pacific.net.au (Postfix) with ESMTP id 98D938C08;
	Thu, 19 Oct 2006 21:10:04 +1000 (EST)
Date: Thu, 19 Oct 2006 21:10:03 +1000 (EST)
From: Bruce Evans <bde@zeta.org.au>
X-X-Sender: bde@delplex.bde.org
To: Chuck Lever <chucklever@gmail.com>
In-Reply-To: <76bd70e30610180920m1918e84s8e1c3b0f02de712e@mail.gmail.com>
Message-ID: <20061019193110.I77123@delplex.bde.org>
References: <200610140725.k9E7PC37008454@repoman.freebsd.org> 
	<20061015051044.GA42764@xor.obsecurity.org>
	<20061014222221.H97880@ns1.feral.com>
	<20061014222437.N4701@ns1.feral.com>
	<20061015153454.G59979@delplex.bde.org>
	<76bd70e30610150837w61689cf6ya2499d100a15c3e8@mail.gmail.com> 
	<20061016164122.S63585@delplex.bde.org>
	<76bd70e30610160620x67e5d3a5j938c26744d0b9759@mail.gmail.com>
	<20061017113943.C67620@delplex.bde.org>
	<20061018153336.E72684@delplex.bde.org>
	<76bd70e30610180920m1918e84s8e1c3b0f02de712e@mail.gmail.com>
MIME-Version: 1.0
Content-Type: TEXT/PLAIN; charset=US-ASCII; format=flowed
Cc: mjacob@freebsd.org, fs@freebsd.org, Kris Kennaway <kris@obsecurity.org>
Subject: Re: negative cache hits for nfs (was: cvs commit: ...)
X-BeenThere: freebsd-fs@freebsd.org
X-Mailman-Version: 2.1.5
Precedence: list
List-Id: Filesystems <freebsd-fs.freebsd.org>
List-Unsubscribe: <http://lists.freebsd.org/mailman/listinfo/freebsd-fs>,
	<mailto:freebsd-fs-request@freebsd.org?subject=unsubscribe>
List-Archive: <http://lists.freebsd.org/pipermail/freebsd-fs>
List-Post: <mailto:freebsd-fs@freebsd.org>
List-Help: <mailto:freebsd-fs-request@freebsd.org?subject=help>
List-Subscribe: <http://lists.freebsd.org/mailman/listinfo/freebsd-fs>,
	<mailto:freebsd-fs-request@freebsd.org?subject=subscribe>
X-List-Received-Date: Thu, 19 Oct 2006 11:10:10 -0000

On Wed, 18 Oct 2006, Chuck Lever wrote:

> On 10/18/06, Bruce Evans <bde@zeta.org.au> wrote:
>> Here is a merge of some bits from NetBSD for review.  It is mostly the
>> 1997 version, with updates to use timespecs instead of time_t's, but
>> not updates to use changes that don't seem to be related to correctness,
>> or ones less than 18 months old (if any).

>> ...
>> % +             if (error == ENOENT) {
>> % +                     /* Negative cache hit.  Use it unless stale. */
>> % +                     if (VOP_GETATTR(dvp, &vattr, cnp->cn_cred, td) == 0 &&
>> % +                         timespeccmp(&vattr.va_mtime, &np->n_nctime, ==))
>> % +                             return (ENOENT);

> Now that you've verified the positive performance impact of this
> change, you should start by testing this with the Connectathon test
> suite (both NFSv2 and NFSv3).  Another useful test in this case is to
> observe how the client behaves in the face of replaced file objects.

Alas, I thought of a large problem without running any tests, ...

> Use another client to remove and recreate a file that your test client
> already has cached, or in this case create first after your client has
> already cached the negative lookup.  Another interesting test case is

... and before reading this carefully.  The directory timestamp (written
by VOP_GETATTR() into vattr.va_mtime in the above) is normally cached.
Thus it can be stale, and then the negative cache entry can be stale
too but gets used.  With the default min dir. attr timeout of 30 seconds
(BTW, why is this much larger than the min of 3 seconds for non-dirs?),
the problem is easy to see using shell commands typed not very quickly.
I can't see how to fix this without an RPC to refresh the directory
attributes, and this defeats the point of reducing RPCs.  So the
negative cache optimization works best when combined with the no-ctoc
optimization -- then consistency for negative hits is no worse than
for positive ones.
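
The staleness window can be seen in a toy model, sketched in C below.
None of this is kernel code: struct dir_model, model_getattr() and
negative_hit_trusted() are names invented for the illustration, and
ACDIRMIN_MODEL just stands in for the 30 second minimum mentioned above.
The model's getattr answers from the cached directory mtime until the
timeout expires, so a negative entry keeps being trusted after another
client has changed the directory.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <time.h>

#define ACDIRMIN_MODEL	30	/* assumed min dir attr timeout, seconds */

/* Hypothetical per-directory state; not the real struct nfsnode. */
struct dir_model {
	struct timespec server_mtime;	/* mtime on the server right now */
	struct timespec cached_mtime;	/* mtime in the attribute cache */
	time_t		cached_at;	/* when cached_mtime was fetched */
	struct timespec nctime;		/* mtime when neg. entry was made */
};

static bool
ts_equal(const struct timespec *a, const struct timespec *b)
{
	return (a->tv_sec == b->tv_sec && a->tv_nsec == b->tv_nsec);
}

/* Answer from the cache unless it has timed out; count refetches. */
static struct timespec
model_getattr(struct dir_model *d, time_t now, int *rpcs)
{
	if (now - d->cached_at >= ACDIRMIN_MODEL) {
		d->cached_mtime = d->server_mtime;
		d->cached_at = now;
		(*rpcs)++;
	}
	return (d->cached_mtime);
}

/* True if a negative cache hit would be returned as ENOENT. */
static bool
negative_hit_trusted(struct dir_model *d, time_t now, int *rpcs)
{
	struct timespec mtime;

	assert(d != NULL && rpcs != NULL);
	mtime = model_getattr(d, now, rpcs);
	return (ts_equal(&mtime, &d->nctime));
}
```

With the attributes cached at t=1000 and the directory changed on the
server at t=1005, a lookup at t=1010 still trusts the negative entry
and issues no RPC; only a lookup after t=1030 refetches and notices.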

Maybe a refresh for only the second-last component of the path (and
only if the lookup is for open) would be good enough.
For positive dircache hits, directory attributes are normally never
refreshed until the timeout, so everything above the last component
of the path can change without us noticing and it is only the forced
refresh for the last component on open that gives us a chance to see
that the file has gone, moved or changed.  For negative dircache hits,
we want similar semantics for opens but don't have a vp for the last
component to refresh.  Second-lastness is harder to determine.

> Have you reviewed the NetBSD change logs for any fixes in this area
> since 1997?

Some:
- use a timespec instead of a time_t for the timestamp, since we already
   do that for related timestamps
- don't get the comparison logic for the timespec backwards for 8 days
- not merged: some distributed changes involving updating timestamps more
   often since those seem to be only optimizations and I'm not sure if they
   are safe.

I now have some code for moving the attribute cache flush for
close-to-open-consistency to the start of open() and removing it
elsewhere.  This was simple except for debugging and safety-belt
parts, since there are already suitable flags for namei().  It
interacts a bit with negative caching so I'll discuss it in the
same thread now.

% Index: nfs_vnops.c
% ===================================================================
% RCS file: /home/ncvs/src/sys/nfsclient/nfs_vnops.c,v
% retrieving revision 1.270
% diff -u -2 -r1.270 nfs_vnops.c
% --- nfs_vnops.c	14 Oct 2006 07:25:11 -0000	1.270
% +++ nfs_vnops.c	19 Oct 2006 00:03:23 -0000
% @@ -1,2 +1,4 @@
% +int nfs_ctoc = 1;		/* Bogus global enable for cto consistency. */
% +
%  /*-
%   * Copyright (c) 1989, 1993
% @@ -462,5 +464,7 @@
%  		if (error == EINTR || error == EIO)
%  			return (error);
% +		ASSERT_VOP_ELOCKED(vp, "nfs_open");
%  		np->n_attrstamp = 0;
% +		np->n_attrstampco = FALSE;
%  		if (vp->v_type == VDIR)
%  			np->n_direofoffset = 0;

Locking in nfs is dubious.  We mostly hold the exclusive vnode lock and
shouldn't need much mutex locking, but it now has a lot.  The ex.vn.lock
is enough for my new flag, and the above code already holds the mutex
for np so n_attrstamp has more than enough locking.

% @@ -472,5 +476,23 @@
%  		mtx_unlock(&np->n_mtx);
%  	} else {
% -		np->n_attrstamp = 0;
% +		ASSERT_VOP_ELOCKED(vp, "nfs_open");
% +		if (nfs_ctoc && !np->n_attrstampco) {
% +			/*
% +			 * XXX this should not be reached.  It defends
% +			 * against callers of namei() neglecting to set
% +			 * ISOPEN for _all_ opens and against open()
% +			 * dropping the exclusive vnode lock between
% +			 * nfs_lookup() and here.  Note that there can
% +			 * be an nfs_create() call between nfs_open()
% +			 * and here, and that the main benefit of
% +			 * ensuring the attribute refresh earlier than
% +			 * here is for the creation case.  If the lock
% +			 * does get dropped then we should use a timeout
% +			 * here.
% +			 */
% +			printf("nfs_open: missing attrcache refresh\n");
% +			np->n_attrstamp = 0;
% +		}
% +		np->n_attrstampco = FALSE;
%  		mtx_unlock(&np->n_mtx); 
%  		error = VOP_GETATTR(vp, &vattr, ap->a_cred, ap->a_td);

This hasn't been reached lately but took a long time to get right.  When
it is reached, it is equivalent to the old code except for its printf().
It refreshes the attributes just after they have been used, but since
we refetch here there are only minor problems.

% @@ -582,11 +604,4 @@
%  		mtx_lock(&np->n_mtx);
%  	    }
% - 	    /* 
% - 	     * Invalidate the attribute cache in all cases.
% - 	     * An open is going to fetch fresh attrs any way, other procs
% - 	     * on this node that have file open will be forced to do an 
% - 	     * otw attr fetch, but this is safe.
% - 	     */
% -	    np->n_attrstamp = 0;
%  	    if (np->n_flag & NWRITEERR) {
%  		np->n_flag &= ~NWRITEERR;

I think this can be safely removed without any other changes.  We need
to refresh on open(), and scheduling this refresh on close() doesn't
really help.  It doesn't do anything useful if it is on a different
client to the open(), and on the same client it usually just gives an
unnecessary RPC.

% @@ -852,8 +867,24 @@
%  		return (error);
%  	}
% -	if ((error = cache_lookup(dvp, vpp, cnp)) && error != ENOENT) {
% +	if ((error = cache_lookup(dvp, vpp, cnp)) != 0) {
%  		struct vattr vattr;
% 
% +		if (error == ENOENT) {
% +			/* Negative cache hit.  Use it unless stale. */
% +			if (VOP_GETATTR(dvp, &vattr, cnp->cn_cred, td) == 0 &&
% +			    timespeccmp(&vattr.va_mtime, &np->n_nctime, ==))
% +				return (ENOENT);
% +
% +			cache_purge(dvp);
% +			timespecclear(&np->n_nctime);
% +			goto dorpc;
% +		}

Negative cache hit code, same as before.

%  		newvp = *vpp;
% +		if (nfs_ctoc &&
% +		    (flags & (ISLASTCN | ISOPEN)) == (ISLASTCN | ISOPEN)) {
% +			ASSERT_VOP_ELOCKED(vp, "nfs_lookup");
% +			VTONFS(newvp)->n_attrstamp = 0;
% +			VTONFS(newvp)->n_attrstampco = TRUE;
% +		}
%  		if (!VOP_GETATTR(newvp, &vattr, cnp->cn_cred, td)
%  		 && vattr.va_ctime.tv_sec == VTONFS(newvp)->n_ctime) {

This is for a positive vfs cache hit.  Lookups for open() are easily
detected.  I didn't bother locking np for the access to n_attrstamp.
n_attrstampco records that we refreshed the attributes so nfs_open()
shouldn't do it again.  This is mainly a safety belt.  vn_open_cred()
seems to be exclusively locked almost throughout, so we can safely
pass n_attrstampco to nfs_open() like this, but with exclusive locking
we can almost guarantee that nfs_open() doesn't need to refresh.
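
The handshake between the lookup and the open can be modelled outside
the kernel as below.  This is a sketch only: struct node_model and the
M_ISLASTCN/M_ISOPEN values are stand-ins invented here, not the real
nfsnode or namei flags, and locking is left out entirely.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

#define M_ISLASTCN	0x1	/* hypothetical stand-in for ISLASTCN */
#define M_ISOPEN	0x2	/* hypothetical stand-in for ISOPEN */

/* Hypothetical subset of the per-node state used by the patch. */
struct node_model {
	long	attrstamp;	/* 0 forces an attribute refetch */
	bool	attrstampco;	/* lookup already cleared it for open */
};

/* Lookup: invalidate the attributes only for the last component of
 * a lookup done on behalf of open(). */
static void
model_lookup(struct node_model *np, int flags)
{
	assert(np != NULL);
	if ((flags & (M_ISLASTCN | M_ISOPEN)) == (M_ISLASTCN | M_ISOPEN)) {
		np->attrstamp = 0;
		np->attrstampco = true;
	}
}

/* Open: returns true if it had to fall back to invalidating the
 * attributes itself (the "missing attrcache refresh" path). */
static bool
model_open(struct node_model *np)
{
	bool fallback = false;

	assert(np != NULL);
	if (!np->attrstampco) {
		np->attrstamp = 0;
		fallback = true;
	}
	np->attrstampco = false;
	return (fallback);
}
```

When the lookup carries both flags, model_open() finds the flag set and
does nothing extra; when a caller forgets ISOPEN, the fallback path
fires, which corresponds to the printf() case in the patch.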

% @@ -871,4 +902,5 @@
%  		*vpp = NULLVP;
%  	}
% +dorpc:
%  	error = 0;
%  	newvp = NULLVP;

Negative cache hit code, same as before.

% @@ -935,4 +967,10 @@
%  		newvp = NFSTOV(np);
%  	}
% +	if (nfs_ctoc &&
% +	    (flags & (ISLASTCN | ISOPEN)) == (ISLASTCN | ISOPEN)) {
% +		ASSERT_VOP_ELOCKED(vp, "nfs_lookup");
% +		VTONFS(newvp)->n_attrstamp = 0;
% +		VTONFS(newvp)->n_attrstampco = TRUE;
% +	}
%  	if (v3) {
%  		nfsm_postop_attr(newvp, attrflag);

This is for a remote lookup.  I think the attributes haven't been fetched
yet so clearing n_attrstamp is not needed.

% @@ -951,4 +989,11 @@
%  nfsmout:
%  	if (error) {
% +		if (error == ENOENT && (flags & MAKEENTRY) &&
% +		    cnp->cn_nameiop != CREATE) {
% +			/* Negative cache entry. */
% +			if (!timespecisset(&np->n_nctime))
% +				np->n_nctime = np->n_vattr.va_mtime;
% +			cache_enter(dvp, NULL, cnp);
% +		}
%  		if (newvp != NULLVP) {
%  			vput(newvp);

Negative cache hit code, same as before.

% @@ -1440,4 +1485,12 @@
%  			cache_enter(dvp, newvp, cnp);
%  		*ap->a_vpp = newvp;
% +		if (nfs_ctoc) {
% +			/*
% +			 * XXX this assumes that we will be followed by
% +			 * nfs_open() immediately.
% +			 */
% +			ASSERT_VOP_ELOCKED(vp, "nfs_create");
% +			VTONFS(newvp)->n_attrstampco = TRUE;
% +		}
%  	}
%  	mtx_lock(&(VTONFS(dvp))->n_mtx);

This is in nfs_create().  Creation is interesting.  nfs_lookup() fails,
and if this was due to a negative cache hit the directory timestamp
had better be right.  Then vn_open_cred() calls here.  We don't have
a vp until near here so we can't set n_attrstampco until here.
Optimization of creation is the main possible optimization for the
ctoc case.  When we have just created the file, it is silly for
nfs_open() to force an attribute refresh a few microseconds later.  (I
haven't checked whether the file's cached attributes are ones we just
wrote or ones we just refreshed after creation, and hope that whatever
they are at this point is good enough.)  This optimization reduces the
number of Access RPCs for my kernel build benchmark by about 10% (from
~24000 to ~22000).  Many of the negative cache hits for this benchmark
are probably caused by creations, since the benchmark starts with the
"make clean; ...; make depend" so the "make depend" part creates lots
of negative cache hits for the files just removed.  However, an incorrect
negative cache hit seems most dangerous when the file is being created,
so parts of the optimization are invalid -- the current and previous
implementations of ctoc only refresh the attributes after possibly
clobbering the old ones.

% @@ -1931,6 +1984,9 @@
%  		if (newvp)
%  			vput(newvp);
% -	} else
% +	} else {
% +		if (cnp->cn_flags & MAKEENTRY)
% +			cache_enter(dvp, newvp, cnp);
%  		*ap->a_vpp = newvp;
% +	}
%  	return (error);
%  }

Part of negative cache hit code, same as before, but it is for future
positive cache hits (for the directory created by mkdir()).

% Index: nfsnode.h
% ===================================================================
% RCS file: /home/ncvs/src/sys/nfsclient/nfsnode.h,v
% retrieving revision 1.59
% diff -u -2 -r1.59 nfsnode.h
% --- nfsnode.h	13 Sep 2006 18:39:09 -0000	1.59
% +++ nfsnode.h	18 Oct 2006 10:09:19 -0000
% @@ -95,4 +95,5 @@
%  	struct vattr		n_vattr;	/* Vnode attribute cache */
%  	time_t			n_attrstamp;	/* Attr. cache timestamp */
% +	int			n_attrstampco;	/* n_attrs. cleared on open */
%  	u_int32_t		n_mode;		/* ACCESS mode cache */
%  	uid_t			n_modeuid;	/* credentials having mode */
% @@ -100,4 +101,5 @@
%  	struct timespec		n_mtime;	/* Prev modify time. */
%  	time_t			n_ctime;	/* Prev create time. */
% +	struct timespec		n_nctime;	/* Last neg cache entry (dir) */
%  	time_t			n_expiry;	/* Lease expiry time */
%  	nfsfh_t			*n_fhp;		/* NFS File Handle */

Bruce

From owner-freebsd-fs@FreeBSD.ORG  Thu Oct 19 15:28:19 2006
Return-Path: <owner-freebsd-fs@FreeBSD.ORG>
X-Original-To: fs@freebsd.org
Delivered-To: freebsd-fs@FreeBSD.ORG
Received: from mx1.FreeBSD.org (mx1.freebsd.org [216.136.204.125])
	by hub.freebsd.org (Postfix) with ESMTP id 8882E16A403
	for <fs@freebsd.org>; Thu, 19 Oct 2006 15:28:19 +0000 (UTC)
	(envelope-from rick@snowhite.cis.uoguelph.ca)
Received: from dargo.cs.uoguelph.ca (dargo.cs.uoguelph.ca [131.104.94.197])
	by mx1.FreeBSD.org (Postfix) with ESMTP id 887B843D66
	for <fs@freebsd.org>; Thu, 19 Oct 2006 15:28:10 +0000 (GMT)
	(envelope-from rick@snowhite.cis.uoguelph.ca)
Received: from snowhite.cis.uoguelph.ca (snowhite.cis.uoguelph.ca
	[131.104.48.1])
	by dargo.cs.uoguelph.ca (8.13.1/8.13.1) with ESMTP id k9JFS8Sq014581;
	Thu, 19 Oct 2006 11:28:08 -0400
Received: (from rick@localhost)
	by snowhite.cis.uoguelph.ca (8.9.3/8.9.3) id LAA03929;
	Thu, 19 Oct 2006 11:28:20 -0400 (EDT)
Date: Thu, 19 Oct 2006 11:28:20 -0400 (EDT)
From: rick@snowhite.cis.uoguelph.ca
Message-Id: <200610191528.LAA03929@snowhite.cis.uoguelph.ca>
To: fs@freebsd.org
X-Scanned-By: MIMEDefang 2.57 on 131.104.94.197
Cc: 
Subject: Re: lost dotdot caching pessimizes nfs especially
X-BeenThere: freebsd-fs@freebsd.org
X-Mailman-Version: 2.1.5
Precedence: list
List-Id: Filesystems <freebsd-fs.freebsd.org>
List-Unsubscribe: <http://lists.freebsd.org/mailman/listinfo/freebsd-fs>,
	<mailto:freebsd-fs-request@freebsd.org?subject=unsubscribe>
List-Archive: <http://lists.freebsd.org/pipermail/freebsd-fs>
List-Post: <mailto:freebsd-fs@freebsd.org>
List-Help: <mailto:freebsd-fs-request@freebsd.org?subject=help>
List-Subscribe: <http://lists.freebsd.org/mailman/listinfo/freebsd-fs>,
	<mailto:freebsd-fs-request@freebsd.org?subject=subscribe>
X-List-Received-Date: Thu, 19 Oct 2006 15:28:19 -0000

[Mike wrote]
> While RFC 3530 does not use the term "close-to-open consistency"
> (something that I'll address in the NFSv4.1 protocol), it does say:
> 
>    Furthermore, in the absence of open delegation (see the section
>    "Open Delegation") two additional rules apply.  Note that these
>    rules are obeyed in practice by many NFS version 2 and version 3
>    clients.
>
>    o  First, cached data present on a client must be revalidated after
>       doing an OPEN.  Revalidating means that the client fetches the
>       change attribute from the server, compares it with the cached
>       change attribute, and if different, declares the cached data (as
>       well as the cached attributes) as invalid.  This is to ensure that
>       the data for the OPENed file is still correctly reflected in the
>       client's cache.  This validation must be done at least when the
>       client's OPEN operation includes DENY=WRITE or BOTH thus
>       terminating a period in which other clients may have had the
>       opportunity to open the file with WRITE access.  Clients may
>       choose to do the revalidation more often (i.e., at OPENs
>       specifying DENY=NONE) to parallel the NFS version 3 protocol's
>       practice for the benefit of users assuming this degree of cache
>       revalidation.
>
>       Since the change attribute is updated for data and metadata
>       modifications, some client implementors may be tempted to use the
>       time_modify attribute and not change to validate cached data, so
>       that metadata changes do not spuriously invalidate clean data.
>       The implementor is cautioned in this approach.  The change
>       attribute is guaranteed to change for each update to the file,
>       whereas time_modify is guaranteed to change only at the
>       granularity of the time_delta attribute.  Use by the client's data
>       cache validation logic of time_modify and not change runs the risk
>       of the client incorrectly marking stale data as valid.
>
>    o  Second, modified data must be flushed to the server before closing
>       a file OPENed for write.  This is complementary to the first rule.
>       If the data is not flushed at CLOSE, the revalidation done after
>       client OPENs as file is unable to achieve its purpose.

Oops, yes, this is in Sec. 9.3. I didn't spot it when I was emailing a few
days ago, but do have code for it in my v4 client (so I must have read it
at some point:-). (fyi, the RFC is only 275 pages)

I'd argue that it contradicts the section I quoted in the last message I
posted, but does spell it out in enough detail that it can't be misinterpreted,
whereas the sections I quoted were kinda vague.

I'll simply re-iterate that this isn't a problem if clients are using
delegations efficiently, for v4, that is. (At this point, none of the v4
clients I am aware of are using delegations to avoid writes on close and
checking against the server upon open, but hopefully it's being worked on.
Since all the interest is in pNFS these days, I'm not so sure, but...)

rick