From: Terry Lambert <tlambert2@mindspring.com>
To: Andrey Alekseyev
Cc: freebsd-hackers@freebsd.org
Subject: Re: open() and ESTALE error
Date: Fri, 20 Jun 2003 02:03:44 -0700

Andrey Alekseyev wrote:
> Terry,
>
> Thanks much for your comments, but see below.
>
> > The real problem here is that you know you did an operation
> > on the file which would break the name/nfsnode relationship,
> > but did not flush the cached name and nfsnode data.
>
> nfs_request() actually calls cache_purge() on ESTALE, and vn_open()
> frees the vnode with vput() if the lookup was successful but there was
> an error from the underlying filesystem (like ESTALE resulting from
> nfs_request(), which is eventually called from VOP_ACCESS or VOP_OPEN).

The place to correct this is probably the underlying FS.  I'd argue
that getting ESTALE is a poke with a sharp stick that makes this more
likely to happen.  ;^).

> > A more correct solution would resync the nfsnode.
>
> I think this is exactly what happens :)  Actually, I believe I'm just
> getting another namecache entry with another vnode/nfsnode/file handle.

You can't have this for other reasons; specifically, if you have the
file open at the time of the rename, and it becomes a ".#nfs..." file
(or whatever) on the server.

> > The main problem with your solution is that it doesn't work
> > in the case that you don't know the name of the remote file
> > (in which case, all you really have is a stale file handle,
> > with no way to unstale it).
>
> I think, in this case (if the file was rm'd on the server), I'll just
> get ENOENT from the second vn_open() attempt, which would be more
> than appropriate.  A real drawback is that for a stale "current"
> directory it'll take another lookup to detect "true" ESTALE.

This is more a problem with the ESTALE handling.  In the case where
you are doing a lookup and get an ESTALE, it's probably correct to
translate it based on the semantics you are expecting in the upper
layer.

The problem here is that a given VOP can be called from multiple
system call implementations, and a given system call implementation
can call multiple VOPs to implement its functionality.  This means
that you'd have to model the system call layer state machine within
the filesystem itself in order to return the "expected" error for
every possible case.  This isn't a reasonable thing to expect.
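For illustration only, here is a minimal userland sketch of what
"retry the open on ESTALE" amounts to from an application's point of
view; the helper name and the retry limit are made up, and this is
not the kernel change under discussion:

#include <errno.h>
#include <fcntl.h>

#define	OPEN_ESTALE_RETRIES	1	/* hypothetical retry limit */

int
open_retry_estale(const char *path, int flags, int mode)
{
	int fd, tries;

	for (tries = 0; ; tries++) {
		fd = open(path, flags, mode);
		if (fd >= 0 || errno != ESTALE ||
		    tries >= OPEN_ESTALE_RETRIES)
			return (fd);
		/*
		 * ESTALE: a cached handle for the file (or for a
		 * directory in the path) went stale behind our back.
		 * A fresh lookup by name may now succeed, or fail
		 * with ENOENT if the file really is gone on the
		 * server.
		 */
	}
}

The point being that the retry is a by-name re-lookup: it can only
help when the name is still known and still resolves to something on
the server.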
> > This would fix a lot more cases than the single failure you
> > are fixing.
>
> Actually, as I said, I played with different parts of the code to solve
> this (including nfs_open(), nfs_access(), nfs_lookup() and vn_open()),
> only to find the previously mentioned solution to be the simplest and
> most suitable for all situations (for me!) :)

Don Lewis has a good posting in response to you; you will likely have
read it before you read this response, so feel free to not respond
directly to this point.

Don points out that Solaris tries to fix this via the "noac" mount
option for client NFS.  What his quote:

     noac      Suppress data and attribute caching.  The data
               caching that is suppressed is the write-behind.
               The local page cache is still maintained, but data
               copied into it is immediately written to the server.

hints at, but doesn't come right out and say, is that the cache is
flushed on write operations ("the data caching that is suppressed is
write-behind").

What this means practically, in terms of the implementation of the
NFS client code, is that everywhere there is a client-triggered change
of state for metadata on the server that could result in an ESTALE,
the client's cached information is flushed out and has to be
reacquired.

If this were happening in the NFS client today, then your rename would
not end up giving you an ESTALE, because the stale data would have
been discarded.

I'd also like to point out the following case:

	{ A, B }
	fd1 open on B
	rename B -> C
	rename A -> B

In this case, the FH in question would still work for B.  What would
happen if it were:

	{ A, B, C }
	fd1 open on B
	fd2 open on C
	rename B -> C
	rename A -> B

?  With your patch, I think we would potentially convert fd2 to point
to B when it really *should* be "ESTALE", which is wrong (think in
terms of 2 or more clients doing the operations).

-- Terry
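P.S.: For concreteness, a throwaway sketch of the second case, run
against a local filesystem (the file names and the final inode check
are only for demonstration).  Locally, fd2 keeps referring to the file
that used to be named C; over NFS the server-side rename removes that
file, so the handle behind fd2 goes stale instead, and rebinding the
descriptor by name would silently attach it to a different file, which
is the case argued above to deserve ESTALE:

#include <sys/stat.h>

#include <fcntl.h>
#include <stdint.h>
#include <stdio.h>
#include <unistd.h>

int
main(void)
{
	struct stat sb_fd1, sb_fd2, sb_c;
	int fd1, fd2;

	/* Set up { A, B, C }. */
	close(open("A", O_CREAT | O_WRONLY, 0644));
	close(open("B", O_CREAT | O_WRONLY, 0644));
	close(open("C", O_CREAT | O_WRONLY, 0644));

	fd1 = open("B", O_RDONLY);
	fd2 = open("C", O_RDONLY);

	rename("B", "C");	/* old C is unlinked; old B is now "C" */
	rename("A", "B");	/* old A is now "B" */

	fstat(fd1, &sb_fd1);	/* still the file that was named B */
	fstat(fd2, &sb_fd2);	/* still the file that was named C */
	stat("C", &sb_c);	/* "C" now names the old B */

	printf("fd1 inode %ju, fd2 inode %ju, name \"C\" inode %ju\n",
	    (uintmax_t)sb_fd1.st_ino, (uintmax_t)sb_fd2.st_ino,
	    (uintmax_t)sb_c.st_ino);
	return (0);
}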