From owner-freebsd-hackers@FreeBSD.ORG Sun Jun 29 18:23:40 2003
Message-Id: <200306300123.h5U1NKM7088025@gw.catspoiler.org>
Date: Sun, 29 Jun 2003 18:23:20 -0700 (PDT)
From: Don Lewis <truckman@FreeBSD.org>
To: uitm@blackflag.ru
Cc: freebsd-hackers@FreeBSD.org
In-Reply-To: <200306221153.PAA00759@slt.oz>
Subject: Re: open() and ESTALE error

On 22 Jun, Andrey Alekseyev wrote:
> Don,
>
>> When a file is renamed on the server, its file handle remains valid.
>
> Actually, I was wrong in my assumption about how names are purged
> from the namecache, and I didn't mean an operation on a file that is
> already open on the client.  It actually happens that this sequence
> of commands gets you ENOENT (not ESTALE) on the client, because of a
> new lookup in #4:
>
> 1. server: echo "1111" > 1
> 2. client: cat 1
> 3. server: mv 1 2
> 4. client: cat 1    <--- ENOENT here

That's what it is supposed to do, but my testing would seem to
indicate that step 4 could keep returning the file contents for an
extended period of time after the file was renamed on the server.

> The name cache can be purged by nfs_lookup() if the latter finds
> that the capability numbers don't match.  In this case, nfs_lookup()
> will send a new "lookup" RPC request to the server.  The name cache
> can also be purged from getnewvnode() and vclean().  Which code does
> that in the above scenario is quite obscure to me.  Yes, my
> knowledge is limited :)

The vpid == newvp->v_id test in nfs_lookup() just detects whether the
vnode that the cache entry pointed to was recycled for another use
while it was on the free list.  It doesn't detect whether the inode
on the server was recycled.

When I was thinking about this problem, the solution I came up with
was a lot like the

	if (!VOP_GETATTR(newvp, &vattr, cnp->cn_cred, td) &&
	    vattr.va_ctime.tv_sec == VTONFS(newvp)->n_ctime)

code fragment, but I would have done the ctime check on both the
target and the parent directory, and only ignored the cache entry if
both ctimes had been updated.  Checking only the target should be
more conservative, though it would be slower because there would be
more cases where the client would have to do the RPC call.

If the file on the server that is associated with the cached entry on
the client is renamed on the server, its file handle will remain
valid, but its ctime will be updated, so VOP_GETATTR() will succeed,
but the ctime check should fire and the cache entry should be purged.
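To make that concrete, here is a small userland sketch (my
illustration, not code from the tree; the file name is made up) of
the property the ctime check relies on: a rename updates the renamed
file's ctime, while a descriptor opened before the rename, the moral
equivalent of the client's file handle, keeps working.

#include <err.h>
#include <fcntl.h>
#include <stdio.h>
#include <sys/stat.h>
#include <unistd.h>

int
main(void)
{
	struct stat before, after;
	int fd;

	fd = open("ctime_demo", O_CREAT | O_TRUNC | O_RDWR, 0644);
	if (fd == -1 || fstat(fd, &before) == -1)
		err(1, "ctime_demo");
	sleep(1);		/* step past one-second ctime granularity */
	if (rename("ctime_demo", "ctime_demo.moved") == -1)
		err(1, "rename");
	if (fstat(fd, &after) == -1)	/* the old descriptor still works */
		err(1, "fstat");
	printf("ctime %s across the rename\n",
	    after.st_ctime == before.st_ctime ? "unchanged" : "updated");
	return (0);
}

On FreeBSD (and most Unix file systems) this prints "ctime updated
across the rename", which is exactly the signal that lets the client
notice that a cached name now refers to a renamed file.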
If the file on the server is unlinked, or another file is mv'ed on
top of it, its file handle should no longer be valid, so the
VOP_GETATTR() call should fail, which should cause the cache entry to
be purged and a new lookup RPC to be done.

What I find interesting is that in order for open() to fail with the
ESTALE error, the cache entry must be used, which means that this
VOP_GETATTR() call must be succeeding, but for some reason another
VOP call after namei() returns is failing with ESTALE.

>> Here's the output of the script:
>>
>> #!/bin/sh -v
>> rm -f file1 file2
>> ssh -n mousie rm -f file1 file2
>> echo foo > file1
>> echo bar > file2
>> ssh -n mousie cat file1
>> foo
>> ssh -n mousie cat file2
>> bar
>> tail -f file1 &
>> sleep 1
>> foo
>> cat file1
>> foo
>> cat file2
>> bar
>> ssh -n mousie 'mv file1 tmpfile; mv file2 file1; mv tmpfile file2'
>> cat file1
>> bar
>> cat file2
>> foo
>> echo baz >> file2
>> sleep 1
>> baz
>> kill $!
>> Terminated
>> ssh -n mousie cat file1
>> bar
>> ssh -n mousie cat file2
>> foo
>> baz
>>
>> Notice that immediately after the files are swapped on the server,
>> the cat commands on the client are able to detect that the files
>> have been interchanged, and they open the correct files.  The tail
>> command shows that the original handle for file1 remains valid
>> after the rename operations: when more data is written to file2
>> after the interchange, the data is appended to the file that was
>> formerly file1.

> By the way, what were the values of
> acregmin/acregmax/acdirmin/acdirmax, and also the value of
> vfs.nfs.access_cache_timeout, in your tests?

I'm using the default values for acregmin/acregmax/acdirmin/acdirmax.

% sysctl vfs.nfs.access_cache_timeout
vfs.nfs.access_cache_timeout: 2

> I believe the results of your test patterns depend heavily on the
> NFS attribute cache tunables (which affect all cycles of NFS
> operation) and on the command execution timing as well.  Moreover, I
> suspect that all of this is also strongly tied to the type and
> sequence of operations on both the server and the client.  Recall, I
> set out to fix just *one* common scenario :)

Some of my test cases waited for 120 seconds after the rename on the
server before attempting access from the client, which should be
enough time for the attribute cache to time out.
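To illustrate why 120 seconds should be plenty: the client only
trusts cached attributes while they are younger than an adaptive
timeout that is clamped between acregmin and acregmax.  The sketch
below is a simplified userland model with invented names, not the
kernel code (see nfs_getattrcache() for the real logic); the "/ 10"
aging heuristic and the 3/60 clamp values are illustrative
assumptions.

#include <stdio.h>
#include <time.h>

/*
 * Model of the adaptive attribute-cache timeout: attributes of a
 * file that changed recently expire quickly, older files are trusted
 * longer, and the result is clamped to the acregmin/acregmax mount
 * options.  Invented names; illustrative only.
 */
struct cached_attrs {
	time_t	fetched;	/* when the attributes were cached */
	time_t	mtime;		/* the file's mtime as of that fetch */
};

static int
attrs_still_valid(const struct cached_attrs *ca, time_t now,
    time_t acregmin, time_t acregmax)
{
	time_t timeo = (ca->fetched - ca->mtime) / 10;

	if (timeo < acregmin)
		timeo = acregmin;
	else if (timeo > acregmax)
		timeo = acregmax;
	return (now - ca->fetched < timeo);
}

int
main(void)
{
	time_t now = time(NULL);
	struct cached_attrs ca = { now - 120, now - 120 };

	/* 120 seconds after the fetch, even an acregmax of 60 has expired. */
	printf("still valid: %d\n", attrs_still_valid(&ca, now, 3, 60));
	return (0);
}

Whatever the tunables are within their normal ranges, nothing cached
120 seconds earlier survives an acregmax of 60, so the client should
have gone back to the server for fresh attributes by then.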
> With different values of acmin/acmax and access_cache_timeout, and
> manual operations, I was able to achieve both the result you
> consider "proper" above and the "wrong" effect that you describe
> below.

>> And its output:
>>
>> #!/bin/sh -v
>> rm -f file1 file2
>> ssh -n mousie rm -f file1 file2
>> echo foo > file1
>> echo bar > file2
>> ssh -n mousie cat file1
>> foo
>> ssh -n mousie cat file2
>> bar
>> sleep 1
>> cat file1
>> foo
>> cat file2
>> bar
>> ssh -n mousie 'mv file1 file2'
>> cat file2
>> foo
>> cat file1
>> cat: file1: No such file or directory
>>
>> Even though file2 was unlinked and replaced by file1 on the server,
>> the client immediately notices the change and is able to open the
>> proper file.

> My tests always eventually produce ESTALE for file2 here.  However,
> I suspect there must be configurations where I won't get ESTALE.

>> Conclusion: relying on seeing an ESTALE error to retry is
>> insufficient.  Depending on how files are manipulated, open() may
>> successfully return a descriptor for the wrong file and even enable
>> the contents of that file to be overwritten.  The namei()/lookup()
>> code is broken and that's what needs to be fixed.

> I don't think it's namei()/lookup() that is broken.  I'm afraid the
> name and attribute caching logic is somewhat far from ideal.  The
> namecache routines seem to work fine; they just do the actual
> parsing/lookup of a pathname.  Other functions manipulate the cached
> names based on their understanding of the cache validity (both the
> namecache and the cached dir/file attributes).

I think the main problem is namei()/lookup().  They shouldn't be
returning a vnode that is associated with a file handle that points
to a different or non-existent file on the server if the
name-to-handle association has been invalid for a long period of
time.  While it's not possible to totally enforce coherency, we
should be able to do a lot better.

> I've also done a number of tcpdumps for different test patterns, and
> I believe that what happens to the cached vnode may depend on the
> results of the "access" RPC request to the server.

That may be an important clue.  The access cache may be working
properly, but the attribute cache timeout may be broken.

> As I said before, I was not going to fix all the NFS inefficiencies
> related to heavily shared file environments.  However, I still
> believe that open-retry-on-ESTALE *may* help people avoid certain
> erratic conditions.  At least, I think that having this
> functionality switchable with an additional sysctl variable *may*
> help lots of people in the black art of tuning NFS caching.  As
> there is no exact description of how all of this behaves, people
> usually have to experiment with their own environments.
>
> Also, I agree it's not the "fix" for everything.  And I didn't even
> mention that I want this to be integrated into the source :)

I don't think we can totally fix the problem, but I would like to see
the source fixed so that people don't have to patch their
applications or their kernel for common usage patterns.

> Actually, I know that it works for what I've been fixing locally,
> and I just asked for technical comments about possible "vnode
> leakage" and nameidata initialization, which nobody has provided
> yet ;-P

I think you're probably OK on the vnode side, but one problem might
be the flags in the struct nameidata; the lookup code tends to fiddle
with them.  I was also concerned about leaking the cn_pnbuf buffer,
but it looks like it may not get allocated, or may get freed, in the
error case, since kern_open() doesn't call
NDFREE(&nd, NDF_ONLY_PNBUF) if namei() fails.

> I appreciate *very much* all of the answers, though.  Definitely
> food for thought, but I'm a little bit tired of this issue already
> :)
>
> Thanks again for your efforts.
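For reference, the application-side workaround that a source fix
would make unnecessary, retrying open() when it fails with ESTALE so
that a fresh lookup can replace the stale handle, amounts to
something like the following sketch (my illustration; the retry limit
of three is an arbitrary choice):

#include <errno.h>
#include <fcntl.h>
#include <sys/types.h>

/*
 * Retry open(2) a bounded number of times when the server reports a
 * stale file handle.  By the time open() returns ESTALE, the client
 * should have purged the bad cache entry, so the next attempt forces
 * a new lookup RPC that can rebuild the name-to-handle mapping.
 */
static int
open_retry_estale(const char *path, int flags, mode_t mode)
{
	int fd, tries;

	for (tries = 0; tries < 3; tries++) {
		fd = open(path, flags, mode);
		if (fd != -1 || errno != ESTALE)
			break;
	}
	return (fd);
}

As the test scripts above show, this helps only when the client
actually sees ESTALE; it does nothing for the cases where open()
silently returns a descriptor for the wrong file.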