From owner-freebsd-fs@FreeBSD.ORG Wed Jan 18 22:07:26 2012 Return-Path: Delivered-To: fs@freebsd.org Received: from mx1.freebsd.org (mx1.freebsd.org [IPv6:2001:4f8:fff6::34]) by hub.freebsd.org (Postfix) with ESMTP id 02CA4106566B; Wed, 18 Jan 2012 22:07:26 +0000 (UTC) (envelope-from jhb@freebsd.org) Received: from cyrus.watson.org (cyrus.watson.org [65.122.17.42]) by mx1.freebsd.org (Postfix) with ESMTP id BB7338FC08; Wed, 18 Jan 2012 22:07:25 +0000 (UTC) Received: from bigwig.baldwin.cx (bigwig.baldwin.cx [96.47.65.170]) by cyrus.watson.org (Postfix) with ESMTPSA id 3D06746B2C; Wed, 18 Jan 2012 17:07:25 -0500 (EST) Received: from jhbbsd.localnet (unknown [209.249.190.124]) by bigwig.baldwin.cx (Postfix) with ESMTPSA id 9B0C1B91C; Wed, 18 Jan 2012 17:07:22 -0500 (EST) From: John Baldwin To: Rick Macklem Date: Wed, 18 Jan 2012 17:07:21 -0500 User-Agent: KMail/1.13.5 (FreeBSD/8.2-CBSD-20110714-p10; KDE/4.5.5; amd64; ; ) MIME-Version: 1.0 Content-Type: Text/Plain; charset="us-ascii" Content-Transfer-Encoding: 7bit Message-Id: <201201181707.21293.jhb@freebsd.org> X-Greylist: Sender succeeded SMTP AUTH, not delayed by milter-greylist-4.2.7 (bigwig.baldwin.cx); Wed, 18 Jan 2012 17:07:22 -0500 (EST) Cc: Peter Wemm , fs@freebsd.org Subject: Race in NFS lookup can result in stale namecache entries X-BeenThere: freebsd-fs@freebsd.org X-Mailman-Version: 2.1.5 Precedence: list List-Id: Filesystems List-Unsubscribe: , List-Archive: List-Post: List-Help: List-Subscribe: , X-List-Received-Date: Wed, 18 Jan 2012 22:07:26 -0000 I recently encountered a problem at work with a very stale name cache entry. A directory was renamed on one NFS client and a new directory was created with the same name. On another NFS client, both the old and new pathnames resolved to the old filehandle and stayed that way for days. It was only fixed by touching the parent directory which forced the "wrong" NFS client to flush name cache entries for the directory and repopulate it via LOOKUPs. I eventually figured out the race condition that triggered this and was able to reproduce it. (I had to hack up the NFS client to do some explicit sleeps to order the steps right to trigger the race however. It seems to be very rare in practice.) The root cause for the stale entry being trusted is that each per-vnode nfsnode structure has a single 'n_ctime' timestamp used to validate positive name cache entries. However, if there are multiple entries for a single vnode, they were all sharing a single timestamp. Assume you have three threads spread across two NFS clients (R1 on the client doing the directory rename, and T1 and T2 on the "victim" NFS client), and assume that thread S1 represents the NFS server and the order it completes requests. Also, assume that $D represents a parent directory where the rename occurs and that the original directory is named "foo". Finally, assume that F1 is the original directory's filehandle, and F2 is the new filehandle. Time increases as the graph goes down: R1 T1 T2 S1 ------------- ------------- ------------- --------------- LOOKUP "$D/foo" (1) REPLY (1) "foo" F1 start reply processing up to extracting post-op attrs RENAME "$D/foo" "$D/baz" (2) REPLY (2) GETATTR $D during lookup due to expiry (3) REPLY (3) flush $D name cache entries due to updated timestamp LOOKUP "$D/baz" (4) REPLY (4) "baz" F1 process reply, including post-op attrs that set F1's cached attrs to a ctime post RENAME resume reply finish reply processing processing including including setting F1's setting F1's n_ctime and n_ctime and adding cache adding cache entry entry At the end of this, the "victim" NFS client now has two name cache entries for "$D/foo" and "$D/baz" that point to the F1 filehandle. The n_ctime used to validate these name cache hits in nfs_lookup() is already updated to post RENAME, so nfs_lookup() will trust these entries until a future change to F1's i-node. Further, "$D"'s local attribute cache already reflects the updated ctime post RENAME, so it will not flush it's name cache entries until a future change to the directory. The root problem is that the name cache entry for "foo" was added using the wrong ctime. It really should be using the F1 attributes in the post-op attributes from the LOOKUP reply, not from F1's local attribute cache. However, just changing that is not sufficient. There are still races with the calls to cache_enter() and updating n_ctime. What I concluded is that it would really be far simpler and more obvious if the cached timestamps were stored in the namecache entry directly rather than having multiple name cache entries validated by shared state in the nfsnode. This does mean allowing the name cache to hold some filesystem-specific state. However, I felt this was much cleaner than adding a lot more complexity to nfs_lookup(). Also, this turns out to be fairly non-invasive to implement since nfs_lookup() calls cache_lookup() directly, but other filesystems only call it indirectly via vfs_cache_lookup(). I considered letting filesystems store a void * cookie in the name cache entry and having them provide a destructor, etc. However, that would require extra allocations for NFS lookups. Instead, I just adjusted the name cache API to explicitly allow the filesystem to store a single timestamp in a name cache entry by adding a new 'cache_enter_time()' that accepts a struct timespec that is copied into the entry. 'cache_enter_time()' also saves the current value of 'ticks' in the entry. 'cache_lookup()' is modified to add two new arguments used to return the timespec and ticks value used for a namecache entry when a hit in the cache occurs. One wrinkle with this is that the name cache does not create actual entries for ".", and thus it would not store any timestamps for those lookups. To fix this I changed the NFS client to explicitly fast-path lookups of "." by always returning the current directory as setup by cache_lookup() and never bothering to do a LOOKUP or check for stale attributes in that case. The current patch against 8 is at http://www.FreeBSD.org/~jhb/patches/nfs_lookup.patch It includes ABI and API compat shims so that it is suitable for merging to stable branches. For HEAD I would likely retire the cache_lookup_times() name and just change all the callers of cache_lookup() (there are only a handful, and nwfs and smbfs might benefit from this functionality anyway). -- John Baldwin