From owner-freebsd-fs@FreeBSD.ORG Wed Jan 18 23:52:59 2012
Date: Wed, 18 Jan 2012 18:52:57 -0500 (EST)
From: Rick Macklem <rmacklem@uoguelph.ca>
To: John Baldwin
Cc: Rick Macklem, fs@freebsd.org, Peter Wemm
Message-ID: <1143916684.516944.1326930777192.JavaMail.root@erie.cs.uoguelph.ca>
In-Reply-To: <201201181707.21293.jhb@freebsd.org>
Subject: Re: Race in NFS lookup can result in stale namecache entries

John Baldwin wrote:
> I recently encountered a problem at work with a very stale name cache
> entry. A directory was renamed on one NFS client and a new directory
> was created with the same name. On another NFS client, both the old
> and new pathnames resolved to the old filehandle and stayed that way
> for days. It was only fixed by touching the parent directory, which
> forced the "wrong" NFS client to flush its name cache entries for the
> directory and repopulate them via LOOKUPs. I eventually figured out
> the race condition that triggered this and was able to reproduce it.
> (I had to hack up the NFS client to do some explicit sleeps to order
> the steps right to trigger the race, however. It seems to be very
> rare in practice.)
>
> The root cause for the stale entry being trusted is that each
> per-vnode nfsnode structure has a single 'n_ctime' timestamp used to
> validate positive name cache entries. However, if there are multiple
> entries for a single vnode, they all share that one timestamp.
>
> Assume you have three threads spread across two NFS clients (R1 on
> the client doing the directory rename, and T1 and T2 on the "victim"
> NFS client), and assume that thread S1 represents the NFS server and
> the order in which it completes requests. Also, assume that $D
> represents the parent directory where the rename occurs and that the
> original directory is named "foo". Finally, assume that F1 is the
> original directory's filehandle, and F2 is the new filehandle.
> Time increases as the graph goes down:
>
>       R1                T1                T2                S1
>  -------------     -------------     -------------     ---------------
>                    LOOKUP "$D/foo"
>                        (1)
>                                                         REPLY (1) "foo" F1
>                    start reply
>                    processing
>                    up to
>                    extracting
>                    post-op attrs
>  RENAME "$D/foo"
>  "$D/baz" (2)
>                                                         REPLY (2)
>                                      GETATTR $D
>                                      during lookup
>                                      due to expiry
>                                          (3)
>                                                         REPLY (3)
>                                      flush $D name
>                                      cache entries
>                                      due to updated
>                                      timestamp
>                                      LOOKUP "$D/baz"
>                                          (4)
>                                                         REPLY (4) "baz" F1
>                                      process reply,
>                                      including
>                                      post-op attrs
>                                      that set F1's
>                                      cached attrs
>                                      to a ctime
>                                      post RENAME
>
>                    resume reply      finish reply
>                    processing        processing
>                    including         including
>                    setting F1's      setting F1's
>                    n_ctime and       n_ctime and
>                    adding cache      adding cache
>                    entry             entry
>
> At the end of this, the "victim" NFS client now has two name cache
> entries, for "$D/foo" and "$D/baz", that point to the F1 filehandle.
> The n_ctime used to validate these name cache hits in nfs_lookup()
> has already been updated to post-RENAME, so nfs_lookup() will trust
> these entries until a future change to F1's i-node. Further, "$D"'s
> local attribute cache already reflects the updated ctime post-RENAME,
> so it will not flush its name cache entries until a future change to
> the directory.
>
> The root problem is that the name cache entry for "foo" was added
> using the wrong ctime. It really should be using the F1 attributes
> in the post-op attributes from the LOOKUP reply, not from F1's local
> attribute cache. However, just changing that is not sufficient:
> there are still races between the calls to cache_enter() and the
> updates of n_ctime.
>
> What I concluded is that it would really be far simpler and more
> obvious if the cached timestamps were stored in the namecache entry
> directly, rather than having multiple name cache entries validated
> by shared state in the nfsnode. This does mean allowing the name
> cache to hold some filesystem-specific state, but I felt this was
> much cleaner than adding a lot more complexity to nfs_lookup().
> Also, this turns out to be fairly non-invasive to implement, since
> nfs_lookup() calls cache_lookup() directly, while other filesystems
> only call it indirectly via vfs_cache_lookup(). I considered letting
> filesystems store a void * cookie in the name cache entry and having
> them provide a destructor, etc., but that would require extra
> allocations for NFS lookups. Instead, I just adjusted the name cache
> API to explicitly allow the filesystem to store a single timestamp
> in a name cache entry by adding a new 'cache_enter_time()' that
> accepts a struct timespec that is copied into the entry.
> 'cache_enter_time()' also saves the current value of 'ticks' in the
> entry. 'cache_lookup()' is modified to add two new arguments used to
> return the timespec and ticks value stored in a namecache entry when
> a hit in the cache occurs.
>
> One wrinkle with this is that the name cache does not create actual
> entries for ".", and thus it would not store any timestamps for
> those lookups. To fix this, I changed the NFS client to explicitly
> fast-path lookups of "." by always returning the current directory
> as set up by cache_lookup() and never bothering to do a LOOKUP or
> check for stale attributes in that case.
>
> The current patch against 8 is at
> http://www.FreeBSD.org/~jhb/patches/nfs_lookup.patch
>
> It includes ABI and API compat shims so that it is suitable for
> merging to stable branches.
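
A rough sketch of the revalidation logic this implies for nfs_lookup()
follows. It is inferred from the description above rather than taken
from the patch: the exact cache_lookup_times() and cache_enter_time()
argument lists are assumptions, control flow is elided, and
postop_ctime stands in for the ctime carried in the LOOKUP reply's
post-op attributes.

    /*
     * Sketch only: per-entry timestamp validation as described above.
     * cache_lookup_times() is assumed to return the timespec and the
     * 'ticks' value that were recorded when the entry was created.
     */
    struct timespec nctime;     /* ctime saved in the namecache entry */
    int ncticks;                /* 'ticks' when the entry was added */
    struct vattr vattr;
    int error;

    error = cache_lookup_times(dvp, vpp, cnp, &nctime, &ncticks);
    if (error == -1) {
        /*
         * Positive hit: validate against the ctime stored in this
         * particular entry, not a single n_ctime shared by every
         * entry that resolves to the same vnode.
         */
        if (VOP_GETATTR(*vpp, &vattr, cnp->cn_cred) == 0 &&
            timespeccmp(&vattr.va_ctime, &nctime, ==))
            return (0);         /* entry still believed valid */
        cache_purge(*vpp);      /* stale; fall through to a LOOKUP RPC */
    }

    /*
     * After a successful LOOKUP RPC, enter the name using the ctime
     * from the reply's post-op attributes (postop_ctime), rather than
     * whatever happens to be in the vnode's local attribute cache by
     * the time the reply is processed.
     */
    cache_enter_time(dvp, *vpp, cnp, &postop_ctime);

Because the validating ctime travels with each entry, two entries that
resolve to the same vnode (like "foo" and "baz" above) can no longer
validate each other through shared nfsnode state.
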
> For HEAD I would likely retire the cache_lookup_times() name and
> just change all the callers of cache_lookup() (there are only a
> handful, and nwfs and smbfs might benefit from this functionality
> anyway).

It sounds good to me, although I haven't yet looked at the patch or
thought about it much.

However (and I think you're already aware of this), given clock
resolution etc., as soon as multiple clients start manipulating the
contents of a directory concurrently, there is always going to be the
possibility of a stale name cache entry. Having a timeout on positive
name cache entries, like the one we use for negative name cache
entries, will at least limit the effect of these. For negative name
cache entries, the little test I did showed that the cache hit rate
was almost as good with a 30-60 second timeout as with an infinite
timeout. I suspect something similar is true for positive name cache
entries, and it will be easy to do some measurements once it is coded.

If you would like, I can code up a positive name cache timeout similar
to what you did for the negative name cache entries, or would you
prefer to do so?

rick
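
For concreteness, a minimal sketch of the positive name cache timeout
Rick proposes, reusing the ticks value that cache_enter_time() is said
to record in each entry. The tunable name and its 60-second default
are hypothetical, chosen only to mirror the negative-entry timeout
idea; nothing here is from committed code.

    /*
     * Hypothetical tunable; the name and default are illustrative
     * only, patterned on the existing negative-entry timeout.
     */
    static int nfs_poscache_timeout = 60;   /* seconds */

    /*
     * In the positive-hit path of nfs_lookup(): treat the entry as
     * stale once it is older than the timeout, even if the cached
     * ctime still matches. The u_int cast copes with 'ticks'
     * wrapping around.
     */
    if ((u_int)(ticks - ncticks) >= nfs_poscache_timeout * hz) {
        cache_purge(*vpp);
        error = 0;      /* pretend a miss; force an over-the-wire LOOKUP */
    }

Such a timeout bounds how long any stale entry produced by the race
can survive, at the cost of extra LOOKUP RPCs once entries age out.
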