From owner-freebsd-fs@FreeBSD.ORG  Wed Jan 18 22:07:26 2012
Return-Path: <owner-freebsd-fs@FreeBSD.ORG>
Delivered-To: fs@freebsd.org
Received: from mx1.freebsd.org (mx1.freebsd.org [IPv6:2001:4f8:fff6::34])
	by hub.freebsd.org (Postfix) with ESMTP id 02CA4106566B;
	Wed, 18 Jan 2012 22:07:26 +0000 (UTC) (envelope-from jhb@freebsd.org)
Received: from cyrus.watson.org (cyrus.watson.org [65.122.17.42])
	by mx1.freebsd.org (Postfix) with ESMTP id BB7338FC08;
	Wed, 18 Jan 2012 22:07:25 +0000 (UTC)
Received: from bigwig.baldwin.cx (bigwig.baldwin.cx [96.47.65.170])
	by cyrus.watson.org (Postfix) with ESMTPSA id 3D06746B2C;
	Wed, 18 Jan 2012 17:07:25 -0500 (EST)
Received: from jhbbsd.localnet (unknown [209.249.190.124])
	by bigwig.baldwin.cx (Postfix) with ESMTPSA id 9B0C1B91C;
	Wed, 18 Jan 2012 17:07:22 -0500 (EST)
From: John Baldwin <jhb@freebsd.org>
To: Rick Macklem <rmacklem@freebsd.org>
Date: Wed, 18 Jan 2012 17:07:21 -0500
User-Agent: KMail/1.13.5 (FreeBSD/8.2-CBSD-20110714-p10; KDE/4.5.5; amd64; ; )
MIME-Version: 1.0
Content-Type: Text/Plain;
  charset="us-ascii"
Content-Transfer-Encoding: 7bit
Message-Id: <201201181707.21293.jhb@freebsd.org>
X-Greylist: Sender succeeded SMTP AUTH, not delayed by milter-greylist-4.2.7
	(bigwig.baldwin.cx); Wed, 18 Jan 2012 17:07:22 -0500 (EST)
Cc: Peter Wemm <peter@freebsd.org>, fs@freebsd.org
Subject: Race in NFS lookup can result in stale namecache entries
X-BeenThere: freebsd-fs@freebsd.org
X-Mailman-Version: 2.1.5
Precedence: list
List-Id: Filesystems <freebsd-fs.freebsd.org>
List-Unsubscribe: <http://lists.freebsd.org/mailman/listinfo/freebsd-fs>,
	<mailto:freebsd-fs-request@freebsd.org?subject=unsubscribe>
List-Archive: <http://lists.freebsd.org/pipermail/freebsd-fs>
List-Post: <mailto:freebsd-fs@freebsd.org>
List-Help: <mailto:freebsd-fs-request@freebsd.org?subject=help>
List-Subscribe: <http://lists.freebsd.org/mailman/listinfo/freebsd-fs>,
	<mailto:freebsd-fs-request@freebsd.org?subject=subscribe>
X-List-Received-Date: Wed, 18 Jan 2012 22:07:26 -0000

I recently encountered a problem at work with a very stale name cache
entry.  A directory was renamed on one NFS client and a new directory
was created with the same name.  On another NFS client, both the old
and new pathnames resolved to the old filehandle and stayed that way
for days.  It was only fixed by touching the parent directory which
forced the "wrong" NFS client to flush name cache entries for the
directory and repopulate it via LOOKUPs.  I eventually figured out the
race condition that triggered this and was able to reproduce it.  (I
had to hack up the NFS client to do some explicit sleeps to order the
steps right to trigger the race however.  It seems to be very rare in
practice.)  The root cause for the stale entry being trusted is that
each per-vnode nfsnode structure has a single 'n_ctime' timestamp used
to validate positive name cache entries.  However, if there are
multiple entries for a single vnode, they were all sharing a single
timestamp.  Assume you have three threads spread across two NFS
clients (R1 on the client doing the directory rename, and T1 and T2 on
the "victim" NFS client), and assume that thread S1 represents the NFS
server and the order it completes requests.  Also, assume that $D
represents a parent directory where the rename occurs and that the
original directory is named "foo".  Finally, assume that F1 is the
original directory's filehandle, and F2 is the new filehandle.
Time increases as the graph goes down:

R1              T1              T2              S1
-------------   -------------   -------------   ---------------
                LOOKUP "$D/foo"
                  (1)
                                                REPLY (1) "foo" F1
                start reply
                 processing
                 up to
                 extracting
                 post-op attrs
RENAME "$D/foo"
  "$D/baz" (2)
                                                REPLY (2)
                                GETATTR $D
                                 during lookup
                                 due to expiry
                                 (3)
                                                REPLY (3)
                                flush $D name
                                 cache entries
                                 due to updated
                                 timestamp
                                LOOKUP "$D/baz"
                                 (4)
                                                REPLY (4) "baz" F1
                                process reply,
                                  including
                                  post-op attrs
                                  that set F1's
                                  cached attrs
                                  to a ctime
                                  post RENAME

                resume reply    finish reply
                 processing       processing
                 including        including
                 setting F1's     setting F1's
                 n_ctime and      n_ctime and
                 adding cache     adding cache
                 entry            entry

At the end of this, the "victim" NFS client now has two name cache
entries for "$D/foo" and "$D/baz" that point to the F1 filehandle.
The n_ctime used to validate these name cache hits in nfs_lookup() is
already updated to post RENAME, so nfs_lookup() will trust these
entries until a future change to F1's i-node.  Further, "$D"'s local
attribute cache already reflects the updated ctime post RENAME, so it
will not flush it's name cache entries until a future change to the
directory.

The root problem is that the name cache entry for "foo" was added
using the wrong ctime.  It really should be using the F1 attributes in
the post-op attributes from the LOOKUP reply, not from F1's local
attribute cache.  However, just changing that is not sufficient.
There are still races with the calls to cache_enter() and updating
n_ctime.

What I concluded is that it would really be far simpler and more
obvious if the cached timestamps were stored in the namecache entry
directly rather than having multiple name cache entries validated by
shared state in the nfsnode.  This does mean allowing the name cache
to hold some filesystem-specific state.  However, I felt this was much
cleaner than adding a lot more complexity to nfs_lookup().  Also, this
turns out to be fairly non-invasive to implement since nfs_lookup()
calls cache_lookup() directly, but other filesystems only call it
indirectly via vfs_cache_lookup().  I considered letting filesystems
store a void * cookie in the name cache entry and having them provide
a destructor, etc.  However, that would require extra allocations for
NFS lookups.  Instead, I just adjusted the name cache API to
explicitly allow the filesystem to store a single timestamp in a name
cache entry by adding a new 'cache_enter_time()' that accepts a struct
timespec that is copied into the entry.  'cache_enter_time()' also
saves the current value of 'ticks' in the entry.  'cache_lookup()' is
modified to add two new arguments used to return the timespec and
ticks value used for a namecache entry when a hit in the cache occurs.

One wrinkle with this is that the name cache does not create actual
entries for ".", and thus it would not store any timestamps for those
lookups.  To fix this I changed the NFS client to explicitly fast-path
lookups of "." by always returning the current directory as setup by
cache_lookup() and never bothering to do a LOOKUP or check for stale
attributes in that case.

The current patch against 8 is at
http://www.FreeBSD.org/~jhb/patches/nfs_lookup.patch

It includes ABI and API compat shims so that it is suitable for
merging to stable branches.  For HEAD I would likely retire the
cache_lookup_times() name and just change all the callers of
cache_lookup() (there are only a handful, and nwfs and smbfs might
benefit from this functionality anyway).

-- 
John Baldwin