From owner-freebsd-fs@FreeBSD.ORG Sat Jan 21 22:12:28 2012
Message-ID: <4F1B384A.5070506@FreeBSD.org>
Date: Sat, 21 Jan 2012 17:12:26 -0500
From: John Baldwin
To: Kostik Belousov
Cc: Rick Macklem, fs@freebsd.org, Peter Wemm
References: <201201181707.21293.jhb@freebsd.org>
 <201201191026.09431.jhb@freebsd.org>
 <20120119160156.GF31224@deviant.kiev.zoral.com.ua>
 <201201191117.28128.jhb@freebsd.org>
 <20120121081257.GS31224@deviant.kiev.zoral.com.ua>
In-Reply-To: <20120121081257.GS31224@deviant.kiev.zoral.com.ua>
Subject: Re: Race in NFS lookup can result in stale namecache entries

On 1/21/12 3:12 AM, Kostik Belousov wrote:
> On Thu, Jan 19, 2012 at 11:17:28AM
> -0500, John Baldwin wrote:
>> On Thursday, January 19, 2012 11:01:56 am Kostik Belousov wrote:
>>> On Thu, Jan 19, 2012 at 10:26:09AM -0500, John Baldwin wrote:
>>>> On Thursday, January 19, 2012 9:06:13 am Kostik Belousov wrote:
>>>>> On Wed, Jan 18, 2012 at 05:07:21PM -0500, John Baldwin wrote:
>>>>> ...
>>>>>> What I concluded is that it would really be far simpler and more
>>>>>> obvious if the cached timestamps were stored in the namecache entry
>>>>>> directly rather than having multiple name cache entries validated by
>>>>>> shared state in the nfsnode. This does mean allowing the name cache
>>>>>> to hold some filesystem-specific state. However, I felt this was much
>>>>>> cleaner than adding a lot more complexity to nfs_lookup(). Also, this
>>>>>> turns out to be fairly non-invasive to implement since nfs_lookup()
>>>>>> calls cache_lookup() directly, but other filesystems only call it
>>>>>> indirectly via vfs_cache_lookup(). I considered letting filesystems
>>>>>> store a void * cookie in the name cache entry and having them provide
>>>>>> a destructor, etc. However, that would require extra allocations for
>>>>>> NFS lookups. Instead, I just adjusted the name cache API to
>>>>>> explicitly allow the filesystem to store a single timestamp in a name
>>>>>> cache entry by adding a new 'cache_enter_time()' that accepts a struct
>>>>>> timespec that is copied into the entry. 'cache_enter_time()' also
>>>>>> saves the current value of 'ticks' in the entry. 'cache_lookup()' is
>>>>>> modified to add two new arguments used to return the timespec and
>>>>>> ticks value used for a namecache entry when a hit in the cache occurs.
>>>>>>
>>>>>> One wrinkle with this is that the name cache does not create actual
>>>>>> entries for ".", and thus it would not store any timestamps for those
>>>>>> lookups. To fix this I changed the NFS client to explicitly fast-path
>>>>>> lookups of "."
>>>>>> by always returning the current directory as set up by
>>>>>> cache_lookup() and never bothering to do a LOOKUP or check for stale
>>>>>> attributes in that case.
>>>>>>
>>>>>> The current patch against 8 is at
>>>>>> http://www.FreeBSD.org/~jhb/patches/nfs_lookup.patch
>>>>> ...
>>>>>
>>>>> So now you add 8*2+4 bytes to each namecache entry on amd64 unconditionally.
>>>>> The current size of the invariant part of struct namecache on amd64 is
>>>>> 72 bytes, so the addition of 20 bytes looks slightly excessive. I am not
>>>>> sure about the typical distribution of the namecache nc_name length, so it
>>>>> is not obvious whether the change affects memory usage significantly.
>>>>>
>>>>> A flag could be added to nc_flags to indicate the presence of a timestamp.
>>>>> The timestamps would be conditionally placed after nc_nlen; we could
>>>>> probably use a union to ease the access. Then, the direct dereferences of
>>>>> nc_name would need to be converted to some inline function.
>>>>>
>>>>> I can do this after your patch is committed, if you consider the memory
>>>>> usage saving worth it.
>>>>
>>>> Hmm, if the memory usage really is worrying then I could move to using the
>>>> void * cookie method instead.
>>>
>>> I think the current approach is better than a cookie that, again, will be
>>> used only for NFS. With the cookie, you still have 8 bytes for each ncp.
>>> With a union, you do not have the overhead for !NFS.
>>>
>>> The default setup allows for ~300000 vnodes on a not too powerful amd64
>>> machine; with ncsizefactor 2, 8 bytes for the cookie comes to 4.5MB. For
>>> 20 bytes per ncp, we get 12MB overhead.
>>
>> Ok. If you want to tackle the union bits I'm happy to let you do so. That
>> will at least break up the changes a bit.
>
> Below is my take. The first version of the patch added both small and large
> zones with ts, but later I decided that large does not make sense.
> If wanted, it can be restored easily.

This looks good to me.
I think you are fine with always using the _ts structure for the large case.

-- 
John Baldwin
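
For readers following the thread, the layout discussed in the quoted text (a flag in the entry marking whether a timestamp is present, plus an inline accessor replacing direct nc_name dereferences) could be sketched roughly as below. This is an illustrative user-space sketch only, not the patch referred to above: struct nc_hdr, struct nc_plain, struct nc_ts, NCF_TS, nc_alloc(), nc_get_name() and nc_overhead_bytes() are hypothetical names chosen for the example, and the real struct namecache carries many more fields.

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>
#include <string.h>
#include <time.h>

#define NCF_TS 0x01	/* hypothetical flag: entry carries a timestamp */

/* Fields common to both layouts (heavily simplified for illustration). */
struct nc_hdr {
	unsigned char nc_flag;
	unsigned char nc_nlen;
};

/* Entry without a timestamp: the name follows the header directly. */
struct nc_plain {
	struct nc_hdr hdr;
	char nc_name[];
};

/* Entry with a timestamp: time and ticks sit between header and name. */
struct nc_ts {
	struct nc_hdr hdr;
	struct timespec nc_time;
	int nc_ticks;
	char nc_name[];
};

/* Accessor that hides where the name lives in each layout. */
static char *
nc_get_name(struct nc_hdr *ncp)
{
	assert(ncp != NULL);
	if (ncp->nc_flag & NCF_TS)
		return (((struct nc_ts *)ncp)->nc_name);
	return (((struct nc_plain *)ncp)->nc_name);
}

/* Allocate an entry; pass ts == NULL for a filesystem with no timestamp. */
static struct nc_hdr *
nc_alloc(const char *name, const struct timespec *ts, int ticks)
{
	size_t nlen = strlen(name);
	struct nc_hdr *ncp;

	if (nlen > 255)
		return (NULL);
	if (ts != NULL) {
		struct nc_ts *n = malloc(sizeof(*n) + nlen + 1);

		if (n == NULL)
			return (NULL);
		n->hdr.nc_flag = NCF_TS;
		n->nc_time = *ts;
		n->nc_ticks = ticks;
		ncp = &n->hdr;
	} else {
		struct nc_plain *n = malloc(sizeof(*n) + nlen + 1);

		if (n == NULL)
			return (NULL);
		n->hdr.nc_flag = 0;
		ncp = &n->hdr;
	}
	ncp->nc_nlen = (unsigned char)nlen;
	memcpy(nc_get_name(ncp), name, nlen + 1);
	return (ncp);
}

/* Extra bytes across the cache: vnodes * ncsizefactor * bytes per entry. */
static unsigned long
nc_overhead_bytes(unsigned long vnodes, unsigned long factor,
    unsigned long per_entry)
{
	return (vnodes * factor * per_entry);
}
```

With the figures from the thread, nc_overhead_bytes(300000, 2, 8) gives 4800000 bytes and nc_overhead_bytes(300000, 2, 20) gives 12000000 bytes, in line with the roughly 4.5MB and 12MB quoted above. Only entries allocated with a timestamp pay for the extra fields, which is the point of keeping the overhead away from !NFS.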