From owner-freebsd-fs@FreeBSD.ORG Sun Oct 15 20:59:28 2006 Return-Path: X-Original-To: fs@freebsd.org Delivered-To: freebsd-fs@FreeBSD.ORG Received: from mx1.FreeBSD.org (mx1.freebsd.org [216.136.204.125]) by hub.freebsd.org (Postfix) with ESMTP id DF93516A403 for ; Sun, 15 Oct 2006 20:59:28 +0000 (UTC) (envelope-from mohan_srinivasan@yahoo.com) Received: from web30813.mail.mud.yahoo.com (web30813.mail.mud.yahoo.com [68.142.201.139]) by mx1.FreeBSD.org (Postfix) with SMTP id 48B6143D77 for ; Sun, 15 Oct 2006 20:59:24 +0000 (GMT) (envelope-from mohan_srinivasan@yahoo.com) Received: (qmail 49806 invoked by uid 60001); 15 Oct 2006 20:59:23 -0000 DomainKey-Signature: a=rsa-sha1; q=dns; c=nofws; s=s1024; d=yahoo.com; h=Message-ID:Received:Date:From:Subject:To:Cc:In-Reply-To:MIME-Version:Content-Type:Content-Transfer-Encoding; b=R0/MP9JLtGZo49ZRVB5fPrdVjLR7tTpRvsarQphAO68+8I1ZqB8/IDHr0heP0xELYmGfsqBbxirE8hwnevS2fGSzxzeTOWsRiS4fYW3uUwfIK7CDxDBb4E0DvDa33zHkpzTKUH1D/S9PWkT2smZVhrMZ3G0PUHNPqKL9pgkYEHM= ; Message-ID: <20061015205923.49804.qmail@web30813.mail.mud.yahoo.com> Received: from [71.139.1.197] by web30813.mail.mud.yahoo.com via HTTP; Sun, 15 Oct 2006 13:59:23 PDT Date: Sun, 15 Oct 2006 13:59:23 -0700 (PDT) From: Mohan Srinivasan To: Bruce Evans , fs@freebsd.org In-Reply-To: <20061014143825.F1264@epsplex.bde.org> MIME-Version: 1.0 Content-Type: text/plain; charset=iso-8859-1 Content-Transfer-Encoding: 8bit Cc: mohans@freebsd.org Subject: Re: lost dotdot caching pessimizes nfs especially X-BeenThere: freebsd-fs@freebsd.org X-Mailman-Version: 2.1.5 Precedence: list List-Id: Filesystems List-Unsubscribe: , List-Archive: List-Post: List-Help: List-Subscribe: , X-List-Received-Date: Sun, 15 Oct 2006 20:59:29 -0000 Bruce Defending the "silliness" of the first of the 2 changes you cite as I am the author of that change. Just got back from a short break and am still catching up on this thread. 
The clearing of the attrcache on nfs_open() is a requirement for close-to-open consistency, and this change fixed bugs that we saw internally relating to close-to-open consistency. > and associated changes give silly behaviour that almost doubles the > number of Access RPCs. One of the associated changes clears n_attrstamp > on close(). Then on open(), since lookup() is called before the above > is reached, nfs_access_otw() has always just been called, and the above ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ > forces another call. That is not true with NFSv2 which doesn't have an access call, in which case nfsspec_access() calls VOP_GETATTR, which may or may not go over the wire. Also, what would happen with NFSv3 if we get an access cache hit ? If lookup() can be made to pass a flag into nfs_open() that an otw getattr was done, then we can eliminate the clearing of the attrcache in nfs_open(). But absent that flag, I don't see how you can eliminate the fetch of fresh attributes in nfs_open(). 
mohan From owner-freebsd-fs@FreeBSD.ORG Sun Oct 15 21:08:46 2006 Return-Path: X-Original-To: fs@freebsd.org Delivered-To: freebsd-fs@FreeBSD.ORG Received: from mx1.FreeBSD.org (mx1.freebsd.org [216.136.204.125]) by hub.freebsd.org (Postfix) with ESMTP id CC99A16A494 for ; Sun, 15 Oct 2006 21:08:46 +0000 (UTC) (envelope-from mohan_srinivasan@yahoo.com) Received: from web30812.mail.mud.yahoo.com (web30812.mail.mud.yahoo.com [68.142.201.138]) by mx1.FreeBSD.org (Postfix) with SMTP id 4F7CF43D76 for ; Sun, 15 Oct 2006 21:08:43 +0000 (GMT) (envelope-from mohan_srinivasan@yahoo.com) Received: (qmail 89309 invoked by uid 60001); 15 Oct 2006 21:08:42 -0000 DomainKey-Signature: a=rsa-sha1; q=dns; c=nofws; s=s1024; d=yahoo.com; h=Message-ID:Received:Date:From:Subject:To:Cc:In-Reply-To:MIME-Version:Content-Type:Content-Transfer-Encoding; b=zSDNjdpoGTccssaxhSCyIqnfgwU20SbOVRg86Ta1qBqcW5+NFMhVbjq9lfuTg9Q+hFAXQtvU2NEg+5S4Mk3/HB0rgGLtjSytfRfpAnfGa7vkK8HK1LHmxEGINXSqNraAxAUb99fydt5uEYLZIbb7YJhd2mrj9vZpLYt3fFNSaks= ; Message-ID: <20061015210842.89307.qmail@web30812.mail.mud.yahoo.com> Received: from [71.139.1.197] by web30812.mail.mud.yahoo.com via HTTP; Sun, 15 Oct 2006 14:08:42 PDT Date: Sun, 15 Oct 2006 14:08:42 -0700 (PDT) From: Mohan Srinivasan To: Mohan Srinivasan , Bruce Evans , fs@freebsd.org In-Reply-To: <20061015205923.49804.qmail@web30813.mail.mud.yahoo.com> MIME-Version: 1.0 Content-Type: text/plain; charset=iso-8859-1 Content-Transfer-Encoding: 8bit Cc: mohans@freebsd.org Subject: Re: lost dotdot caching pessimizes nfs especially X-BeenThere: freebsd-fs@freebsd.org X-Mailman-Version: 2.1.5 Precedence: list List-Id: Filesystems List-Unsubscribe: , List-Archive: List-Post: List-Help: List-Subscribe: , X-List-Received-Date: Sun, 15 Oct 2006 21:08:46 -0000 Bruce, Not sure if you are committing a change eliminating that line in nfs_open() that clears the attrcache. But if you're doing so, please test to make sure you don't break close-to-open consistency. 
If you're going to optimize, please give priority to correctness first. If you are convinced that a lookup() will fetch fresh attrs in *all* cases, then by all means go ahead and remove that line. mohan --- Mohan Srinivasan wrote: > Bruce > > Defending the "silliness" of the first of the 2 changes you cite as I am the author of that > change. Just got back from a short break and am still catching up on this thread. > > The clearing of the attrcache on nfs_open() is a requirement for close-to-open > consistency, and this change fixed bugs that we saw internally relating to > close-to-open consistency. > > > and associated changes give silly behaviour that almost doubles the > > number of Access RPCs. One of the associated changes clears n_attrstamp > > on close(). Then on open(), since lookup() is called before the above > > is reached, nfs_access_otw() has always just been called, and the above > ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ > > forces another call. > > That is not true with NFSv2 which doesn't have an access call, in which case > nfsspec_access() calls VOP_GETATTR, which may or may not go over the wire. > > Also, what would happen with NFSv3 if we get an access cache hit ? > > If lookup() can be made to pass a flag into nfs_open() that an otw getattr was > done, then we can eliminate the clearing of the attrcache in nfs_open(). But > absent that flag, I don't see how you can eliminate the fetch of fresh attributes > in nfs_open(). 
> > mohan > From owner-freebsd-fs@FreeBSD.ORG Sun Oct 15 21:26:22 2006 Return-Path: X-Original-To: fs@freebsd.org Delivered-To: freebsd-fs@FreeBSD.ORG Received: from mx1.FreeBSD.org (mx1.freebsd.org [216.136.204.125]) by hub.freebsd.org (Postfix) with ESMTP id F20A216A407 for ; Sun, 15 Oct 2006 21:26:22 +0000 (UTC) (envelope-from rick@snowhite.cis.uoguelph.ca) Received: from dargo.cs.uoguelph.ca (dargo.cs.uoguelph.ca [131.104.94.197]) by mx1.FreeBSD.org (Postfix) with ESMTP id DD14C43D73 for ; Sun, 15 Oct 2006 21:26:21 +0000 (GMT) (envelope-from rick@snowhite.cis.uoguelph.ca) Received: from snowhite.cis.uoguelph.ca (snowhite.cis.uoguelph.ca [131.104.48.1]) by dargo.cs.uoguelph.ca (8.13.1/8.13.1) with ESMTP id k9FLQFnp020996; Sun, 15 Oct 2006 17:26:15 -0400 Received: (from rick@localhost) by snowhite.cis.uoguelph.ca (8.9.3/8.9.3) id RAA49191; Sun, 15 Oct 2006 17:26:20 -0400 (EDT) Date: Sun, 15 Oct 2006 17:26:20 -0400 (EDT) From: rick@snowhite.cis.uoguelph.ca Message-Id: <200610152126.RAA49191@snowhite.cis.uoguelph.ca> To: fs@freebsd.org X-Scanned-By: MIMEDefang 2.52 on 131.104.94.197 Cc: mohan_srinivasan@yahoo.com Subject: Re: lost dotdot caching pessimizes nfs especially X-BeenThere: freebsd-fs@freebsd.org X-Mailman-Version: 2.1.5 Precedence: list List-Id: Filesystems List-Unsubscribe: , List-Archive: List-Post: List-Help: List-Subscribe: , X-List-Received-Date: Sun, 15 Oct 2006 21:26:23 -0000 > The clearing of the attrcache on nfs_open() is a requirement for close-to-open > consistency, and this change fixed bugs that we saw internally relating to > close-to-open consistency. I thought I'd just throw out some comments w.r.t. close-to-open consistency. The concept comes from the Andrew File System (before Transarc's AFS), where the client read the entire file upon Open and wrote the entire file to the server upon Close, if it was modified. Therefore, other clients that opened the file after the Close were guaranteed to see the changes. 
To the best of my knowledge, no NFS RFC has ever required this behaviour. It became common practice to flush writes to a server upon Close, so that errors like ENOSPC could be returned by close(2) and a process could be confident that the file was successfully saved if it didn't get an error return from any write(2) syscall nor the subsequent close(2). As a side effect of the above behaviour (not required by RFC, but common practice), NFS clients provided "approximate close-to-open consistency". The "approximate" came from the fact that another client wouldn't notice that the file had been modified until its attribute cache had timed out, a few seconds after the writing client had flushed its writes upon close. Somewhere along the way, some people seem to have decided that close-to-open consistency is required of NFS clients. I think the Linux crowd is in that camp? Since NFS doesn't have a cache coherency protocol (even for NFSv4, although the caching rules are somewhat more explicit in RFC3530), it is always a performance<->consistency tradeoff. So, I guess you guys will have to decide, rick ps: I do believe software that expects strict close-to-open consistency over NFS is not correct, because that is not a requirement of the RFCs.
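
The "approximate" consistency described in the message above can be illustrated with a toy model of a client attribute cache. This is only a sketch under stated assumptions, not FreeBSD code: `toy_node`, `toy_getattr()`, `toy_open_cto()` and the timeout value are hypothetical names invented for the illustration, and the "server" is just a number standing in for the file's modification time.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical client-side state for one file; not the real nfsnode. */
struct toy_node {
    long attrstamp;    /* time the cached attributes were fetched; 0 = invalid */
    long cached_mtime; /* modification time as last seen from the server */
    int  rpcs;         /* number of simulated over-the-wire Getattr calls */
};

/* Cached attributes are usable only if present and younger than the timeout.
 * Times are assumed to be positive, so that 0 can mean "invalid". */
static bool toy_attr_valid(const struct toy_node *np, long now, long timeout)
{
    return np->attrstamp != 0 && now - np->attrstamp < timeout;
}

/* Return the mtime the client believes, refreshing from the "server" on a miss. */
static long toy_getattr(struct toy_node *np, long server_mtime, long now,
                        long timeout)
{
    if (!toy_attr_valid(np, now, timeout)) {
        np->cached_mtime = server_mtime; /* simulated Getattr RPC */
        np->attrstamp = now;
        np->rpcs++;
    }
    return np->cached_mtime;
}

/* Open with close-to-open semantics: discard cached attributes first. */
static long toy_open_cto(struct toy_node *np, long server_mtime, long now,
                         long timeout)
{
    np->attrstamp = 0;
    return toy_getattr(np, server_mtime, now, timeout);
}
```

In this model a reader that relies on the timeout alone keeps seeing the old modification time until the timeout expires, while `toy_open_cto()` always pays one simulated RPC and always sees the current value. That is the performance/consistency tradeoff the message refers to.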
From owner-freebsd-fs@FreeBSD.ORG Sun Oct 15 21:35:14 2006 Return-Path: X-Original-To: fs@freebsd.org Delivered-To: freebsd-fs@FreeBSD.ORG Received: from mx1.FreeBSD.org (mx1.freebsd.org [216.136.204.125]) by hub.freebsd.org (Postfix) with ESMTP id 36B5C16A403 for ; Sun, 15 Oct 2006 21:35:14 +0000 (UTC) (envelope-from mohan_srinivasan@yahoo.com) Received: from web30813.mail.mud.yahoo.com (web30813.mail.mud.yahoo.com [68.142.201.139]) by mx1.FreeBSD.org (Postfix) with SMTP id DEEB143D4C for ; Sun, 15 Oct 2006 21:35:09 +0000 (GMT) (envelope-from mohan_srinivasan@yahoo.com) Received: (qmail 60931 invoked by uid 60001); 15 Oct 2006 21:35:09 -0000 DomainKey-Signature: a=rsa-sha1; q=dns; c=nofws; s=s1024; d=yahoo.com; h=Message-ID:Received:Date:From:Subject:To:Cc:In-Reply-To:MIME-Version:Content-Type:Content-Transfer-Encoding; b=Esd+M0gwv7AP+XDiSdAiLx/QKt9R1xptkpnXb4+zmqvDVZBMp8lUgpGqt4z2qitkpSA9HiQTUvh8eCW6GqEZ7KLRgvrH5HcXNPVoFyUjDfa42LjgtdZufEfD7Zt5dm8wYQxJfngdANynQ3MeSupw6FPPx+/vZH21dZ/VWOA3Jss= ; Message-ID: <20061015213509.60929.qmail@web30813.mail.mud.yahoo.com> Received: from [71.139.1.197] by web30813.mail.mud.yahoo.com via HTTP; Sun, 15 Oct 2006 14:35:09 PDT Date: Sun, 15 Oct 2006 14:35:09 -0700 (PDT) From: Mohan Srinivasan To: rick@snowhite.cis.uoguelph.ca, fs@freebsd.org In-Reply-To: <200610152126.RAA49191@snowhite.cis.uoguelph.ca> MIME-Version: 1.0 Content-Type: text/plain; charset=iso-8859-1 Content-Transfer-Encoding: 8bit Cc: mohan_srinivasan@yahoo.com Subject: Re: lost dotdot caching pessimizes nfs especially X-BeenThere: freebsd-fs@freebsd.org X-Mailman-Version: 2.1.5 Precedence: list List-Id: Filesystems List-Unsubscribe: , List-Archive: List-Post: List-Help: List-Subscribe: , X-List-Received-Date: Sun, 15 Oct 2006 21:35:14 -0000 SunOS and Solaris have provided close-to-open consistency for a very long time (for at least 15 years now). Not having the NFS client enforce close-to-open consistency will break a heck of a lot of applications. 
Since the other NFS clients (that matter) Solaris and Linux support it, I would argue that not supporting cto consistency is not really an option. We can however provide a mount option "nocto" (like those clients do) that overrides the default for specific cases (read only mounts, single client mounts etc). mohan --- rick@snowhite.cis.uoguelph.ca wrote: > > The clearing of the attrcache on nfs_open() is a requirement for close-to-open > > consistency, and this change fixed bugs that we saw internally relating to > > close-to-open consistency. > > I thought I'd just throw out some comments w.r.t. close-to-open consistency. > The concept comes from the Andrew File System (before Transarc's AFS), where > the client read the entire file upon Open and wrote the entire file to the > server upon Close, if it was modified. Therefore, other clients that opened > the file after the Close were guaranteed to see the changes. > > To the best of my knowledge, no NFS RFC has even required this behaviour. > It became common practice to flush writes to a server upon Close, so that > errors like ENOSPC could be returned by close(2) and a process could be > confident that the file was successfully saved if it didn't get an error > return from any write(2) syscall nor the subsequent close(2). > > As a side effect of the above behaviour (not required by RFC, but common > practice), NFS clients provided "approximate close-to-open consistency". > The "approximate" came from the fact that another client wouldn't notice > that the file had been modified until its attribute cache had timed out, > a few seconds after the writing client had flushed its writes upon close. > > Somewhere along the way, some people seem to have decided that > close-to-open consistency is required of NFS clients. I think the Linux crowd > is in that camp? 
> > Since NFS doesn't have a cache coherency protocol (even for NFSv4, although > the caching rules are somewhat more explicit in RFC3530), it is always > a performance<->consistency tradeoff. > > So, I guess you guys will have to decide, rick > ps: I do believe software that expects strict close-to-open consistency > over NFS is not correct, because that is not a requirement of the RFCs. > From owner-freebsd-fs@FreeBSD.ORG Mon Oct 16 06:30:02 2006 Return-Path: X-Original-To: fs@freebsd.org Delivered-To: freebsd-fs@FreeBSD.ORG Received: from mx1.FreeBSD.org (mx1.freebsd.org [216.136.204.125]) by hub.freebsd.org (Postfix) with ESMTP id 48A0116A47C; Mon, 16 Oct 2006 06:30:02 +0000 (UTC) (envelope-from bde@zeta.org.au) Received: from mailout1.pacific.net.au (mailout1-3.pacific.net.au [61.8.2.210]) by mx1.FreeBSD.org (Postfix) with ESMTP id 5BAE743D5E; Mon, 16 Oct 2006 06:30:01 +0000 (GMT) (envelope-from bde@zeta.org.au) Received: from mailproxy2.pacific.net.au (mailproxy2.pacific.net.au [61.8.2.163]) by mailout1.pacific.net.au (Postfix) with ESMTP id D3A99328DA5; Mon, 16 Oct 2006 16:29:59 +1000 (EST) Received: from katana.zip.com.au (katana.zip.com.au [61.8.7.246]) by mailproxy2.pacific.net.au (Postfix) with ESMTP id 90A642741E; Mon, 16 Oct 2006 16:29:58 +1000 (EST) Date: Mon, 16 Oct 2006 16:29:57 +1000 (EST) From: Bruce Evans X-X-Sender: bde@delplex.bde.org To: Mohan Srinivasan In-Reply-To: <20061015205923.49804.qmail@web30813.mail.mud.yahoo.com> Message-ID: <20061016130540.C63585@delplex.bde.org> References: <20061015205923.49804.qmail@web30813.mail.mud.yahoo.com> MIME-Version: 1.0 Content-Type: TEXT/PLAIN; charset=US-ASCII; format=flowed Cc: fs@freebsd.org, mohans@freebsd.org Subject: Re: lost dotdot caching pessimizes nfs especially X-BeenThere: freebsd-fs@freebsd.org X-Mailman-Version: 2.1.5 Precedence: list List-Id: Filesystems List-Unsubscribe: , List-Archive: List-Post: List-Help: List-Subscribe: , X-List-Received-Date: Mon, 16 Oct 2006 06:30:02 -0000 On Sun, 
15 Oct 2006, Mohan Srinivasan wrote:

> The clearing of the attrcache on nfs_open() is a requirement for close-to-open
> consistency, and this change fixed bugs that we saw internally relating to
> close-to-open consistency.
>
>> and associated changes give silly behaviour that almost doubles the
>> number of Access RPCs. One of the associated changes clears n_attrstamp
>> on close(). Then on open(), since lookup() is called before the above
>> is reached, nfs_access_otw() has always just been called, and the above
> ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
>> forces another call.
>
> That is not true with NFSv2 which doesn't have an access call, in which case
> nfsspec_access() calls VOP_GETATTR, which may or may not go over the wire.
>
> Also, what would happen with NFSv3 if we get an access cache hit ?

I didn't think about NFSv2 or check the details for NFSv3 until now. It is nfs_lookup() that always calls VOP_GETATTR(), and VOP_GETATTR() must go on the wire in the case being described (lookup after close) since we flushed the attribute cache entry in nfs_close(). The difference for v2 is that nfs_getattr() normally uses a Getattr request for v2 and an Access request for v3.

For NFSv3, nfs_lookup()'s behaviour is correct, but it is not as good as it could easily be for the attribute cache. In nfs_lookup() after a recent close(), in the usual cases all caches are hit except that we just cleared the attribute cache, so nfs_lookup() does the following:

    VOP_ACCESS();     # Cache hit.  Access granted.
    cache_lookup();   # Positive cache hit.
    VOP_GETATTR();    # Cache miss.  Succeeds.
    # Now we have fresh attributes in the v3 case, but we granted access
    # based on the old attributes, so we unnecessarily lost full
    # open/close consistency.

In unusual cases, there is an access cache miss. Then for v3, VOP_ACCESS() refreshes the attribute cache too, VOP_GETATTR() is a cache hit, and there is full open/close consistency.
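
The cost being argued about, and the effect of the proposed "lookup already went over the wire" flag, can be sketched with a small simulation. Everything here is an assumption made for illustration: `sim_node`, `fresh_from_lookup` and the `sim_*` functions are hypothetical, and the model collapses Access and Getattr requests and the separate access cache into a single "over-the-wire attribute fetch" counter. It is not the FreeBSD implementation.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical per-file client state; a stand-in for the real nfsnode. */
struct sim_node {
    bool attrs_valid;       /* models n_attrstamp != 0 */
    bool fresh_from_lookup; /* hypothetical flag: lookup just went over the wire */
    int  otw_calls;         /* simulated over-the-wire attribute fetches */
};

/* close(): the change under discussion invalidates the cached attributes. */
static void sim_close(struct sim_node *np)
{
    np->attrs_valid = false;
    np->fresh_from_lookup = false;
}

/* lookup(): fetches attributes, going over the wire only on a cache miss. */
static void sim_lookup(struct sim_node *np)
{
    np->fresh_from_lookup = false;
    if (!np->attrs_valid) {
        np->otw_calls++;
        np->attrs_valid = true;
        np->fresh_from_lookup = true;
    }
}

/* open(): forces fresh attributes unless told that lookup() just fetched them. */
static void sim_open(struct sim_node *np, bool honor_flag)
{
    if (honor_flag && np->fresh_from_lookup) {
        np->fresh_from_lookup = false; /* consume the flag */
        return;
    }
    np->otw_calls++;                   /* clear the cache and refetch */
    np->attrs_valid = true;
}

/* Count over-the-wire fetches for one close(); lookup(); open() cycle. */
static int sim_reopen_cost(bool honor_flag)
{
    struct sim_node n = { true, false, 0 };

    sim_close(&n);
    sim_lookup(&n);
    sim_open(&n, honor_flag);
    return n.otw_calls;
}
```

Under these assumptions a close/lookup/open cycle costs two fetches without the flag and one with it. When lookup() is satisfied from a still-valid cache, the flag is not set and open() fetches fresh attributes anyway, so the consistency guarantee is kept in the model.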
> If lookup() can be made to pass a flag into nfs_open() that an otw getattr was > done, then we can eliminate the clearing of the attrcache in nfs_open(). But > absent that flag, I don't see how you can eliminate the fetch of fresh attributes > in nfs_open(). Of course something like such a flag is needed. See my previous mail for more details (there should be another flag for nfs_lookup() so that the entire open() is consistent). For nfs_open(), I was thinking more of a generation count. Now I wonder about exclusive locking and blockages. VOP_OPEN() is now exclusively locked, but I don't know if the same lock covers the lookup. With exclusive locking, not even a flag is needed. Without exclusive locking, blocking might be a problem. Bruce From owner-freebsd-fs@FreeBSD.ORG Mon Oct 16 06:41:11 2006 Return-Path: X-Original-To: fs@freebsd.org Delivered-To: freebsd-fs@FreeBSD.ORG Received: from mx1.FreeBSD.org (mx1.freebsd.org [216.136.204.125]) by hub.freebsd.org (Postfix) with ESMTP id 5EAAC16A40F for ; Mon, 16 Oct 2006 06:41:11 +0000 (UTC) (envelope-from bde@zeta.org.au) Received: from mailout1.pacific.net.au (mailout1-3.pacific.net.au [61.8.2.210]) by mx1.FreeBSD.org (Postfix) with ESMTP id F107A43D53 for ; Mon, 16 Oct 2006 06:41:10 +0000 (GMT) (envelope-from bde@zeta.org.au) Received: from mailproxy1.pacific.net.au (mailproxy1.pacific.net.au [61.8.2.162]) by mailout1.pacific.net.au (Postfix) with ESMTP id 3CCBE69FDAF; Mon, 16 Oct 2006 16:41:10 +1000 (EST) Received: from katana.zip.com.au (katana.zip.com.au [61.8.7.246]) by mailproxy1.pacific.net.au (Postfix) with ESMTP id 57F858C3A; Mon, 16 Oct 2006 16:41:09 +1000 (EST) Date: Mon, 16 Oct 2006 16:41:08 +1000 (EST) From: Bruce Evans X-X-Sender: bde@delplex.bde.org To: Mohan Srinivasan In-Reply-To: <20061015213509.60929.qmail@web30813.mail.mud.yahoo.com> Message-ID: <20061016163015.C63585@delplex.bde.org> References: <20061015213509.60929.qmail@web30813.mail.mud.yahoo.com> MIME-Version: 1.0 Content-Type: TEXT/PLAIN; 
charset=US-ASCII; format=flowed Cc: fs@freebsd.org, rick@snowhite.cis.uoguelph.ca Subject: Re: lost dotdot caching pessimizes nfs especially X-BeenThere: freebsd-fs@freebsd.org X-Mailman-Version: 2.1.5 Precedence: list List-Id: Filesystems List-Unsubscribe: , List-Archive: List-Post: List-Help: List-Subscribe: , X-List-Received-Date: Mon, 16 Oct 2006 06:41:11 -0000 On Sun, 15 Oct 2006, Mohan Srinivasan wrote: > of applications. Since the other NFS clients (that matter) Solaris and Linux support it, > I would argue that not supporting cto consistency is not really an option. I agree. > We can however provide a mount option "nocto" (like those clients do) that overrides > the default for specific cases (read only mounts, single client mounts etc). PR 78673 has a patch to break consistency unconditionally for r/o mounts. I use this, but it doesn't help for my most active file system (/usr/obj) since that needs to be r/w. It is obviously wrong to do this unconditionally on the client. It is the server's read-onlyness that matters. I don't know how to track the server's read-onlyness short of asking it on every open() or Access. 
Bruce From owner-freebsd-fs@FreeBSD.ORG Mon Oct 16 15:32:26 2006 Return-Path: X-Original-To: fs@freebsd.org Delivered-To: freebsd-fs@FreeBSD.ORG Received: from mx1.FreeBSD.org (mx1.freebsd.org [216.136.204.125]) by hub.freebsd.org (Postfix) with ESMTP id 76A0F16A403 for ; Mon, 16 Oct 2006 15:32:26 +0000 (UTC) (envelope-from rick@snowhite.cis.uoguelph.ca) Received: from dargo.cs.uoguelph.ca (dargo.cs.uoguelph.ca [131.104.94.197]) by mx1.FreeBSD.org (Postfix) with ESMTP id C58B443D76 for ; Mon, 16 Oct 2006 15:32:24 +0000 (GMT) (envelope-from rick@snowhite.cis.uoguelph.ca) Received: from snowhite.cis.uoguelph.ca (snowhite.cis.uoguelph.ca [131.104.48.1]) by dargo.cs.uoguelph.ca (8.13.1/8.13.1) with ESMTP id k9GFWLtJ013264; Mon, 16 Oct 2006 11:32:21 -0400 Received: (from rick@localhost) by snowhite.cis.uoguelph.ca (8.9.3/8.9.3) id LAA59652; Mon, 16 Oct 2006 11:32:36 -0400 (EDT) Date: Mon, 16 Oct 2006 11:32:36 -0400 (EDT) From: rick@snowhite.cis.uoguelph.ca Message-Id: <200610161532.LAA59652@snowhite.cis.uoguelph.ca> To: fs@freebsd.org X-Scanned-By: MIMEDefang 2.52 on 131.104.94.197 Cc: mohan_srinivasan@yahoo.com Subject: Re: lost dotdot caching pessimizes nfs especially X-BeenThere: freebsd-fs@freebsd.org X-Mailman-Version: 2.1.5 Precedence: list List-Id: Filesystems List-Unsubscribe: , List-Archive: List-Post: List-Help: List-Subscribe: , X-List-Received-Date: Mon, 16 Oct 2006 15:32:26 -0000 > SunOS and Solaris have provided close-to-open consistency for a very long time (for > at least 15 years now). I believe you (and I understand the principle "if Solaris does X, that's the way it needs to be"). Until very recently, Solaris sources weren't open and I didn't have access to them. That was important in the "bad old days", since Sun's Legal Beagles did once send a threatening letter w.r.t. my NFS violating their proprietary... 
I'd argue that these days it's "if Linux does X, that's the way it needs to be".:-)

> Not having the NFS client enforce close-to-open consistency will break a heck of a lot
> of applications. Since the other NFS clients (that matter) Solaris and Linux support it,
> I would argue that not supporting cto consistency is not really an option.

I am a bit surprised that a lot of applications break. To do so, they must be running on multiple clients, read/write sharing the same NFS mounted file(s) and use some back-end protocol that says something like "I've just closed it, so you can now open it", so that inconsistencies lasting less than a minute cause problems. I wish those applications had been common a decade ago because the NFS community might have cared about cache consistency and something along the lines of my experimental NQNFS might have happened.

NFSv4 simply says that clients that care about cache consistency should use byte range locking. Here's what RFC3530 (the NFSv4 RFC) says:

at top of Page 14:
   If an application wants to serialize access to file data, file locking of the file data ranges in question should be used.

Admittedly the paragraph that precedes this almost says what Solaris is doing and seems to contradict Sec 9:

   9. Client-Side Caching

   Client-side caching of data, of file attributes, and of file names is essential to providing good performance with the NFS protocol. Providing distributed cache coherence is a difficult problem and previous versions of the NFS protocol have not attempted it. Instead, several NFS client implementation techniques have been used to reduce the problems that a lack of coherence poses for users. These techniques have not been clearly defined by earlier protocol specifications and it is often unclear what is valid or invalid client behavior.

   The NFS version 4 protocol uses many techniques similar to those that have been used in previous protocol versions. The NFS version 4 protocol does not provide distributed cache coherence. However, it defines a more limited set of caching guarantees to allow locks and share reservations to be used without destructive interference from client side caching.

   In addition, the NFS version 4 protocol introduces a delegation mechanism which allows many decisions normally made by the server to be made locally by clients. This mechanism provides efficient support of the common cases where sharing is infrequent or where sharing is read-only.

Once clients figure out how to effectively use Delegations, I think there will be significant performance improvements. Unfortunately, we haven't gotten there yet.

For NFSv2/3, maintaining close-to-open consistency may be appropriate (and necessary), but it can result in a big performance hit. For example, try an experiment where you turn off "push writes on close" in the code and see what effect that has on performance for the common, non-read/write shared files case. (nb: I don't think you can get away with doing this without some other cache consistency guarantees, but it would sure be nice from a performance point of view. That was what nqnfs was all about. If you are really bored, you can go to http://www.cis.uoguelph.ca/~nfsv4 and read my ancient nqnfs paper.)
rick From owner-freebsd-fs@FreeBSD.ORG Mon Oct 16 21:42:53 2006 Return-Path: X-Original-To: freebsd-fs@freebsd.org Delivered-To: freebsd-fs@freebsd.org Received: from mx1.FreeBSD.org (mx1.freebsd.org [216.136.204.125]) by hub.freebsd.org (Postfix) with ESMTP id 5E88516A417 for ; Mon, 16 Oct 2006 21:42:53 +0000 (UTC) (envelope-from lgusenet@be-well.ilk.org) Received: from mail6.sea5.speakeasy.net (mail6.sea5.speakeasy.net [69.17.117.8]) by mx1.FreeBSD.org (Postfix) with ESMTP id 3629843D67 for ; Mon, 16 Oct 2006 21:42:35 +0000 (GMT) (envelope-from lgusenet@be-well.ilk.org) Received: (qmail 10451 invoked from network); 16 Oct 2006 21:42:35 -0000 Received: from dsl092-078-145.bos1.dsl.speakeasy.net (HELO be-well.ilk.org) ([66.92.78.145]) (envelope-sender ) by mail6.sea5.speakeasy.net (qmail-ldap-1.03) with SMTP for ; 16 Oct 2006 21:42:35 -0000 Received: by be-well.ilk.org (Postfix, from userid 1147) id CC7632842E; Mon, 16 Oct 2006 17:42:33 -0400 (EDT) To: freebsd-fs@freebsd.org To: absorbb@gmail.com (=?utf-8?B?0JjQu9GM0LTQsNGAINCd0YPRgNC40YHQu9Cw0Lw=?= =?utf-8?B?0L7Qsg==?=) References: <200609292159.18282.absorbb@gmail.com> From: Lowell Gilbert Date: Mon, 16 Oct 2006 17:42:33 -0400 In-Reply-To: <200609292159.18282.absorbb@gmail.com> (=?utf-8?B?0JjQu9GM?= =?utf-8?B?0LTQsNGAINCd0YPRgNC40YHQu9Cw0LzQvtCyJ3M=?= message of "Fri, 29 Sep 2006 21:59:17 +0400") Message-ID: <44d58sos3a.fsf@be-well.ilk.org> User-Agent: Gnus/5.11 (Gnus v5.11) Emacs/22.0.50 (berkeley-unix) MIME-Version: 1.0 Content-Type: text/plain; charset=utf-8 Content-Transfer-Encoding: quoted-printable Cc: Subject: Re: ntfs broken when share through samba3 X-BeenThere: freebsd-fs@freebsd.org X-Mailman-Version: 2.1.5 Precedence: list Reply-To: freebsd-fs@freebsd.org List-Id: Filesystems List-Unsubscribe: , List-Archive: List-Post: List-Help: List-Subscribe: , X-List-Received-Date: Mon, 16 Oct 2006 21:42:53 -0000

absorbb@gmail.com (Ильдар Нурисламов) writes:

> This old already reported bug.
> But situation have'nt changed. For example, kern/86965.
> There is very simple patch that fix this bug:
>
> --- usr/src/sys/fs/ntfs/ntfs_vnops.c Mon Mar 13 00:50:01 2006
> +++ home/voxel/stuff/ntfs_vnops.c Thu Aug 31 09:22:08 2006
> @@ -187,7 +187,8 @@
>   vap->va_fsid = dev2udev(ip->i_dev);
>   vap->va_fileid = ip->i_number;
>   vap->va_mode = ip->i_mp->ntm_mode;
> - vap->va_nlink = ip->i_nlink;
> + vap->va_nlink = (ip->i_nlink ? ip->i_nlink : 1);
> + //vap->va_nlink = ip->i_nlink;
>   vap->va_uid = ip->i_mp->ntm_uid;
>   vap->va_gid = ip->i_mp->ntm_gid;
>   vap->va_rdev = 0; /* XXX UNODEV ? */
>
> but it seems to be not beaty solution

Not beautiful, indeed.

I was playing around with this, and although that change would work around the problem in (at least) most cases, I am not sure that it is truly correct. I am not an expert at filesystems, and certainly have little knowledge of NTFS. However, my observations confuse me considerably.

The main issue is that if you read from a file (on NTFS, with a link count of zero according to ls(1)), the link count becomes populated. I cannot see how that would happen, because the ntnode structure link count is not modified except when reading the whole structure from the disk, and the on-disk node is not being changed.

To confuse things further, the link count is changed to 2, not 1, on ordinary files that have only a single directory entry. I do not believe that streams are at issue, because the file has no open file descriptors remaining according to fstat(1).

Any thoughts from the experts? Might Darwin have any useful hints?
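
The workaround in the quoted patch amounts to never reporting a link count of zero. A minimal sketch of that rule as a standalone function follows; `report_nlink()` is a hypothetical name used only for illustration and is not part of the ntfs code. As noted above, this hides the zero rather than explaining where it comes from.

```c
#include <assert.h>

/* Hypothetical helper: report at least one link for a file that exists. */
static unsigned int report_nlink(unsigned int stored_nlink)
{
    return stored_nlink != 0 ? stored_nlink : 1;
}
```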
From owner-freebsd-fs@FreeBSD.ORG Mon Oct 16 22:05:48 2006 Return-Path: X-Original-To: fs@freebsd.org Delivered-To: freebsd-fs@FreeBSD.ORG Received: from mx1.FreeBSD.org (mx1.freebsd.org [216.136.204.125]) by hub.freebsd.org (Postfix) with ESMTP id 143DB16A51A for ; Mon, 16 Oct 2006 22:05:48 +0000 (UTC) (envelope-from mohan_srinivasan@yahoo.com) Received: from web30804.mail.mud.yahoo.com (web30804.mail.mud.yahoo.com [68.142.200.147]) by mx1.FreeBSD.org (Postfix) with SMTP id EAC0D43D95 for ; Mon, 16 Oct 2006 22:05:31 +0000 (GMT) (envelope-from mohan_srinivasan@yahoo.com) Received: (qmail 49387 invoked by uid 60001); 16 Oct 2006 22:05:31 -0000 DomainKey-Signature: a=rsa-sha1; q=dns; c=nofws; s=s1024; d=yahoo.com; h=Message-ID:Received:Date:From:Subject:To:Cc:In-Reply-To:MIME-Version:Content-Type:Content-Transfer-Encoding; b=xc6p1XFyVmMpPrOVhh7bgGqZpUDiOt26igmfMqAQ1AlPWxGJEL7+2dVTwDKCZwew9pJKSKFl72s16vGPwH44DY9+YpipSarrbd71OiwJ31ib3yqYyNpuZrWEQ2cDeOYyNNpKcJ9GlY3q0CZ82kwOBUpgSoyfCu9PXiDGVvBfMcw= ; Message-ID: <20061016220531.49385.qmail@web30804.mail.mud.yahoo.com> Received: from [207.126.239.39] by web30804.mail.mud.yahoo.com via HTTP; Mon, 16 Oct 2006 15:05:31 PDT Date: Mon, 16 Oct 2006 15:05:31 -0700 (PDT) From: Mohan Srinivasan To: rick@snowhite.cis.uoguelph.ca, fs@freebsd.org In-Reply-To: <200610161532.LAA59652@snowhite.cis.uoguelph.ca> MIME-Version: 1.0 Content-Type: text/plain; charset=iso-8859-1 Content-Transfer-Encoding: 8bit Cc: mohan_srinivasan@yahoo.com Subject: Re: lost dotdot caching pessimizes nfs especially X-BeenThere: freebsd-fs@freebsd.org X-Mailman-Version: 2.1.5 Precedence: list List-Id: Filesystems List-Unsubscribe: , List-Archive: List-Post: List-Help: List-Subscribe: , X-List-Received-Date: Mon, 16 Oct 2006 22:05:48 -0000 Hi Rick --- rick@snowhite.cis.uoguelph.ca wrote: > > SunOS and Solaris have provided close-to-open consistency for a very long time (for > > at least 15 years now). 
> > I believe you (and I understand the principal "if Solaris does X, that's the > way it needs to be"). Until very recently, Solaris sources weren't open and > I didn't have access to them. That was important in the "bad old days", since > Sun's Legal Beagles did once send a threatening letter w.r.t. my NFS > violating their proprietary... > > I'd argue that these days it's "if Linux does X, that's the way it needs > to be".:-) Probably so. But to my mind, Solaris still has the most robust NFS client implementation out there (I have no affiliation with Sun whatsoever), which is why I mentioned Solaris. I have not looked at Linux wrt cto consistency. > > Not having the NFS client enforce close-to-open consistency will break a heck of a lot > > of applications. Since the other NFS clients (that matter) Solaris and Linux support it, > > I would argue that not supporting cto consistency is not really an option. > > I am a bit surprised that a lot of applications break. To do so, they must > be running on multiple clients, read/write sharing the same NFS mounted file(s) > and use some back-end protocol that says something like "I've just closed it, > so you can now open it" so inconsistencies < 1 minute, causes problems. Such applications are very common. 1) An application where multiple clients can do something like this : Acquire a file lock; open(); Do I/O close(); Drop the file lock; Without cto consistency, there's no way this is going to work. That is what a large application where I am employed does. And I would think that would be very common case elsewhere too. (You can replace the file lock with a byte range lock post-open(), but that won't change the result). 2) Without cto consistency, something as simple as editing a file on one client and compiling it on another won't work anymore. 
Breaking this is sure to send people with pitchforks running after the perpetrator :) > NFSv4 simply > says that clients that care about cache consistency should use byte > range locking. Here's what RFC3530 (the NFSv4 RFC) says: I don't know anything about NFSv4 (or Delegations). I figured that by the time NFSv4 gains wide acceptance, I'd be happily retired :), so I haven't bothered. I am mostly interested in a robust NFSv3 implementation :) At least in the context of NFSv2/3, byte range locking is a necessary but not sufficient condition for correctness. For correctness, you require Byte Range Locking + Direct IO (or some equivalent that bypasses client caching). > For NFSv2/3, maintaining close-to-open consistency may be appropriate > (and necessary), but it can result in a big performance hit. Cool. So we agree on that :) I agree there's a performance hit. Which is unavoidable. Best we can do is mitigate it with things like a nocto mount option, adding namei flags that NFS can set if it did an otw getattr from the lookup etc. 
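Pattern (1) from earlier in this message can be sketched in userland C roughly as follows. This is only an illustration under stated assumptions: the function name locked_append() and the use of a separate lock file are hypothetical, and a whole-file fcntl() lock stands in for whatever locking the real application uses. Nothing in the code itself provides cto consistency; it relies on the NFS client revalidating its cache at open() and flushing dirty data at close().

```c
#include <assert.h>
#include <fcntl.h>
#include <string.h>
#include <sys/types.h>
#include <unistd.h>

/*
 * Hypothetical helper: append one record to data_path while holding an
 * exclusive whole-file fcntl() lock on lock_path.
 * Returns 0 on success, -1 on any failure.
 */
int locked_append(const char *lock_path, const char *data_path,
    const char *record)
{
    struct flock fl;
    int lock_fd, data_fd, rc = -1;
    size_t len;

    assert(lock_path != NULL && data_path != NULL && record != NULL);

    lock_fd = open(lock_path, O_RDWR | O_CREAT, 0644);
    if (lock_fd < 0)
        return (-1);

    memset(&fl, 0, sizeof(fl));
    fl.l_type = F_WRLCK;        /* exclusive lock */
    fl.l_whence = SEEK_SET;
    fl.l_start = 0;
    fl.l_len = 0;               /* 0 means "to end of file" */

    /* 1. Acquire the file lock (blocks until it is granted). */
    if (fcntl(lock_fd, F_SETLKW, &fl) == 0) {
        /* 2. open(): a cto-consistent client revalidates its cache here. */
        data_fd = open(data_path, O_WRONLY | O_CREAT | O_APPEND, 0644);
        if (data_fd >= 0) {
            /* 3. Do I/O. */
            len = strlen(record);
            if (write(data_fd, record, len) == (ssize_t)len)
                rc = 0;
            /* 4. close(): a cto-consistent client flushes dirty data here. */
            if (close(data_fd) != 0)
                rc = -1;
        }
        /* 5. Drop the file lock. */
        fl.l_type = F_UNLCK;
        (void)fcntl(lock_fd, F_SETLK, &fl);
    }
    (void)close(lock_fd);
    return (rc);
}
```

If the client skipped the revalidation in step 2, the next lock holder on another machine could read or extend a stale cached copy even though the lock itself was honoured.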
mohan From owner-freebsd-fs@FreeBSD.ORG Mon Oct 16 23:59:01 2006 Return-Path: X-Original-To: fs@freebsd.org Delivered-To: freebsd-fs@FreeBSD.ORG Received: from mx1.FreeBSD.org (mx1.freebsd.org [216.136.204.125]) by hub.freebsd.org (Postfix) with ESMTP id 90BC416A412 for ; Mon, 16 Oct 2006 23:59:01 +0000 (UTC) (envelope-from andrew@areilly.bpa.nu) Received: from omta03sl.mx.bigpond.com (omta03sl.mx.bigpond.com [144.140.92.155]) by mx1.FreeBSD.org (Postfix) with ESMTP id 72BA143D5C for ; Mon, 16 Oct 2006 23:58:59 +0000 (GMT) (envelope-from andrew@areilly.bpa.nu) Received: from areilly.bpa.nu ([141.168.2.3]) by omta03sl.mx.bigpond.com with ESMTP id <20061016235857.VHUV8785.omta03sl.mx.bigpond.com@areilly.bpa.nu> for ; Mon, 16 Oct 2006 23:58:57 +0000 Received: (qmail 38750 invoked by uid 501); 16 Oct 2006 23:56:58 -0000 Date: Tue, 17 Oct 2006 09:56:58 +1000 From: Andrew Reilly To: Mohan Srinivasan Message-ID: <20061016235658.GA38613@duncan.reilly.home> References: <200610161532.LAA59652@snowhite.cis.uoguelph.ca> <20061016220531.49385.qmail@web30804.mail.mud.yahoo.com> Mime-Version: 1.0 Content-Type: text/plain; charset=us-ascii Content-Disposition: inline In-Reply-To: <20061016220531.49385.qmail@web30804.mail.mud.yahoo.com> User-Agent: Mutt/1.4.2.2i Cc: fs@freebsd.org, rick@snowhite.cis.uoguelph.ca Subject: Re: lost dotdot caching pessimizes nfs especially X-BeenThere: freebsd-fs@freebsd.org X-Mailman-Version: 2.1.5 Precedence: list List-Id: Filesystems List-Unsubscribe: , List-Archive: List-Post: List-Help: List-Subscribe: , X-List-Received-Date: Mon, 16 Oct 2006 23:59:01 -0000 On Mon, Oct 16, 2006 at 03:05:31PM -0700, Mohan Srinivasan wrote: > 2) Without cto consistency, something as simple as editing a file on one > client and compiling it on another won't work anymore. 
Breaking this is > sure to send people with pitchforks running after the perpetrator :) When I was doing lots of NFS-hosted development, years ago, I quickly learned to edit on the same machine as I was building on. X makes that easy. That was in the late-80s-early-90s time frame: I haven't had much use for NFS since then. That's changing now, though. That's not to say that this is the way that it should be. Just that it's the way that it always used to be. Cheers, -- Andrew From owner-freebsd-fs@FreeBSD.ORG Tue Oct 17 18:13:41 2006 Return-Path: X-Original-To: freebsd-fs@freebsd.org Delivered-To: freebsd-fs@freebsd.org Received: from mx1.FreeBSD.org (mx1.freebsd.org [216.136.204.125]) by hub.freebsd.org (Postfix) with ESMTP id 4ACFE16A403 for ; Tue, 17 Oct 2006 18:13:41 +0000 (UTC) (envelope-from mday@apple.com) Received: from mail-out3.apple.com (mail-out3.apple.com [17.254.13.22]) by mx1.FreeBSD.org (Postfix) with ESMTP id 54A2F43D5A for ; Tue, 17 Oct 2006 18:13:32 +0000 (GMT) (envelope-from mday@apple.com) Received: from relay8.apple.com (a17-128-113-38.apple.com [17.128.113.38]) by mail-out3.apple.com (8.12.11/8.12.11) with ESMTP id k9HIDWA2021194 for ; Tue, 17 Oct 2006 11:13:32 -0700 (PDT) Received: from [17.202.43.217] (unknown [17.202.43.217]) by relay8.apple.com (Apple SCV relay) with ESMTP id EAB69638 for ; Tue, 17 Oct 2006 11:13:31 -0700 (PDT) Message-Id: From: Mark Day To: freebsd-fs@freebsd.org In-Reply-To: <44d58sos3a.fsf@be-well.ilk.org> Content-Type: text/plain; charset=UTF-8; format=flowed; delsp=yes X-Smtp-Server: relay.apple.com Mime-Version: 1.0 (Apple Message framework v851) Content-Transfer-Encoding: quoted-printable Date: Tue, 17 Oct 2006 11:13:31 -0700 References: <200609292159.18282.absorbb@gmail.com> <44d58sos3a.fsf@be-well.ilk.org> X-Mailer: Apple Mail (2.851) X-Brightmail-Tracker: AAAAAA== Subject: Re: ntfs broken when share through samba3 X-BeenThere: freebsd-fs@freebsd.org X-Mailman-Version: 2.1.5 Precedence: list List-Id: Filesystems 
List-Unsubscribe: , List-Archive: List-Post: List-Help: List-Subscribe: , X-List-Received-Date: Tue, 17 Oct 2006 18:13:41 -0000

On Oct 16, 2006, at 2:42 PM, Lowell Gilbert wrote:

> absorbb@gmail.com (Ильдар Нурисламов) writes:
>
>> This old already reported bug.
>> But situation have'nt changed.
>
> For example, kern/86965.
>
>> There is very simple patch that fix this bug:
>>
>> --- usr/src/sys/fs/ntfs/ntfs_vnops.c Mon Mar 13 00:50:01 2006
>> +++ home/voxel/stuff/ntfs_vnops.c Thu Aug 31 09:22:08 2006
>> @@ -187,7 +187,8 @@
>>      vap->va_fsid = dev2udev(ip->i_dev);
>>      vap->va_fileid = ip->i_number;
>>      vap->va_mode = ip->i_mp->ntm_mode;
>> -    vap->va_nlink = ip->i_nlink;
>> +    vap->va_nlink = (ip->i_nlink ? ip->i_nlink : 1);
>> +    //vap->va_nlink = ip->i_nlink;
>>      vap->va_uid = ip->i_mp->ntm_uid;
>>      vap->va_gid = ip->i_mp->ntm_gid;
>>      vap->va_rdev = 0; /* XXX UNODEV ? */
>>
>> but it seems to be not beaty solution
>
> Not beautiful, indeed.
>
> I was playing around with this, and although that change would work
> around the problem in (at least) most cases, I am not sure that it is
> truly correct.
>
> I am not an expert at filesystems, and certainly have little knowledge
> of NTFS. However, my observations confuse me considerably. The main
> issue is that if you read from a file (on NTFS, with a link count of
> zero according to ls(1)), the link count becomes populated. I cannot
> see how that would happen, because the ntnode structure link count is
> not modified except when reading the whole structure from the disk,
> and the on-disk node is not being changed. To confuse things further,
> the link count is changed to 2, not 1, on ordinary files that have
> only a single directory entry. I do not believe that streams are at
> issue, because the file has no open file descriptors remaining
> according to fstat(1).
IIRC, the NTFS code tries to populate a vnode based on the limited information present in the directory entries it sees. It's trying to avoid having to go read the Master File Table record (the i-node equivalent) until it actually needs that information (such as the link count). The ntfs_loadntnode() routine will read in the MFT record and populate the rest of the vnode's fields. There's a flag (VG_DONTLOADIN) to pass to ntfs_vgetex to control whether the MFT-based fields get filled in when getting the vnode.

Hope this helps,

-Mark

From owner-freebsd-fs@FreeBSD.ORG Tue Oct 17 21:09:50 2006 Return-Path: X-Original-To: freebsd-fs@freebsd.org Delivered-To: freebsd-fs@freebsd.org Received: from mx1.FreeBSD.org (mx1.freebsd.org [216.136.204.125]) by hub.freebsd.org (Postfix) with ESMTP id 7A08916A40F for ; Tue, 17 Oct 2006 21:09:50 +0000 (UTC) (envelope-from lgusenet@be-well.ilk.org) Received: from mail8.sea5.speakeasy.net (mail8.sea5.speakeasy.net [69.17.117.10]) by mx1.FreeBSD.org (Postfix) with ESMTP id 7559143D76 for ; Tue, 17 Oct 2006 21:09:43 +0000 (GMT) (envelope-from lgusenet@be-well.ilk.org) Received: (qmail 21923 invoked from network); 17 Oct 2006 21:09:43 -0000 Received: from dsl092-078-145.bos1.dsl.speakeasy.net (HELO be-well.ilk.org) ([66.92.78.145]) (envelope-sender ) by mail8.sea5.speakeasy.net (qmail-ldap-1.03) with SMTP for ; 17 Oct 2006 21:09:43 -0000 Received: by be-well.ilk.org (Postfix, from userid 1147) id 813BC28433; Tue, 17 Oct 2006 17:09:42 -0400 (EDT) To: freebsd-fs@freebsd.org References: <200609292159.18282.absorbb@gmail.com> <44d58sos3a.fsf@be-well.ilk.org> From: Lowell Gilbert Date: Tue, 17 Oct 2006 17:09:42 -0400 In-Reply-To: (Mark Day's message of "Tue, 17 Oct 2006 11:13:31 -0700") Message-ID: <44d58qmyy1.fsf@be-well.ilk.org> User-Agent: Gnus/5.11 (Gnus v5.11) Emacs/22.0.50 (berkeley-unix) MIME-Version: 1.0 Content-Type: text/plain; charset=utf-8 Content-Transfer-Encoding: quoted-printable Subject: Re: ntfs
broken when share through samba3 X-BeenThere: freebsd-fs@freebsd.org X-Mailman-Version: 2.1.5 Precedence: list List-Id: Filesystems List-Unsubscribe: , List-Archive: List-Post: List-Help: List-Subscribe: , X-List-Received-Date: Tue, 17 Oct 2006 21:09:50 -0000

mday@apple.com (Mark Day) writes:

> On Oct 16, 2006, at 2:42 PM, Lowell Gilbert wrote:
>
>> absorbb@gmail.com (Ильдар Нурисламов) writes:
>>
>>> This old already reported bug.
>>> But situation have'nt changed.
>>
>> For example, kern/86965.
>>
>>> There is very simple patch that fix this bug:
>>>
>>> --- usr/src/sys/fs/ntfs/ntfs_vnops.c Mon Mar 13 00:50:01 2006
>>> +++ home/voxel/stuff/ntfs_vnops.c Thu Aug 31 09:22:08 2006
>>> @@ -187,7 +187,8 @@
>>>      vap->va_fsid = dev2udev(ip->i_dev);
>>>      vap->va_fileid = ip->i_number;
>>>      vap->va_mode = ip->i_mp->ntm_mode;
>>> -    vap->va_nlink = ip->i_nlink;
>>> +    vap->va_nlink = (ip->i_nlink ? ip->i_nlink : 1);
>>> +    //vap->va_nlink = ip->i_nlink;
>>>      vap->va_uid = ip->i_mp->ntm_uid;
>>>      vap->va_gid = ip->i_mp->ntm_gid;
>>>      vap->va_rdev = 0; /* XXX UNODEV ? */
>>>
>>> but it seems to be not beaty solution
>>
>> Not beautiful, indeed.
>>
>> I was playing around with this, and although that change would work
>> around the problem in (at least) most cases, I am not sure that it is
>> truly correct.
>>
>> I am not an expert at filesystems, and certainly have little knowledge
>> of NTFS. However, my observations confuse me considerably. The main
>> issue is that if you read from a file (on NTFS, with a link count of
>> zero according to ls(1)), the link count becomes populated. I cannot
>> see how that would happen, because the ntnode structure link count is
>> not modified except when reading the whole structure from the disk,
>> and the on-disk node is not being changed.
To confuse things further,
>> the link count is changed to 2, not 1, on ordinary files that have
>> only a single directory entry. I do not believe that streams are at
>> issue, because the file has no open file descriptors remaining
>> according to fstat(1).
>
> IIRC, the NTFS code tries to populate a vnode based on the limited
> information present in the directory entries it sees. It's trying to
> avoid having to go read the Master File Table record (the i-node
> equivalent) until it actually needs that information (such as the link
> count). The ntfs_loadntnode() routine will read in the MFT record and
> populate the rest of the vnode's fields. There's a flag
> (VG_DONTLOADIN) to pass to ntfs_vgetex to control whether the MFT-based
> fields get filled in when getting the vnode.
>
> Hope this helps,

Yes, that clears things up for me considerably. Thank you, Mark.

One thing I can see from that is that the proper loading of the ntnode from the MFT will not be affected by faking a link count; the IN_LOADED flag will take care of that. Furthermore, using that same flag to add to this patch, so that a zero count in the MFT will not be ignored, will avoid the rest of my major concerns. The patch would end up more like:

--- usr/src/sys/fs/ntfs/ntfs_vnops.c Mon Mar 13 00:50:01 2006
+++ home/voxel/stuff/ntfs_vnops.c Thu Aug 31 09:22:08 2006
@@ -187,7 +187,7 @@
     vap->va_fsid = dev2udev(ip->i_dev);
     vap->va_fileid = ip->i_number;
     vap->va_mode = ip->i_mp->ntm_mode;
-    vap->va_nlink = ip->i_nlink;
+    vap->va_nlink = ((ip->i_nlink || (ip->i_flag & IN_LOADED)) ? ip->i_nlink : 1);
     vap->va_uid = ip->i_mp->ntm_uid;
     vap->va_gid = ip->i_mp->ntm_gid;
     vap->va_rdev = 0; /* XXX UNODEV ? */

This still doesn't meet POSIX requirements, but to do that would require reading the whole MFT entry every time, instead of just the directory entries. That optimization speeds things up a lot in large filename searches, so this seems like a good compromise to me.

Or am I missing something?
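To make the intent of that expression concrete, here is a small standalone sketch of the same rule outside the kernel. The struct, the function name and the TOY_IN_LOADED bit value are all made up for illustration (the real IN_LOADED definition lives in the ntfs headers and is not reproduced here); only the decision logic mirrors the patch above.

```c
#include <assert.h>
#include <stddef.h>

/* Placeholder flag bit; NOT the real IN_LOADED value from the ntfs headers. */
#define TOY_IN_LOADED 0x8000u

/* Cut-down stand-in for the ntnode fields the patch looks at. */
struct toy_ntnode {
    unsigned i_flag;   /* state flags; TOY_IN_LOADED set once the MFT record was read */
    unsigned i_nlink;  /* link count; 0 until loaded, or genuinely 0 in the MFT */
};

/*
 * Link count to report through getattr:
 *  - a non-zero count is always reported as-is;
 *  - a zero count is trusted only if the MFT record was actually loaded;
 *  - otherwise report 1 so callers do not treat the file as unlinked.
 */
unsigned toy_reported_nlink(const struct toy_ntnode *ip)
{
    assert(ip != NULL);
    if (ip->i_nlink != 0 || (ip->i_flag & TOY_IN_LOADED) != 0)
        return (ip->i_nlink);
    return (1);
}
```

For a node built only from a directory entry (flags 0, count 0) this yields 1; once the MFT record has been loaded, the on-disk count is reported unchanged, including a real zero.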
From owner-freebsd-fs@FreeBSD.ORG Wed Oct 18 06:40:51 2006 Return-Path: X-Original-To: fs@FreeBSD.org Delivered-To: freebsd-fs@FreeBSD.ORG Received: from mx1.FreeBSD.org (mx1.freebsd.org [216.136.204.125]) by hub.freebsd.org (Postfix) with ESMTP id 0184E16A403; Wed, 18 Oct 2006 06:40:51 +0000 (UTC) (envelope-from bde@zeta.org.au) Received: from mailout2.pacific.net.au (mailout2-3.pacific.net.au [61.8.2.226]) by mx1.FreeBSD.org (Postfix) with ESMTP id 4143043D55; Wed, 18 Oct 2006 06:40:50 +0000 (GMT) (envelope-from bde@zeta.org.au) Received: from mailproxy2.pacific.net.au (mailproxy2.pacific.net.au [61.8.2.163]) by mailout2.pacific.net.au (Postfix) with ESMTP id 6FFE06E826; Wed, 18 Oct 2006 16:40:46 +1000 (EST) Received: from katana.zip.com.au (katana.zip.com.au [61.8.7.246]) by mailproxy2.pacific.net.au (Postfix) with ESMTP id 73A132744E; Wed, 18 Oct 2006 16:40:44 +1000 (EST) Date: Wed, 18 Oct 2006 16:40:43 +1000 (EST) From: Bruce Evans X-X-Sender: bde@delplex.bde.org To: Chuck Lever In-Reply-To: <20061017113943.C67620@delplex.bde.org> Message-ID: <20061018153336.E72684@delplex.bde.org> References: <200610140725.k9E7PC37008454@repoman.freebsd.org> <20061014231502.GA38708@rink.nu> <20061015105809.M59123@delplex.bde.org> <20061015051044.GA42764@xor.obsecurity.org> <20061014222221.H97880@ns1.feral.com> <20061014222437.N4701@ns1.feral.com> <20061015153454.G59979@delplex.bde.org> <76bd70e30610150837w61689cf6ya2499d100a15c3e8@mail.gmail.com> <20061016164122.S63585@delplex.bde.org> <76bd70e30610160620x67e5d3a5j938c26744d0b9759@mail.gmail.com> <20061017113943.C67620@delplex.bde.org> MIME-Version: 1.0 Content-Type: TEXT/PLAIN; charset=US-ASCII; format=flowed Cc: mjacob@FreeBSD.org, fs@FreeBSD.org, Kris Kennaway Subject: negative cache hits for nfs (was: cvs commit: ...) 
X-BeenThere: freebsd-fs@freebsd.org X-Mailman-Version: 2.1.5 Precedence: list List-Id: Filesystems List-Unsubscribe: , List-Archive: List-Post: List-Help: List-Subscribe: , X-List-Received-Date: Wed, 18 Oct 2006 06:40:51 -0000

[I changed the Cc from cvs* to fs]

On Tue, 17 Oct 2006, Bruce Evans wrote:

> On Mon, 16 Oct 2006, Chuck Lever wrote:
>
>> On 10/16/06, Bruce Evans wrote:
>>> On Sun, 15 Oct 2006, Chuck Lever wrote:
>>>> [An independent timeout for the access cache isn't useful.]
>>>
>>> I'll try removing the special support for the access cache timeout in
>>> rc.conf first.
>>
>> OK. I can review patches if you think that would help, but I can't
>> contribute code at the moment because of IP issues at my current
>> employer. Hopefully that will change soon.
>
> Thanks. Removing it in rc.conf won't require review :-).

>> ...
>> Another thing to consider is that a LOOKUP is usually more expensive
>> for servers than a GETATTR. If your client has already cached lookup
>> results for the file to be opened, you can get away with a GETATTR on
>> the parent directory to verify that it has not changed, and that will
>> almost always be faster than doing a full LOOKUP.
>
> FreeBSD's client is doing not very good things for Lookup too. It is
> missing caching of negative lookups. make(1) likes to do a lot of
> negative lookups...

NetBSD fixed this in 1997, sigh.

Here is a merge of some bits from NetBSD for review. It is mostly the 1997 version, with updates to use timespecs instead of time_t's, but not updates to use changes that don't seem to be related to correctness, or ones less than 18 months old (if any).
% Index: nfs_vnops.c
% ===================================================================
% RCS file: /home/ncvs/src/sys/nfsclient/nfs_vnops.c,v
% retrieving revision 1.270
% diff -u -2 -r1.270 nfs_vnops.c
% --- nfs_vnops.c 14 Oct 2006 07:25:11 -0000 1.270
% +++ nfs_vnops.c 18 Oct 2006 01:41:14 -0000
% @@ -852,7 +869,17 @@
%          return (error);
%      }
% -    if ((error = cache_lookup(dvp, vpp, cnp)) && error != ENOENT) {
% +    if ((error = cache_lookup(dvp, vpp, cnp)) != 0) {
%          struct vattr vattr;
%
% +        if (error == ENOENT) {
% +            /* Negative cache hit. Use it unless stale. */
% +            if (VOP_GETATTR(dvp, &vattr, cnp->cn_cred, td) == 0 &&
% +                timespeccmp(&vattr.va_mtime, &np->n_nctime, ==))
% +                return (ENOENT);
% +
% +            cache_purge(dvp);
% +            timespecclear(&np->n_nctime);
% +            goto dorpc;
% +        }
%          newvp = *vpp;
%          if (!VOP_GETATTR(newvp, &vattr, cnp->cn_cred, td)
% @@ -871,4 +898,5 @@
%          *vpp = NULLVP;
%      }
% +dorpc:
%      error = 0;
%      newvp = NULLVP;
% @@ -951,4 +979,11 @@
%  nfsmout:
%      if (error) {
% +        if (error == ENOENT && (cnp->cn_flags & MAKEENTRY) &&
% +            cnp->cn_nameiop != CREATE) {
% +            /* Negative cache entry. */
% +            if (!timespecisset(&np->n_nctime))
% +                np->n_nctime = np->n_vattr.va_mtime;
% +            cache_enter(dvp, NULL, cnp);
% +        }
%          if (newvp != NULLVP) {
%              vput(newvp);
% @@ -1931,6 +1966,9 @@
%          if (newvp)
%              vput(newvp);
% -    } else
% +    } else {
% +        if (cnp->cn_flags & MAKEENTRY)
% +            cache_enter(dvp, newvp, cnp);
%          *ap->a_vpp = newvp;
% +    }
%      return (error);
%  }
% Index: nfsnode.h
% ===================================================================
% RCS file: /home/ncvs/src/sys/nfsclient/nfsnode.h,v
% retrieving revision 1.59
% diff -u -2 -r1.59 nfsnode.h
% --- nfsnode.h 13 Sep 2006 18:39:09 -0000 1.59
% +++ nfsnode.h 18 Oct 2006 00:48:44 -0000
% @@ -100,4 +100,5 @@
%      struct timespec n_mtime; /* Prev modify time. */
%      time_t n_ctime; /* Prev create time. */
% +    struct timespec n_nctime; /* Last neg cache entry (dir) */
%      time_t n_expiry; /* Lease expiry time */
%      nfsfh_t *n_fhp; /* NFS File Handle */

For building kernels, this gives a larger speedup than everything that I tried short of completely dropping cto consistency, provided dotdot caching in vfs_cache.c isn't lost. The following benchmarks are also with zapping of the attribute cache turned off in nfs_close() to avoid doubled Getattr's in open() without breaking cto consistency. Times and nfsstats are for the second run of "make depend; sync" and "make; sync" after "make clean cleandepend; sync; sleep 1" in each run, with RELENG_4 kernel sources, ~RELENG_5 userland and -current+ kernel, sources and obj and /usr on nfs, unloaded network latency 100uS, ...

Before:
       12.75 real         5.19 user         1.58 sys
Lookup   Read  Write Access Getattr  Other  Total
 14203    548    599  21561     454     97  37462
       78.80 real        62.01 user         4.45 sys
Lookup   Read  Write Create Access Fsstat  Other  Total
 19543   2410   5353    442  24241   1742     14  53745

After:
       10.68 real         5.20 user         1.42 sys
Lookup   Read  Write Access Getattr  Other  Total
  1268    548    599  21575     454    112  24556
       76.38 real        62.00 user         4.28 sys
Lookup   Read  Write Create Access Fsstat  Other  Total
  4122   2410   5353    442  24222   1750     14  38313

The number of Lookups has been reduced by a factor of 11+ for make -n and 5- for make.

With lost dotdot caching, After:
       11.02 real         5.25 user         1.40 sys
Lookup   Read  Write Access Getattr  Other  Total
  3031    548    599  21574     453    112  26317
       84.19 real        62.20 user         4.71 sys
Lookup   Read  Write Create Access Fsstat  Other  Total
 45063   2410   5353    442  24290   1750     14  79322

This does another 40000+ Lookups, much the same as Before, but ending up with 45000+ instead of 59000+.

Of course things are slower with cold caches, but they aren't much slower (less than what losing dotdot caching costs).

According to vfs.cache.numcache, there are fewer than 2000 files to look up, so even 4122 Lookups is a lot.
(vfs.cache.numcache was 1482 and vfs.cache.numneg was 103 after the run that produced the above statistics. This includes a few other files looked up since rebooting a few minutes earlier. I wasn't careful about starting from scratch without dotdot caching).

With cto consistency completely turned off, without lost dotdot caching, After:
        7.54 real         5.17 user         1.13 sys
Lookup   Read  Write Access Getattr  Other  Total
  1284    548    599   2161     454    112   5158
       72.67 real        61.95 user         3.82 sys
Lookup   Read  Write Create Access Fsstat  Other  Total
  4799   2410   5353    442   1529   1750     14  16297

This reduces the Access count by a factor of almost 18.

Now another pessimization is more obvious -- why should building kernels be doing all those Fsstat's? I haven't located them for sure, but opendir always calls statfs(2) for the silly purpose of determining whether it needs to do extra work to support unionfs.

For comparison:

From now on, kernel source and obj are on a local file system. nfs is still on /usr, so there are a few RPCs for execing things.

Before (without lost dotdot caching, with cto consistency):
        6.39 real         5.04 user         1.08 sys
Lookup Access  Other  Total
   865    651      3   1519
       66.86 real        61.42 user         4.45 sys
Lookup Access  Other  Total
  3115   1350      1   4466

Before (without lost dotdot caching, and _without_ cto consistency):
        6.17 real         5.25 user         0.88 sys
 Other  Total
    86     86
       66.61 real        61.54 user         4.74 sys
 Other  Total
    94     94

The differences that the nfs changes make with just /usr on nfs are measurable but tiny. At least in my configuration. All my executables are statically linked, else the extra opens for cto consistency of the shared libraries would give many more than 1350 Access RPCs.
Bruce From owner-freebsd-fs@FreeBSD.ORG Wed Oct 18 16:20:23 2006 Return-Path: X-Original-To: fs@freebsd.org Delivered-To: freebsd-fs@FreeBSD.ORG Received: from mx1.FreeBSD.org (mx1.freebsd.org [216.136.204.125]) by hub.freebsd.org (Postfix) with ESMTP id 8AF4616A40F for ; Wed, 18 Oct 2006 16:20:23 +0000 (UTC) (envelope-from chucklever@gmail.com) Received: from ug-out-1314.google.com (ug-out-1314.google.com [66.249.92.173]) by mx1.FreeBSD.org (Postfix) with ESMTP id 0228843D55 for ; Wed, 18 Oct 2006 16:20:12 +0000 (GMT) (envelope-from chucklever@gmail.com) Received: by ug-out-1314.google.com with SMTP id m2so189706uge for ; Wed, 18 Oct 2006 09:20:12 -0700 (PDT) DomainKey-Signature: a=rsa-sha1; q=dns; c=nofws; s=beta; d=gmail.com; h=received:message-id:date:from:to:subject:cc:in-reply-to:mime-version:content-type:content-transfer-encoding:content-disposition:references; b=EtG/n6vuRcapG4MZJ7GklWoR7PACska6J3SG2RY8lGRZd0UzCp/SBXJ+dmHEKxrTF6MQBmb0FiEKiVcQiuVj9xjFAwnLctfFTLFPz+CbtZLhnyaaBApGvcXO9FXWb7Nf35IhfraGxkL1tl/Vdga1qibI5786xFhUBKEtugT3rRY= Received: by 10.82.109.19 with SMTP id h19mr2350949buc; Wed, 18 Oct 2006 09:20:11 -0700 (PDT) Received: by 10.78.202.20 with HTTP; Wed, 18 Oct 2006 09:20:11 -0700 (PDT) Message-ID: <76bd70e30610180920m1918e84s8e1c3b0f02de712e@mail.gmail.com> Date: Wed, 18 Oct 2006 12:20:11 -0400 From: "Chuck Lever" To: "Bruce Evans" In-Reply-To: <20061018153336.E72684@delplex.bde.org> MIME-Version: 1.0 Content-Type: text/plain; charset=ISO-8859-1; format=flowed Content-Transfer-Encoding: 7bit Content-Disposition: inline References: <200610140725.k9E7PC37008454@repoman.freebsd.org> <20061015051044.GA42764@xor.obsecurity.org> <20061014222221.H97880@ns1.feral.com> <20061014222437.N4701@ns1.feral.com> <20061015153454.G59979@delplex.bde.org> <76bd70e30610150837w61689cf6ya2499d100a15c3e8@mail.gmail.com> <20061016164122.S63585@delplex.bde.org> <76bd70e30610160620x67e5d3a5j938c26744d0b9759@mail.gmail.com> <20061017113943.C67620@delplex.bde.org> 
<20061018153336.E72684@delplex.bde.org> Cc: mjacob@freebsd.org, fs@freebsd.org, Kris Kennaway Subject: Re: negative cache hits for nfs (was: cvs commit: ...) X-BeenThere: freebsd-fs@freebsd.org X-Mailman-Version: 2.1.5 Precedence: list List-Id: Filesystems List-Unsubscribe: , List-Archive: List-Post: List-Help: List-Subscribe: , X-List-Received-Date: Wed, 18 Oct 2006 16:20:23 -0000 Hi Bruce- On 10/18/06, Bruce Evans wrote: > Here is a merge of some bits from NetBSD for review. It is mostly the > 1997 version, with updates to use timespecs instead of time_t's, but > not updates to use changes that don't seem to be related to correctness, > or ones less than 18 months old (if any). > > % Index: nfs_vnops.c > % =================================================================== > % RCS file: /home/ncvs/src/sys/nfsclient/nfs_vnops.c,v > % retrieving revision 1.270 > % diff -u -2 -r1.270 nfs_vnops.c > % --- nfs_vnops.c 14 Oct 2006 07:25:11 -0000 1.270 > % +++ nfs_vnops.c 18 Oct 2006 01:41:14 -0000 > % @@ -852,7 +869,17 @@ > % return (error); > % } > % - if ((error = cache_lookup(dvp, vpp, cnp)) && error != ENOENT) { > % + if ((error = cache_lookup(dvp, vpp, cnp)) != 0) { > % struct vattr vattr; > % > % + if (error == ENOENT) { > % + /* Negative cache hit. Use it unless stale. */ > % + if (VOP_GETATTR(dvp, &vattr, cnp->cn_cred, td) == 0 && > % + timespeccmp(&vattr.va_mtime, &np->n_nctime, ==)) > % + return (ENOENT); > % + > % + cache_purge(dvp); > % + timespecclear(&np->n_nctime); > % + goto dorpc; > % + } > % newvp = *vpp; > % if (!VOP_GETATTR(newvp, &vattr, cnp->cn_cred, td) > % @@ -871,4 +898,5 @@ > % *vpp = NULLVP; > % } > % +dorpc: > % error = 0; > % newvp = NULLVP; > % @@ -951,4 +979,11 @@ > % nfsmout: > % if (error) { > % + if (error == ENOENT && (cnp->cn_flags & MAKEENTRY) && > % + cnp->cn_nameiop != CREATE) { > % + /* Negative cache entry. 
*/ > % + if (!timespecisset(&np->n_nctime)) > % + np->n_nctime = np->n_vattr.va_mtime; > % + cache_enter(dvp, NULL, cnp); > % + } > % if (newvp != NULLVP) { > % vput(newvp); > % @@ -1931,6 +1966,9 @@ > % if (newvp) > % vput(newvp); > % - } else > % + } else { > % + if (cnp->cn_flags & MAKEENTRY) > % + cache_enter(dvp, newvp, cnp); > % *ap->a_vpp = newvp; > % + } > % return (error); > % } > % Index: nfsnode.h > % =================================================================== > % RCS file: /home/ncvs/src/sys/nfsclient/nfsnode.h,v > % retrieving revision 1.59 > % diff -u -2 -r1.59 nfsnode.h > % --- nfsnode.h 13 Sep 2006 18:39:09 -0000 1.59 > % +++ nfsnode.h 18 Oct 2006 00:48:44 -0000 > % @@ -100,4 +100,5 @@ > % struct timespec n_mtime; /* Prev modify time. */ > % time_t n_ctime; /* Prev create time. */ > % + struct timespec n_nctime; /* Last neg cache entry (dir) */ > % time_t n_expiry; /* Lease expiry time */ > % nfsfh_t *n_fhp; /* NFS File Handle */ This looks reasonable, but there are always tricky corner cases for this kind of change. The reason many of these changes haven't been made sooner is because they usually break something else. Now that you've verified the positive performance impact of this change, you should start by testing this with the Connectathon test suite (both NFSv2 and NFSv3). Another useful test in this case is to observe how the client behaves in the face of replaced file objects. Use another client to remove and recreate a file that your test client already has cached, or in this case create first after your client has already cached the negative lookup. Another interesting test case is to enable READDIRPLUS on NFSv3 mounts to see if there is any strange interaction with your patch. Additionally, I would hold off on putting this in 6.2. Try it in CURRENT for a while, then MFC it after 6.3 branches. Have you reviewed the NetBSD change logs for any fixes in this area since 1997? 
> Now another pessimization is more obvious -- why should building kernels > be doing all those Fsstat's? I haven't located them for sure, but > that opendir always calls statfs(2) for the silly purpose of determining > whether it needs to do extra work to support unionfs. I've also noticed that behavior. Certainly the FreeBSD client can do a better job of caching the results of Fsstat. -- "We who cut mere stones must always be envisioning cathedrals" -- Quarry worker's creed From owner-freebsd-fs@FreeBSD.ORG Thu Oct 19 03:10:40 2006 Return-Path: X-Original-To: fs@freebsd.org Delivered-To: freebsd-fs@FreeBSD.ORG Received: from mx1.FreeBSD.org (mx1.freebsd.org [216.136.204.125]) by hub.freebsd.org (Postfix) with ESMTP id B0D8E16A40F for ; Thu, 19 Oct 2006 03:10:40 +0000 (UTC) (envelope-from email2mre-bsdfs@yahoo.com) Received: from web31807.mail.mud.yahoo.com (web31807.mail.mud.yahoo.com [68.142.207.70]) by mx1.FreeBSD.org (Postfix) with SMTP id 4C0DB43D55 for ; Thu, 19 Oct 2006 03:10:40 +0000 (GMT) (envelope-from email2mre-bsdfs@yahoo.com) Received: (qmail 84530 invoked by uid 60001); 19 Oct 2006 03:10:39 -0000 DomainKey-Signature: a=rsa-sha1; q=dns; c=nofws; s=s1024; d=yahoo.com; h=Message-ID:Received:Date:From:Reply-To:Subject:To:MIME-Version:Content-Type:Content-Transfer-Encoding; b=4DyWHvWQnJ1w6RB4GLOR/IoFBzySnWuZt0L0DbbkWiKg6cGaRD9JqWMkBgUMm9bUxUHvswFUJvr0zyZRP0vDmAHcdksq/E3fMhjUMWx/cv41iAQ/ISBiXZBTyCjHBIs0rms3CFFvsgMGhqnIJY2Z9SGjqr+DXglLwx1JTuNPja4= ; Message-ID: <20061019031039.84528.qmail@web31807.mail.mud.yahoo.com> Received: from [216.240.30.14] by web31807.mail.mud.yahoo.com via HTTP; Wed, 18 Oct 2006 20:10:39 PDT Date: Wed, 18 Oct 2006 20:10:39 -0700 (PDT) From: Mike Eisler To: fs@freebsd.org MIME-Version: 1.0 Content-Type: text/plain; charset=iso-8859-1 Content-Transfer-Encoding: 8bit Cc: Subject: Re: lost dotdot caching pessimizes nfs especially X-BeenThere: freebsd-fs@freebsd.org X-Mailman-Version: 2.1.5 Precedence: list Reply-To: 
email2mre-bsdfs@yahoo.com List-Id: Filesystems List-Unsubscribe: , List-Archive: List-Post: List-Help: List-Subscribe: , X-List-Received-Date: Thu, 19 Oct 2006 03:10:40 -0000

> Date: Mon, 16 Oct 2006 11:32:36 -0400 (EDT)
> From: rick@snowhite.cis.uoguelph.ca
> To: fs@freebsd.org
> Subject: Re: lost dotdot caching pessimizes nfs especially
> CC: mohan_srinivasan@yahoo.com

> the lines of my experimental NQNFS might have happened. NFSv4 simply
> says that clients that care about cache consistency should use byte

While RFC 3530 does not use the term "close-to-open consistency" (something that I'll address in the NFSv4.1 protocol), it does say:

   Furthermore, in the absence of open delegation (see the section "Open Delegation") two additional rules apply. Note that these rules are obeyed in practice by many NFS version 2 and version 3 clients.

   o  First, cached data present on a client must be revalidated after doing an OPEN. Revalidating means that the client fetches the change attribute from the server, compares it with the cached change attribute, and if different, declares the cached data (as well as the cached attributes) as invalid. This is to ensure that the data for the OPENed file is still correctly reflected in the client's cache. This validation must be done at least when the client's OPEN operation includes DENY=WRITE or BOTH thus terminating a period in which other clients may have had the opportunity to open the file with WRITE access. Clients may choose to do the revalidation more often (i.e., at OPENs specifying DENY=NONE) to parallel the NFS version 3 protocol's practice for the benefit of users assuming this degree of cache revalidation.

      Since the change attribute is updated for data and metadata modifications, some client implementors may be tempted to use the time_modify attribute and not change to validate cached data, so that metadata changes do not spuriously invalidate clean data. The implementor is cautioned in this approach.
The change attribute is guaranteed to change for each update to the file, whereas time_modify is guaranteed to change only at the granularity of the time_delta attribute. Use by the client's data cache validation logic of time_modify and not change runs the risk of the client incorrectly marking stale data as valid. o Second, modified data must be flushed to the server before closing a file OPENed for write. This is complementary to the first rule. If the data is not flushed at CLOSE, the revalidation done after client OPENs as file is unable to achieve its purpose. From owner-freebsd-fs@FreeBSD.ORG Thu Oct 19 11:10:10 2006 Return-Path: X-Original-To: fs@freebsd.org Delivered-To: freebsd-fs@FreeBSD.ORG Received: from mx1.FreeBSD.org (mx1.freebsd.org [216.136.204.125]) by hub.freebsd.org (Postfix) with ESMTP id 1397F16A40F; Thu, 19 Oct 2006 11:10:10 +0000 (UTC) (envelope-from bde@zeta.org.au) Received: from mailout1.pacific.net.au (mailout1-3.pacific.net.au [61.8.2.210]) by mx1.FreeBSD.org (Postfix) with ESMTP id 3798843D58; Thu, 19 Oct 2006 11:10:09 +0000 (GMT) (envelope-from bde@zeta.org.au) Received: from mailproxy1.pacific.net.au (mailproxy1.pacific.net.au [61.8.2.162]) by mailout1.pacific.net.au (Postfix) with ESMTP id 7516D5AFFB2; Thu, 19 Oct 2006 21:10:07 +1000 (EST) Received: from katana.zip.com.au (katana.zip.com.au [61.8.7.246]) by mailproxy1.pacific.net.au (Postfix) with ESMTP id 98D938C08; Thu, 19 Oct 2006 21:10:04 +1000 (EST) Date: Thu, 19 Oct 2006 21:10:03 +1000 (EST) From: Bruce Evans X-X-Sender: bde@delplex.bde.org To: Chuck Lever In-Reply-To: <76bd70e30610180920m1918e84s8e1c3b0f02de712e@mail.gmail.com> Message-ID: <20061019193110.I77123@delplex.bde.org> References: <200610140725.k9E7PC37008454@repoman.freebsd.org> <20061015051044.GA42764@xor.obsecurity.org> <20061014222221.H97880@ns1.feral.com> <20061014222437.N4701@ns1.feral.com> <20061015153454.G59979@delplex.bde.org> <76bd70e30610150837w61689cf6ya2499d100a15c3e8@mail.gmail.com> 
<20061016164122.S63585@delplex.bde.org> <76bd70e30610160620x67e5d3a5j938c26744d0b9759@mail.gmail.com> <20061017113943.C67620@delplex.bde.org> <20061018153336.E72684@delplex.bde.org> <76bd70e30610180920m1918e84s8e1c3b0f02de712e@mail.gmail.com> MIME-Version: 1.0 Content-Type: TEXT/PLAIN; charset=US-ASCII; format=flowed Cc: mjacob@freebsd.org, fs@freebsd.org, Kris Kennaway Subject: Re: negative cache hits for nfs (was: cvs commit: ...) X-BeenThere: freebsd-fs@freebsd.org X-Mailman-Version: 2.1.5 Precedence: list List-Id: Filesystems List-Unsubscribe: , List-Archive: List-Post: List-Help: List-Subscribe: , X-List-Received-Date: Thu, 19 Oct 2006 11:10:10 -0000

On Wed, 18 Oct 2006, Chuck Lever wrote:

> On 10/18/06, Bruce Evans wrote:
>> Here is a merge of some bits from NetBSD for review. It is mostly the
>> 1997 version, with updates to use timespecs instead of time_t's, but
>> not updates to use changes that don't seem to be related to correctness,
>> or ones less than 18 months old (if any).
>> ...
>> % +		if (error == ENOENT) {
>> % +			/* Negative cache hit. Use it unless stale. */
>> % +			if (VOP_GETATTR(dvp, &vattr, cnp->cn_cred, td) == 0 &&
>> % +			    timespeccmp(&vattr.va_mtime, &np->n_nctime, ==))
>> % +				return (ENOENT);

> Now that you've verified the positive performance impact of this
> change, you should start by testing this with the Connectathon test
> suite (both NFSv2 and NFSv3). Another useful test in this case is to
> observe how the client behaves in the face of replaced file objects.

Alas, I thought of a large problem without running any tests, ...

> Use another client to remove and recreate a file that your test client
> already has cached, or in this case create first after your client has
> already cached the negative lookup. Another interesting test case is

... and before reading this carefully. The directory timestamp (written by VOP_GETATTR() into vattr.va_mtime in the above) is normally cached.
Thus it can be stale, and then the negative cache entry can be stale too but gets used. With the default min dir. attr timeout of 30 seconds (BTW, why is this much larger than the min of 3 seconds for non-dirs?), the problem is easy to see using shell commands typed not very quickly. I can't see how to fix this without an RPC to refresh the directory attributes, and this defeats the point of reducing RPCs. So the negative cache optimization works best when combined with the no-ctoc optimization -- then consistency for negative hits is no worse than for positive ones.

Maybe a refresh for only the second-last component of the path (and only if the lookup is for open) would be good enough. For positive dircache hits, directory attributes are normally never refreshed until the timeout, so everything above the last component of the path can change without us noticing and it is only the forced refresh for the last component on open that gives us a chance to see that the file has gone, moved or changed. For negative dircache hits, we want similar semantics for opens but don't have a vp for the last component to refresh. Second-lastness is harder to determine.

> Have you reviewed the NetBSD change logs for any fixes in this area since
> 1997?

Some:
- use a timespec instead of a time_t for the timestamp, since we already do that for related timestamps
- don't get the comparison logic for the timespec backwards for 8 days
- not merged: some distributed changes involving updating timestamps more often since those seem to be only optimizations and I'm not sure if they are safe.

I now have some code for moving the attribute cache flush for close-to-open-consistency to the start of open() and removing it elsewhere. This was simple except for debugging and safety-belt parts, since there are already suitable flags for namei(). It interacts a bit with negative caching so I'll discuss it in the same thread now.
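The stale negative entry problem described above can be modelled in user space. This is only a sketch under stated assumptions: struct dir_model, model_getattr() and negative_hit_usable() are hypothetical names invented for illustration, not kernel code, and the "RPC" is just a counter. It shows that a negative entry validated against a cached directory mtime keeps being trusted until the attribute timeout expires, even though the server's directory has already changed.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <time.h>

/*
 * Hypothetical user-space model of a client's view of one directory.
 * None of these names come from the FreeBSD sources.
 */
struct dir_model {
	struct timespec server_mtime;	/* mtime as the server currently has it */
	struct timespec cached_mtime;	/* mtime held in the client's attr cache */
	time_t		attr_fetched;	/* when cached_mtime was last fetched */
	struct timespec neg_stamp;	/* mtime saved with the negative entry */
	int		rpcs;		/* simulated over-the-wire getattrs */
};

/* Compare two timespecs for equality, in the spirit of timespeccmp(a, b, ==). */
static bool
ts_equal(const struct timespec *a, const struct timespec *b)
{
	return (a->tv_sec == b->tv_sec && a->tv_nsec == b->tv_nsec);
}

/* Return the directory mtime, going "over the wire" only after the timeout. */
static struct timespec
model_getattr(struct dir_model *d, time_t now, time_t timeout)
{
	assert(d != NULL);
	if (now - d->attr_fetched >= timeout) {
		d->cached_mtime = d->server_mtime;
		d->attr_fetched = now;
		d->rpcs++;
	}
	return (d->cached_mtime);
}

/* A negative entry is trusted while the directory mtime appears unchanged. */
static bool
negative_hit_usable(struct dir_model *d, time_t now, time_t timeout)
{
	struct timespec seen = model_getattr(d, now, timeout);

	return (ts_equal(&seen, &d->neg_stamp));
}
```

With a 30 second timeout, a file created by another client a few seconds after the attributes were fetched is still reported as missing by negative_hit_usable() until the timeout expires; forcing the refresh earlier costs exactly the RPC the optimization was meant to save.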
% Index: nfs_vnops.c
% ===================================================================
% RCS file: /home/ncvs/src/sys/nfsclient/nfs_vnops.c,v
% retrieving revision 1.270
% diff -u -2 -r1.270 nfs_vnops.c
% --- nfs_vnops.c	14 Oct 2006 07:25:11 -0000	1.270
% +++ nfs_vnops.c	19 Oct 2006 00:03:23 -0000
% @@ -1,2 +1,4 @@
% +int nfs_ctoc = 1;		/* Bogus global enable for cto consistency. */
% +
%  /*-
%   * Copyright (c) 1989, 1993
% @@ -462,5 +464,7 @@
%  		if (error == EINTR || error == EIO)
%  			return (error);
% +		ASSERT_VOP_ELOCKED(vp, "nfs_open");
%  		np->n_attrstamp = 0;
% +		np->n_attrstampco = FALSE;
%  		if (vp->v_type == VDIR)
%  			np->n_direofoffset = 0;

Locking in nfs is dubious. We mostly hold the exclusive vnode lock and shouldn't need much mutex locking, but it now has a lot. The ex.vn.lock is enough for my new flag, and the above code already holds the mutex for np so n_attrstamp has more than enough locking.

% @@ -472,5 +476,23 @@
%  		mtx_unlock(&np->n_mtx);
%  	} else {
% -		np->n_attrstamp = 0;
% +		ASSERT_VOP_ELOCKED(vp, "nfs_open");
% +		if (nfs_ctoc && !np->n_attrstampco) {
% +			/*
% +			 * XXX this should not be reached. It defends
% +			 * against callers of namei() neglecting to set
% +			 * ISOPEN for _all_ opens and against open()
% +			 * dropping the exclusive vnode lock between
% +			 * nfs_lookup() and here. Note that there can
% +			 * be an nfs_create() call between nfs_open()
% +			 * and here, and that the main benefit of
% +			 * ensuring the attribute refresh earlier than
% +			 * here is for the creation case. If the lock
% +			 * does get dropped then we should use a timeout
% +			 * here.
% +			 */
% +			printf("nfs_open: missing attrcache refresh\n");
% +			np->n_attrstamp = 0;
% +		}
% +		np->n_attrstampco = FALSE;
%  		mtx_unlock(&np->n_mtx);
%  		error = VOP_GETATTR(vp, &vattr, ap->a_cred, ap->a_td);

This hasn't been reached lately but took a long time to get right. When it is reached, it is equivalent to the old code except for its printf().
It refreshes the attributes just after they have been used, but since we refetch here there are only minor problems.

% @@ -582,11 +604,4 @@
%  		mtx_lock(&np->n_mtx);
%  	    }
% -	    /*
% -	     * Invalidate the attribute cache in all cases.
% -	     * An open is going to fetch fresh attrs any way, other procs
% -	     * on this node that have file open will be forced to do an
% -	     * otw attr fetch, but this is safe.
% -	     */
% -	    np->n_attrstamp = 0;
%  	    if (np->n_flag & NWRITEERR) {
%  		np->n_flag &= ~NWRITEERR;

I think this can be safely removed without any other changes. We need to refresh on open(), and scheduling this refresh on close() doesn't really help. It doesn't do anything useful if it is on a different client to the open(), and on the same client it usually just gives an unnecessary RPC.

% @@ -852,8 +867,24 @@
%  		return (error);
%  	}
% -	if ((error = cache_lookup(dvp, vpp, cnp)) && error != ENOENT) {
% +	if ((error = cache_lookup(dvp, vpp, cnp)) != 0) {
%  		struct vattr vattr;
%  
% +		if (error == ENOENT) {
% +			/* Negative cache hit. Use it unless stale. */
% +			if (VOP_GETATTR(dvp, &vattr, cnp->cn_cred, td) == 0 &&
% +			    timespeccmp(&vattr.va_mtime, &np->n_nctime, ==))
% +				return (ENOENT);
% +
% +			cache_purge(dvp);
% +			timespecclear(&np->n_nctime);
% +			goto dorpc;
% +		}

Negative cache hit code, same as before.

%  		newvp = *vpp;
% +		if (nfs_ctoc &&
% +		    (flags & (ISLASTCN | ISOPEN)) == (ISLASTCN | ISOPEN)) {
% +			ASSERT_VOP_ELOCKED(vp, "nfs_lookup");
% +			VTONFS(newvp)->n_attrstamp = 0;
% +			VTONFS(newvp)->n_attrstampco = TRUE;
% +		}
%  		if (!VOP_GETATTR(newvp, &vattr, cnp->cn_cred, td)
%  		 && vattr.va_ctime.tv_sec == VTONFS(newvp)->n_ctime) {

This is for a positive vfs cache hit. Lookups for open() are easily detected. I didn't bother locking np for the access to n_attrstamp. n_attrstampco records that we refreshed the attributes so nfs_open() shouldn't do it again. This is mainly a safety belt.
vn_open_cred() seems to be exclusively locked almost throughout, so we can safely pass n_attrstampco to nfs_open() like this, but with exclusive locking we can almost guarantee that nfs_open() doesn't need to refresh.

% @@ -871,4 +902,5 @@
%  		*vpp = NULLVP;
%  	}
% +dorpc:
%  	error = 0;
%  	newvp = NULLVP;

Negative cache hit code, same as before.

% @@ -935,4 +967,10 @@
%  		newvp = NFSTOV(np);
%  	}
% +	if (nfs_ctoc &&
% +	    (flags & (ISLASTCN | ISOPEN)) == (ISLASTCN | ISOPEN)) {
% +		ASSERT_VOP_ELOCKED(vp, "nfs_lookup");
% +		VTONFS(newvp)->n_attrstamp = 0;
% +		VTONFS(newvp)->n_attrstampco = TRUE;
% +	}
%  	if (v3) {
%  		nfsm_postop_attr(newvp, attrflag);

This is for a remote lookup. I think the attributes haven't been fetched yet so clearing n_attrstamp is not needed.

% @@ -951,4 +989,11 @@
%  nfsmout:
%  	if (error) {
% +		if (error == ENOENT && (flags & MAKEENTRY) &&
% +		    cnp->cn_nameiop != CREATE) {
% +			/* Negative cache entry. */
% +			if (!timespecisset(&np->n_nctime))
% +				np->n_nctime = np->n_vattr.va_mtime;
% +			cache_enter(dvp, NULL, cnp);
% +		}
%  		if (newvp != NULLVP) {
%  			vput(newvp);

Negative cache hit code, same as before.

% @@ -1440,4 +1485,12 @@
%  			cache_enter(dvp, newvp, cnp);
%  		*ap->a_vpp = newvp;
% +		if (nfs_ctoc) {
% +			/*
% +			 * XXX this assumes that we will be followed by
% +			 * nfs_open() immediately.
% +			 */
% +			ASSERT_VOP_ELOCKED(vp, "nfs_create");
% +			VTONFS(newvp)->n_attrstampco = TRUE;
% +		}
%  	}
%  	mtx_lock(&(VTONFS(dvp))->n_mtx);

This is in nfs_create(). Creation is interesting. nfs_lookup() fails, and if this was due to a negative cache hit the directory timestamp had better be right. Then vn_open_cred() calls here. We don't have a vp until near here so we can't set n_attrstampco until here. Optimization of creation is the main possible optimization for the ctoc case. When we have just created the file, it is silly for nfs_open() to force an attribute refresh a few microseconds later.
(I haven't checked whether the file's cached attributes are ones we just wrote or ones we just refreshed after creation, and hope that whatever they are at this point is good enough.) This optimization reduces the number of Access RPCs for my kernel build benchmark by about 10% (from ~24000 to ~22000). Many of the negative cache hits for this benchmark are probably caused by creations, since the benchmark starts with the "make clean; ...; make depend" so the "make depend" part creates lots of negative cache hits for the files just removed. However, an incorrect negative cache hit seems most dangerous when the file is being created, so parts of the optimization are invalid -- the current and previous implementations of ctoc only refresh the attributes after possibly clobbering the old ones.

% @@ -1931,6 +1984,9 @@
%  		if (newvp)
%  			vput(newvp);
% -	} else
% +	} else {
% +		if (cnp->cn_flags & MAKEENTRY)
% +			cache_enter(dvp, newvp, cnp);
%  		*ap->a_vpp = newvp;
% +	}
%  	return (error);
%  }

Part of negative cache hit code, same as before, but it is for future positive cache hits (for the directory created by mkdir()).

% Index: nfsnode.h
% ===================================================================
% RCS file: /home/ncvs/src/sys/nfsclient/nfsnode.h,v
% retrieving revision 1.59
% diff -u -2 -r1.59 nfsnode.h
% --- nfsnode.h	13 Sep 2006 18:39:09 -0000	1.59
% +++ nfsnode.h	18 Oct 2006 10:09:19 -0000
% @@ -95,4 +95,5 @@
%  	struct vattr	n_vattr;	/* Vnode attribute cache */
%  	time_t		n_attrstamp;	/* Attr. cache timestamp */
% +	int		n_attrstampco;	/* n_attrs. cleared on open */
%  	u_int32_t	n_mode;		/* ACCESS mode cache */
%  	uid_t		n_modeuid;	/* credentials having mode */
% @@ -100,4 +101,5 @@
%  	struct timespec	n_mtime;	/* Prev modify time. */
%  	time_t		n_ctime;	/* Prev create time. */
% +	struct timespec	n_nctime;	/* Last neg cache entry (dir) */
%  	time_t		n_expiry;	/* Lease expiry time */
%  	nfsfh_t		*n_fhp;		/* NFS File Handle */

Bruce

From owner-freebsd-fs@FreeBSD.ORG Thu Oct 19 15:28:19 2006 Return-Path: X-Original-To: fs@freebsd.org Delivered-To: freebsd-fs@FreeBSD.ORG Received: from mx1.FreeBSD.org (mx1.freebsd.org [216.136.204.125]) by hub.freebsd.org (Postfix) with ESMTP id 8882E16A403 for ; Thu, 19 Oct 2006 15:28:19 +0000 (UTC) (envelope-from rick@snowhite.cis.uoguelph.ca) Received: from dargo.cs.uoguelph.ca (dargo.cs.uoguelph.ca [131.104.94.197]) by mx1.FreeBSD.org (Postfix) with ESMTP id 887B843D66 for ; Thu, 19 Oct 2006 15:28:10 +0000 (GMT) (envelope-from rick@snowhite.cis.uoguelph.ca) Received: from snowhite.cis.uoguelph.ca (snowhite.cis.uoguelph.ca [131.104.48.1]) by dargo.cs.uoguelph.ca (8.13.1/8.13.1) with ESMTP id k9JFS8Sq014581; Thu, 19 Oct 2006 11:28:08 -0400 Received: (from rick@localhost) by snowhite.cis.uoguelph.ca (8.9.3/8.9.3) id LAA03929; Thu, 19 Oct 2006 11:28:20 -0400 (EDT) Date: Thu, 19 Oct 2006 11:28:20 -0400 (EDT) From: rick@snowhite.cis.uoguelph.ca Message-Id: <200610191528.LAA03929@snowhite.cis.uoguelph.ca> To: fs@freebsd.org X-Scanned-By: MIMEDefang 2.57 on 131.104.94.197 Cc: Subject: Re: lost dotdot caching pessimizes nfs especially X-BeenThere: freebsd-fs@freebsd.org X-Mailman-Version: 2.1.5 Precedence: list List-Id: Filesystems List-Unsubscribe: , List-Archive: List-Post: List-Help: List-Subscribe: , X-List-Received-Date: Thu, 19 Oct 2006 15:28:19 -0000

[Mike wrote] > While RFC 3530 does not use the term "close-to-open consistency" > (something that I'll address in the NFSv4.1 protocol), it does say: > > Furthermore, in the absence of open delegation (see the section > "Open > Delegation") two additional rules apply. Note that these rules are > obeyed in practice by many NFS version 2 and version 3 clients. > > o First, cached data present on a client must be revalidated after > doing an OPEN.
Revalidating means that the client fetches the > change attribute from the server, compares it with the cached > change attribute, and if different, declares the cached data (as > well as the cached attributes) as invalid. This is to ensure > that > the data for the OPENed file is still correctly reflected in the > client's cache. This validation must be done at least when the > client's OPEN operation includes DENY=WRITE or BOTH thus > terminating a period in which other clients may have had the > opportunity to open the file with WRITE access. Clients may > choose to do the revalidation more often (i.e., at OPENs > specifying DENY=NONE) to parallel the NFS version 3 protocol's > practice for the benefit of users assuming this degree of cache > revalidation. > > Since the change attribute is updated for data and metadata > modifications, some client implementors may be tempted to use the > time_modify attribute and not change to validate cached data, so > that metadata changes do not spuriously invalidate clean data. > The implementor is cautioned in this approach. The change > attribute is guaranteed to change for each update to the file, > whereas time_modify is guaranteed to change only at the > granularity of the time_delta attribute. Use by the client's > data > cache validation logic of time_modify and not change runs the > risk > of the client incorrectly marking stale data as valid. > > o Second, modified data must be flushed to the server before > closing > a file OPENed for write. This is complementary to the first > rule. > If the data is not flushed at CLOSE, the revalidation done after > client OPENs as file is unable to achieve its purpose. Oops, yes, this is in Sec. 9.3. I didn't spot it when I was emailing a few days ago, but do have code for it in my v4 client (so I must have read it at some point:-). 
(fyi, the RFC is only 275 pages) I'd argue that it contradicts the section I quoted in the last message I posted, but does spell it out in enough detail that it can't be misinterpreted, whereas the sections I quoted were kinda vague. I'll simply re-iterate that this isn't a problem if clients are using delegations efficiently, for v4, that is. (At this point, none of the v4 clients I am aware of are using delegations to avoid writes on close and checking against the server upon open, but hopefully it's being worked on. Since all the interest is in pNFS these days, I'm not so sure, but...) rick
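
A rough sketch of the two quoted rules, in user-space C. Everything here is an assumption made for illustration: struct file_cache, revalidate_after_open(), flush_before_close() and flush_ok() are hypothetical names, not taken from any NFS client, and the server's change attribute is simply passed in as an argument instead of being fetched.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical per-file client cache state; names are illustrative only. */
struct file_cache {
	uint64_t cached_change;	/* change attribute seen when data was cached */
	bool	 data_valid;	/* cached data may be used */
	bool	 dirty;		/* modified data not yet written to the server */
};

/*
 * First rule: after an OPEN, compare the server's change attribute with
 * the cached one and invalidate the cached data on a mismatch.
 * Returns true if the cached data survived revalidation.
 */
static bool
revalidate_after_open(struct file_cache *fc, uint64_t server_change)
{
	assert(fc != NULL);
	if (fc->cached_change != server_change) {
		fc->data_valid = false;
		fc->cached_change = server_change;
	}
	return (fc->data_valid);
}

/*
 * Second rule: flush modified data before CLOSE. flush() stands in for
 * whatever writes dirty data to the server; it returns 0 on success.
 */
static int
flush_before_close(struct file_cache *fc, int (*flush)(struct file_cache *))
{
	int error = 0;

	assert(fc != NULL && flush != NULL);
	if (fc->dirty) {
		error = flush(fc);
		if (error == 0)
			fc->dirty = false;
	}
	return (error);
}

/* Trivial flush used for demonstration: pretend the write succeeded. */
static int
flush_ok(struct file_cache *fc)
{
	(void)fc;
	return (0);
}
```

The sketch compares the change attribute and not time_modify, following the caution in the quoted text that time_modify only changes at the granularity of time_delta.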