From owner-freebsd-fs@FreeBSD.ORG Tue Apr 14 15:52:35 2009 Return-Path: Delivered-To: freebsd-fs@freebsd.org Received: from mx1.freebsd.org (mx1.freebsd.org [IPv6:2001:4f8:fff6::34]) by hub.freebsd.org (Postfix) with ESMTP id 056D41065670 for ; Tue, 14 Apr 2009 15:52:35 +0000 (UTC) (envelope-from rmacklem@uoguelph.ca) Received: from acadia.cs.uoguelph.ca (acadia.cs.uoguelph.ca [131.104.94.221]) by mx1.freebsd.org (Postfix) with ESMTP id B63678FC12 for ; Tue, 14 Apr 2009 15:52:34 +0000 (UTC) (envelope-from rmacklem@uoguelph.ca) Received: from muncher.cs.uoguelph.ca (muncher.cs.uoguelph.ca [131.104.91.102]) by acadia.cs.uoguelph.ca (8.13.1/8.13.1) with ESMTP id n3EFqXxR008878; Tue, 14 Apr 2009 11:52:33 -0400 Received: from localhost (rmacklem@localhost) by muncher.cs.uoguelph.ca (8.11.7p3+Sun/8.11.6) with ESMTP id n3EFx2S20621; Tue, 14 Apr 2009 11:59:02 -0400 (EDT) X-Authentication-Warning: muncher.cs.uoguelph.ca: rmacklem owned process doing -bs Date: Tue, 14 Apr 2009 11:59:02 -0400 (EDT) From: Rick Macklem X-X-Sender: rmacklem@muncher.cs.uoguelph.ca To: Bruce Evans In-Reply-To: <20090414180826.J53102@delplex.bde.org> Message-ID: References: <20090413193936.A52183@delplex.bde.org> <20090414180826.J53102@delplex.bde.org> MIME-Version: 1.0 Content-Type: TEXT/PLAIN; charset=US-ASCII; format=flowed X-Scanned-By: MIMEDefang 2.63 on 131.104.94.221 Cc: freebsd-fs@freebsd.org Subject: Re: changing semantics of the va_filerev (code review) X-BeenThere: freebsd-fs@freebsd.org X-Mailman-Version: 2.1.5 Precedence: list List-Id: Filesystems List-Unsubscribe: , List-Archive: List-Post: List-Help: List-Subscribe: , X-List-Received-Date: Tue, 14 Apr 2009 15:52:35 -0000 On Tue, 14 Apr 2009, Bruce Evans wrote: [stuff snipped] > > Oops, I missed that since nfs's use of i_gen is indirect. What does > nfs do for file systems that don't detect removed files, e.g., msdosfs. > vptofh and fhtovp routines seem to have too many differences. E.g., An nfs client can always think that a file exists for a short period of time (until client side caches time out) after it has been removed locally or by another client, on the server. The more serious failure occurs when the i-node/directory entry gets reallocated. At that point, the client might access the attributes/data of the new file, thinking it was the old file. (In the worst case, this could persist until the client does a umount() of the file system.) However, typically, unless it has the file open when the file is removed locally on the server or by another client, nothing nasty will happen. (And I think if the client has name caching disabled, nothing nasty can happen.) At least, that's my best guess at an answer. > file systems based on ffs return ESTALE for removed files, but > zfs_fhtovp() returns EINVAL. > I don't know why zfs would choose a different errno, but I don't think that a different errno will have much effect. It's a terminal error in either case. (I can't think of anything clever that a client can do for ESTALE. I wouldn't be surprised if some clients end up translating ESTALE to EINVAL, since POSIX apps don't expect ESTALE.) I suppose someone could argue it violates the RFC, but only if they know that the server should generate NFS3ERR_STALE for that case. > I just noticed than the increment of i_gen was slightly broken for ffs > by a type mismatch in ffs2 (affects ffs1 too). Originally, i_gen had > the same type as di_gen (int32_t). Now i_gen has type int64_t but in > ffs1, di_gen of course still has type int32_t, and in ffs2, di_gen > still has type int32_t (apparently there was insufficient space to > expand it). This makes the overflow check in ffs_alloc.c (++ip->i_gen > == 0) more broken than before. Previously it only gave undefined > behaviour followed by a bogus check when overflow occurs for incrementing > from INT32_T_MAX. Now it has no effect, since it takes 293 years of > incrementing at a rate of 1GHz to reach overflow at INT64_T_MAX. > Overflow now occurs on assignment to di_gen. > > The result of this bug is almost the the same as removing the silly > part of the security code -- the re-randomization on overflow. i_gen > may grow larger than UINT32_T_MAX, but usually refresh from the dinode > will keep it smaller. When it starts near UINT32_T_MAX and grows > larger, the overflow on assignment and a subsequent refresh will make > it nearly 0. Except, in 1 in every 2**32 cases, when the overflow makes > di_gen exactly 0, the subsequent refresh will randomize i_gen. > Sounds like you have a better understanding of this than I. Since all nfs really cares about is that the value of i_gen has changed after the i-node is re-allocated, I doubt this causes grief in practice. Personally, I'd just leave it as a 32bit number and initialize it to some pseudo-random value in a range that is a small fraction of UINT32_T_MAX (maybe 1<->1000000) if it is 0, otherwise just increment it by a small value. (I've already noted that I'm not a big fan of security by obscurity anyhow:-) >>> va_ctime should give what you want for all file systems, since it >>> should be increased whenever anything changes. However, most file >> There are some places where IN_UPDATE gets set, but IN_CHANGE doesn't. > > Are there? This would be a bug. I checked that ffs doesn't have this > bug. > Oops, my mistake. I grep'd again and see it is IN_CHANGE that gets set without IN_UPDATE and not the other way around, which makes sense, since I can't think of how you can modify the data without modifying some attribute. So, the Change attribute only needs to change for IN_CHANGE (with all those uses of "change", it must be good:-). Thanks for pointing this out. > > They need to be fixed or faked well enough for make(1) too. > > When the dinode has no space to spare, something can be done by keeping > state in the inode or vnode. This won't work across reboots of course > (except by hashing a reboot counter into the generation counts or > timestamps) but might be enough for all short-term uses. I'm not sure > how much is safe here. > Yes, definitely. I think doing something like having an in-memory field for va_filerev/i_modrev where the high order bits are initialized by ctime (using whatever bits are valid, given tod clock resolution) when read in and then incrementing by 1 for each change, would be a good compromise. rick