From owner-freebsd-fs@FreeBSD.ORG  Tue Apr 14 15:52:35 2009
Return-Path: <owner-freebsd-fs@FreeBSD.ORG>
Delivered-To: freebsd-fs@freebsd.org
Received: from mx1.freebsd.org (mx1.freebsd.org [IPv6:2001:4f8:fff6::34])
	by hub.freebsd.org (Postfix) with ESMTP id 056D41065670
	for <freebsd-fs@freebsd.org>; Tue, 14 Apr 2009 15:52:35 +0000 (UTC)
	(envelope-from rmacklem@uoguelph.ca)
Received: from acadia.cs.uoguelph.ca (acadia.cs.uoguelph.ca [131.104.94.221])
	by mx1.freebsd.org (Postfix) with ESMTP id B63678FC12
	for <freebsd-fs@freebsd.org>; Tue, 14 Apr 2009 15:52:34 +0000 (UTC)
	(envelope-from rmacklem@uoguelph.ca)
Received: from muncher.cs.uoguelph.ca (muncher.cs.uoguelph.ca [131.104.91.102])
	by acadia.cs.uoguelph.ca (8.13.1/8.13.1) with ESMTP id n3EFqXxR008878; 
	Tue, 14 Apr 2009 11:52:33 -0400
Received: from localhost (rmacklem@localhost)
	by muncher.cs.uoguelph.ca (8.11.7p3+Sun/8.11.6) with ESMTP id
	n3EFx2S20621; Tue, 14 Apr 2009 11:59:02 -0400 (EDT)
X-Authentication-Warning: muncher.cs.uoguelph.ca: rmacklem owned process doing
	-bs
Date: Tue, 14 Apr 2009 11:59:02 -0400 (EDT)
From: Rick Macklem <rmacklem@uoguelph.ca>
X-X-Sender: rmacklem@muncher.cs.uoguelph.ca
To: Bruce Evans <brde@optusnet.com.au>
In-Reply-To: <20090414180826.J53102@delplex.bde.org>
Message-ID: <Pine.GSO.4.63.0904141110590.13047@muncher.cs.uoguelph.ca>
References: <Pine.GSO.4.63.0904121553050.3872@muncher.cs.uoguelph.ca>
	<20090413193936.A52183@delplex.bde.org>
	<Pine.GSO.4.63.0904131036310.27001@muncher.cs.uoguelph.ca>
	<20090414180826.J53102@delplex.bde.org>
MIME-Version: 1.0
Content-Type: TEXT/PLAIN; charset=US-ASCII; format=flowed
X-Scanned-By: MIMEDefang 2.63 on 131.104.94.221
Cc: freebsd-fs@freebsd.org
Subject: Re: changing semantics of the va_filerev (code review)
X-BeenThere: freebsd-fs@freebsd.org
X-Mailman-Version: 2.1.5
Precedence: list
List-Id: Filesystems <freebsd-fs.freebsd.org>
List-Unsubscribe: <http://lists.freebsd.org/mailman/listinfo/freebsd-fs>,
	<mailto:freebsd-fs-request@freebsd.org?subject=unsubscribe>
List-Archive: <http://lists.freebsd.org/pipermail/freebsd-fs>
List-Post: <mailto:freebsd-fs@freebsd.org>
List-Help: <mailto:freebsd-fs-request@freebsd.org?subject=help>
List-Subscribe: <http://lists.freebsd.org/mailman/listinfo/freebsd-fs>,
	<mailto:freebsd-fs-request@freebsd.org?subject=subscribe>
X-List-Received-Date: Tue, 14 Apr 2009 15:52:35 -0000


On Tue, 14 Apr 2009, Bruce Evans wrote:

[stuff snipped]
>
> Oops, I missed that since nfs's use of i_gen is indirect.  What does
> nfs do for file systems that don't detect removed files, e.g., msdosfs.
> vptofh and fhtovp routines seem to have too many differences.  E.g.,
An nfs client can always think that a file exists for a short period of
time (until client side caches time out) after it has been removed locally
or by another client, on the server. The more serious failure occurs when
the i-node/directory entry gets reallocated. At that point, the client 
might access the attributes/data of the new file, thinking it was the old
file. (In the worst case, this could persist until the client does a
umount() of the file system.)

However, typically, unless it has the file open when the file is removed
locally on the server or by another client, nothing nasty will happen.
(And I think if the client has name caching disabled, nothing nasty can
happen.)

At least, that's my best guess at an answer.

> file systems based on ffs return ESTALE for removed files, but
> zfs_fhtovp() returns EINVAL.
>
I don't know why zfs would choose a different errno, but I don't think
that a different errno will have much effect. It's a terminal error
in either case. (I can't think of anything clever that a client can
do for ESTALE. I wouldn't be surprised if some clients end up translating
ESTALE to EINVAL, since POSIX apps don't expect ESTALE.)

I suppose someone could argue it violates the RFC, but only if they know
that the server should generate NFS3ERR_STALE for that case.

> I just noticed than the increment of i_gen was slightly broken for ffs
> by a type mismatch in ffs2 (affects ffs1 too).  Originally, i_gen had
> the same type as di_gen (int32_t).  Now i_gen has type int64_t but in
> ffs1, di_gen of course still has type int32_t, and in ffs2, di_gen
> still has type int32_t (apparently there was insufficient space to
> expand it).  This makes the overflow check in ffs_alloc.c (++ip->i_gen
> == 0) more broken than before.  Previously it only gave undefined
> behaviour followed by a bogus check when overflow occurs for incrementing
> from INT32_T_MAX.  Now it has no effect, since it takes 293 years of
> incrementing at a rate of 1GHz to reach overflow at INT64_T_MAX.
> Overflow now occurs on assignment to di_gen.
>
> The result of this bug is almost the the same as removing the silly
> part of the security code -- the re-randomization on overflow.  i_gen
> may grow larger than UINT32_T_MAX, but usually refresh from the dinode
> will keep it smaller.  When it starts near UINT32_T_MAX and grows
> larger, the overflow on assignment and a subsequent refresh will make
> it nearly 0.  Except, in 1 in every 2**32 cases, when the overflow makes
> di_gen exactly 0, the subsequent refresh will randomize i_gen.
>
Sounds like you have a better understanding of this than I. Since all
nfs really cares about is that the value of i_gen has changed after
the i-node is re-allocated, I doubt this causes grief in practice.
Personally, I'd just leave it as a 32bit number and initialize it
to some pseudo-random value in a range that is a small fraction of
UINT32_T_MAX (maybe 1<->1000000) if it is 0, otherwise just increment
it by a small value. (I've already noted that I'm not a big fan of
security by obscurity anyhow:-)

>>> va_ctime should give what you want for all file systems, since it
>>> should be increased whenever anything changes.  However, most file
>> There are some places where IN_UPDATE gets set, but IN_CHANGE doesn't.
>
> Are there?  This would be a bug.  I checked that ffs doesn't have this
> bug.
>
Oops, my mistake. I grep'd again and see it is IN_CHANGE that gets set
without IN_UPDATE and not the other way around, which makes sense, since
I can't think of how you can modify the data without modifying some
attribute.

So, the Change attribute only needs to change for IN_CHANGE (with all 
those uses of "change", it must be good:-). Thanks for pointing this out.

>
> They need to be fixed or faked well enough for make(1) too.
>
> When the dinode has no space to spare, something can be done by keeping
> state in the inode or vnode.  This won't work across reboots of course
> (except by hashing a reboot counter into the generation counts or
> timestamps) but might be enough for all short-term uses.  I'm not sure
> how much is safe here.
>
Yes, definitely. I think doing something like having an in-memory field
for va_filerev/i_modrev where the high order bits are initialized by
ctime (using whatever bits are valid, given tod clock resolution) when 
read in and then incrementing by 1 for each change, would be a good 
compromise.

rick