Date:      Mon, 24 Jul 95 15:36:48 MDT
From:      terry@cs.weber.edu (Terry Lambert)
To:        dfr@render.com (Doug Rabson)
Cc:        peter@haywire.dialix.com, freebsd-current@freebsd.org
Subject:   Re: what's going on here? (NFSv3 problem?)
Message-ID:  <9507242136.AA09885@cs.weber.edu>
In-Reply-To: <Pine.BSF.3.91.950724112353.12542B-100000@minnow.render.com> from "Doug Rabson" at Jul 24, 95 11:50:14 am

> > Most file systems do not provide a generation count on directory blocks
> > with which to validate the "cookie".
> > 
> > With that in mind, the "cookie" is typically interpreted either as an
> > entry offset or as a byte offset of entry, either in the block or in
> > the directory.
> 
> The NFSv3 code in -current uses the modification time of the directory as 
> the verifier.  This is perhaps a slightly pessimistic solution but it 
> should detect compactions.  The client reacts to bad cookie errors by 
> flushing its cached information about the directory.  This seems to be a 
> reasonable reaction to the directory being modified.
> 
> Can the ufs code ever compact a directory block *without* the 
> modification time of the directory changing?  Presumably it only ever 
> does this as a result of some other operation on the directory.

This is a good question.

I can answer it in terms of existing practice and in terms of POSIX time
update semantics requirements.

For UFS, the compaction can only take place when an entry has failed
lookup during creation and is therefore being created (ie: with the
directory locked).

That is, a directory data modification is involved.

Does this mean that directory times will be updated?

Under POSIX, it does not.  The modification time update semantics in
POSIX are file-bound.  That is, one is not required to update the
times for directories the same as one is required to update the times
for files.  The single exception is directory read operations, which
must *mark for update* the access time.  Note that this does not
require the time actually to have been updated by the time a subsequent
access takes place.

It is easy to envision compaction in a DOS style directory (this is,
effectively, what Win95 does in order to support long names), where,
since the file names are attributes of the file rather than real
directory contents, such compaction does *not* cause the directory
even to be marked for update!

That is, depending on this behaviour has existing failure modes for
non-POSIX file systems in any case.

I think it is a mistake to assume that the NFS exporting of file
systems should only work when the NFS export is a client of POSIX
file system services (and even then, it depends on "mark for update"
referring to a change of the in core time stamp, rather than a real
marking that flags the in core and on disk times to be updated at
dirty page discard time -- assuming a directory is implemented as a
file at all instead of being considered a logically separate entity).

All that said, yes, in UFS, it happens to work.  Currently.  8-/.

> > The stat structure passed around internally is larger than the stat
> > structure expected by NFS.
> > 
> > Rather than fix the view of things at the time it was exported to
> > NFS, the internal buffer representation for all file system capable
> > of being exported was changed.
> > 
> > I can't say I'm not glad that this is coming back to haunt us.
> 
> At the time, I was more interested in fixing the completely stupid 
> assumption the NFS server was making about the FS implementation which 
> only ever worked for UFS.  Adding a whole new layer of code between NFS 
> and the VFS would have added maintenance problems, consistency problems 
> (we would be caching directory information; when is the cache invalid?  
> when should stuff be removed from it?) and needless complication.

I think the cache issue is separate.  Specifically, directory caching
should be generalized externally to the file system implementations
themselves.  Potentially, it should even be a separate layer, although
the only thing dictating that would be the lack of a filesystem-initiated
cache callback mechanism for ensuring coherency.  Even then, that's a
problem with the file system under the cache and should be handled in
the file system implementation rather than being hacked around by adding
function call layering everywhere so that it can be omitted for file
systems that might undergo promiscuous changes (ie: NFS, AFS).

The assumptions that NFS made were, indeed *wrong*.  But since the issue
was FS implementation independent metadata presentation, the fact is
that the complication would have been purely NFS's problem -- and at
that, it's caused by the statelessness NFS insists on maintaining.
A presentation layer would have added a single function call overhead to
the NFS based ops -- and avoided the buffer size implications that got
strewn about as a result.  The layer itself would have disappeared,
all but the single function call dealing with stat, when the call
graph was collapsed in creating the file system instance.

Admittedly, this would have meant dealing with some of the messier
stackability issues then, rather than later.

The other alternative would have been to put off the stackability
issues until later and to eat two copies in the NFS layer (and some
stack allocated buffer space).  This actually wouldn't have been that
big of a hit to take in any case, since the bottleneck is the network
(relative to the extra copy time).

Either way, it's really water under the bridge, although I'm going to
be beating on some of the stackability issues in the near future; in
particular, moving the directory cache up to the vnode rather than the
inode layer and going to a per FS type vnode pool to overcome the
inode space limitations imposed by common inode allocation both need
to happen in the near future.  Luckily, USL has kindly documented the
SVR4 DNLC (a vnode based directory name lookup cache) for us, though it
is missing the ability to keep negative cache entries (I'll fix that,
though; it's relatively easy even in the USL code).

The stackability issues must be resolved to support both user space
file system development and source level debugging, and to allow for
general support of a per block file compression layer that operates
only on files, not directories.

> I added code as part of this fix which would deal with unaligned UFS
> directory reads, more or less on the lines of the approach you suggested.  

I noticed (and appreciate!) the code there.  It helps the restart stuff
immensely.  The code pretty much has to be there for a VM86() based
INT 21 redirector to map UFS volumes as DOS drives under VM86() based
DOS emulation in any case.  The lack of an opendir/closedir type paradigm
in the DOS FindFirst/FindNext directory scanning routines makes this
especially necessary, unless we wanted to keep around LRU lists of
some finite number of contexts for DOS searches outstanding (what Novell
does in their DOS redirector).

It also allows a "DOS porting interface" for DOS code that does INT 21
access, if the interface is exported at the FS system call layer by using
a VFS layer specific ioctl() for FindFirst/FindNext.

Wine wants this kind of portability API.

> The FS reads from the aligned address.  NFS then finds from the information
> returned by the FS the first entry whose cookie is greater or equal to the
> cookie sent by the client.  The only restriction this places on VFS for
> directory cookies is that they increase monotonically in a directory. 

This is The Right Way.  8-).

> In the case of a compacted directory block, the client may receive 
> filenames it has already seen or it may miss a few entries.  It will 
> never receive corrupt information.

Right.  I believe it is the responsibility of the client to deal with
this fact.  Otherwise we are screwed at the outset regarding kernel
preemption and SMP kernel reentrancy, both short-term issues in terms
of the need to provide file system multithreading.

So I definitely don't have problems with that code.

> The current v2 server has an adequate strategy for dealing with directory
> compaction for all read sizes, IMHO.  The directory verifier is *not*
> optional in NFSv3.  The only optional part AFAIK is the use of READDIRPLUS by
> the client to read file attributes with the names.  Both READDIR and
> READDIRPLUS *must* implement a verifier strategy. 

I'm much less concerned with the client side of things, but the verifier
*does* prevent the server from being a chintzy minimal implementation, and
that's the important thing.  It remains to be seen whether using the date
as the verifier is really a valid thing to do for non-POSIX-compliant
file system implementations -- or POSIX-compliant implementations where
the directory is not a file (NT, VMS, etc.).  I think the answer must be
"no".

> A server *can* choose to return zero for a verifier but only if the 
> cookies it generates are *always* valid, e.g. for read-only media.  From 
> rfc1813, section 3.3.16:
> 
>       One implementation of the cookie-verifier mechanism might
>       be for the server to use the modification time of the
>       directory. This might be overly restrictive, however. A
>       better approach would be to record the time of the last
>       directory modification that changed the directory
>       organization in a way that would make it impossible to
>       reliably interpret a cookie. Servers in which directory
>       cookies are always valid are free to use zero as the
>       verifier always.

Yes.  This speaks to the organization (or rather, the lack of it) in the
VFS framework regarding directory vs. file operations.  Specifically,
it should be possible to get callbacks, a la Andrew, at the presentation
layer, such that file system events that affect NFS exported volumes are
in fact propagated to the NFS layer so it can act appropriately.

Obviously, this code isn't there yet. 8-(.


					Regards,
					Terry Lambert
					terry@cs.weber.edu
---
Any opinions in this posting are my own and not those of my present
or previous employers.


