Skip site navigation (1)Skip section navigation (2)
Date:      Fri, 15 Dec 1995 13:02:16 -0700 (MST)
From:      Terry Lambert <terry@lambert.org>
To:        dfr@render.com (Doug Rabson)
Cc:        terry@lambert.org, current@freebsd.org
Subject:   Re: VOP_READIR revisited
Message-ID:  <199512152002.NAA04666@phaeton.artisoft.com>
In-Reply-To: <Pine.BSF.3.91.951215112604.457L-100000@minnow.render.com> from "Doug Rabson" at Dec 15, 95 12:15:08 pm

next in thread | previous in thread | raw e-mail | index | archive | help
> > I know that.  The search restart isn't an issue except in the NFS case.
> > 
> > ...well, and the telldir/seekdir/readdir case, which is currently broken
> > according to existing practice, but technically correct according to
> > the documentation.  It's broken *because* it doesn't use cookies or
> > some other restart mechanism.
> 
> I think it is technically broken according to the documentation but the 
> documentation implies behaviour which is wrong IMHO.  In any case the 
> abnormal usage which WINE made of this has been changed to something more 
> conservative (assuming they integrate Julian's patch that is).

I agree with this.  I'd like that case to operate as it does on SunOS,
since that's as close to a "traditional" behaviour as we'll be able to
find.

I think that by making the telldir token a 32 bit value instead of a
token will fix the problem instantly.  No malloc so no need to maintain
nor free a token list.

Actually, there's no reason the current implementation can't be fixed
like this using the bit vector translation mechanism I suggested before.
Unfortunately, telldir is documented as returning a 'long', not an off_t,
so that places a 26 (+1) bit limit on the magnitued of the index, and
thus the directory size.

For a token conforming to the documentation, where an intervening
directory compaction could screw up the usability of the token because
of directory compaction (this limit arises because the directory block
buffer in core is a snapshot, taken atomically, of the buffer on disk),
a simple linear 32 bit offset can be used, using the off_t from the
actual fd open on the dir before the getdirentries() call, plus the
offset in the block snapshot of the in core block relative to its
start address.  Then using the find-block-scan-forward-to-offset-and-
backof-on-miss approach (which would work right now in place of the
cookies, if you were willing to accept the occasional duplicate entry
in the compaction case), the token can be used to restart the search
after the dir has been reopened.

The biggest danger here is handling the directory truncation which
4.4BSD added to the directory compaction.


> Don't be too sure you will win by shaving a couple of arguments off the 
> function call.  All the memory involved is in the cache and writing NULL 
> to a cached memory location is virtually free.

I'm considering the additional overhead of the push of the call arguments
as well as the stack structure element assigns in the VOP_READDIR proper
(which I assume is what you are talking about?).  There is also the cycles
burned on the compare/jump in the non-cookie case, and the fact that the
code generation is complicated by having another couple of values that
it will have to invalidate registers on to generate the code (there is
no way of hinting to the compiler that the cookie variables are referenced
in the least common case).

> I don't agree that replacing a function which reads and parses a 
> directory block in one operation with two functions, one which reads a 
> block and one which parses it is code simplification.

Logical or algorithmic simplification?  The two are not tightly bound.

Basically, this is intended only to delay the disk-directory-contents
to user-snapshot-of-disk-directory-contents-in-user-format.  Nothing
more.


> > So let's concentrate on the benefits:
> > 
> > The primary benefit is a search restart without use of cookies.
> > 
> > The secondary benefit is an elimination of the malloc in the "on disk
> > directory entry larger than the 'cannonical' directory entry" case.
> 
> How is the malloc eliminated?  Surely the caller will have to first 
> malloc space for reading the fs-specific directory block and then parse 
> that into getdirentries format.  It seems to me that in the most common 
> case (UFS), the caller will do *more* work, not less.

Good question.

It's eliminated because the buffer pointer in the directory vnode is
what is passed around to use in the translation to the user buffer.

If I VOP_READDIR a directory block from a CDROM, for instance, the
directory block is put in core in a buffer referenced off the vnode.

I can later use this buffer directly fro my source when translating the
entries to user space.

This is actually one case where prevalidation of the target of a
copyout is a good idea: it avoids multiple validations in the copyout
case.  So a local stack "struct dirent *" is all that's used.

> > How do we eliminate restart overhead (cookies)?

[ ... my stuff elided ... ]

> So the client would do something like:
> 
> 	struct uio auio;
> 	struct iovec aiov;
> 	off_t new_offset;
> 	caddr_t buf;
> 
> 	/*
> 	 * Setup to read the raw directory block.
> 	 */
> 	buf = aiov.iov_base = malloc(dir_block_size);
> 	aiov.iov_len = dir_block_size;
> 	auio.uio_iov = &aiov;
> 	auio.uio_iovcnt = 1;
> 	auio.uio_rw = UIO_READ;
> 	auio.uio_segflg = UIO_SYSSPACE;
> 	auio.uio_procp = p;
> 	auio.uio_resid = dir_block_size;
> 	auio.uio_offset = (fp->f_offset >> entry_bits) << dir_block_shift;
> 
> 	/*
> 	 * Read the block
> 	 */
> 	error = VOP_READDIR(vp, &auio, ...);
> 
> 	/*
> 	 * Parse the block into the user's memory, returning the new offset
> 	 * where reading can be restarted.  Probably use another struct uio
> 	 * here to allow converting into both sys and user space.
> 	 */
> 	error = VOP_DIRCVT(vp, buf, uap->buf, uap->count, &new_offset);
> 
> How do you propose avoiding the extra copy for UFS where the directory 
> does not need converting?

A copy from the kernel space buffer to the user space buffer must always
take place, unless you start being tricky about process space mapping.

The sneaky bit is that the VOP_READDIR doesn't ever do a copy, it only
faults in the directory block in question (if it's not in core) or moves
it to the top of the LRU (if it is).  The VOP_DIRCVT is the one that
takes the uio: the purpose of a uio in the directory case is to describe
a movement of data from kernel to user space.  So your second comment
is correct.


The side benefit of handling a directory block that contains more entries
than can be returned in a user directory block falls out from the fact
that we have changed the restart mechanism: now it is possible to take
a single disk directory block and parse it into multiple user dirent
entries in the user supplied buffer.  The restart mechanism means that
we can do this "safely".


The next question I'd have, if I were reading this for the first time,
is "why not just make the VOP_READDIR act this way itself, throwing out
the old VOP_READDIR and the new VOP_DIRCVT.

The answer to this is non-obvious: by decoupling the operations, you
allow the copy operation to be "delayed".  This is an issue of FS
layering, more than anything else.  If I stack several file system
layers on top of each other, the underlying vnode containing the
directory is the one I am interested in having translated.  So the
VOP_READDIR *also* takes a (struct vnode **) to be filled out by the
underlying FS that answers the VOP_READDIR request.  The VOP_DIRCVT
acts on this vnode.  For the majority of cases, this is the same
vnode as the call-down interface for the top level directory being
read, since most FS's are single layer (or integrated layer, like
those which use ufs and have statically placed ufs_* calls in their
VOP tables).  But in the nullfs, lofs, or hypothetical compression
or encryption layers, etc., the vp is that of the underlying layer
and thus the VOP_DIRCVT associated with the vp is the correct one
for the data being retrieved.

> Doug Rabson, Microsoft RenderMorphics Ltd.	Mail:  dfr@render.com

PS: Hey!   You guys got bought it looks like!


					Terry Lambert
					terry@lambert.org
---
Any opinions in this posting are my own and not those of my present
or previous employers.



Want to link to this message? Use this URL: <https://mail-archive.FreeBSD.org/cgi/mid.cgi?199512152002.NAA04666>