Date: Fri, 15 Dec 1995 13:02:16 -0700 (MST) From: Terry Lambert <terry@lambert.org> To: dfr@render.com (Doug Rabson) Cc: terry@lambert.org, current@freebsd.org Subject: Re: VOP_READIR revisited Message-ID: <199512152002.NAA04666@phaeton.artisoft.com> In-Reply-To: <Pine.BSF.3.91.951215112604.457L-100000@minnow.render.com> from "Doug Rabson" at Dec 15, 95 12:15:08 pm
next in thread | previous in thread | raw e-mail | index | archive | help
> > I know that. The search restart isn't an issue except in the NFS case. > > > > ...well, and the telldir/seekdir/readdir case, which is currently broken > > according to existing practice, but technically correct according to > > the documentation. It's broken *because* it doesn't use cookies or > > some other restart mechanism. > > I think it is technically broken according to the documentation but the > documentation implies behaviour which is wrong IMHO. In any case the > abnormal usage which WINE made of this has been changed to something more > conservative (assuming they integrate Julian's patch that is). I agree with this. I'd like that case to operate as it does on SunOS, since that's as close to a "traditional" behaviour as we'll be able to find. I think that by making the telldir token a 32 bit value instead of a token will fix the problem instantly. No malloc so no need to maintain nor free a token list. Actually, there's no reason the current implementation can't be fixed like this using the bit vector translation mechanism I suggested before. Unfortunately, telldir is documented as returning a 'long', not an off_t, so that places a 26 (+1) bit limit on the magnitued of the index, and thus the directory size. For a token conforming to the documentation, where an intervening directory compaction could screw up the usability of the token because of directory compaction (this limit arises because the directory block buffer in core is a snapshot, taken atomically, of the buffer on disk), a simple linear 32 bit offset can be used, using the off_t from the actual fd open on the dir before the getdirentries() call, plus the offset in the block snapshot of the in core block relative to its start address. Then using the find-block-scan-forward-to-offset-and- backof-on-miss approach (which would work right now in place of the cookies, if you were willing to accept the occasional duplicate entry in the compaction case), the token can be used to restart the search after the dir has been reopened. The biggest danger here is handling the directory truncation which 4.4BSD added to the directory compaction. > Don't be too sure you will win by shaving a couple of arguments off the > function call. All the memory involved is in the cache and writing NULL > to a cached memory location is virtually free. I'm considering the additional overhead of the push of the call arguments as well as the stack structure element assigns in the VOP_READDIR proper (which I assume is what you are talking about?). There is also the cycles burned on the compare/jump in the non-cookie case, and the fact that the code generation is complicated by having another couple of values that it will have to invalidate registers on to generate the code (there is no way of hinting to the compiler that the cookie variables are referenced in the least common case). > I don't agree that replacing a function which reads and parses a > directory block in one operation with two functions, one which reads a > block and one which parses it is code simplification. Logical or algorithmic simplification? The two are not tightly bound. Basically, this is intended only to delay the disk-directory-contents to user-snapshot-of-disk-directory-contents-in-user-format. Nothing more. > > So let's concentrate on the benefits: > > > > The primary benefit is a search restart without use of cookies. > > > > The secondary benefit is an elimination of the malloc in the "on disk > > directory entry larger than the 'cannonical' directory entry" case. > > How is the malloc eliminated? Surely the caller will have to first > malloc space for reading the fs-specific directory block and then parse > that into getdirentries format. It seems to me that in the most common > case (UFS), the caller will do *more* work, not less. Good question. It's eliminated because the buffer pointer in the directory vnode is what is passed around to use in the translation to the user buffer. If I VOP_READDIR a directory block from a CDROM, for instance, the directory block is put in core in a buffer referenced off the vnode. I can later use this buffer directly fro my source when translating the entries to user space. This is actually one case where prevalidation of the target of a copyout is a good idea: it avoids multiple validations in the copyout case. So a local stack "struct dirent *" is all that's used. > > How do we eliminate restart overhead (cookies)? [ ... my stuff elided ... ] > So the client would do something like: > > struct uio auio; > struct iovec aiov; > off_t new_offset; > caddr_t buf; > > /* > * Setup to read the raw directory block. > */ > buf = aiov.iov_base = malloc(dir_block_size); > aiov.iov_len = dir_block_size; > auio.uio_iov = &aiov; > auio.uio_iovcnt = 1; > auio.uio_rw = UIO_READ; > auio.uio_segflg = UIO_SYSSPACE; > auio.uio_procp = p; > auio.uio_resid = dir_block_size; > auio.uio_offset = (fp->f_offset >> entry_bits) << dir_block_shift; > > /* > * Read the block > */ > error = VOP_READDIR(vp, &auio, ...); > > /* > * Parse the block into the user's memory, returning the new offset > * where reading can be restarted. Probably use another struct uio > * here to allow converting into both sys and user space. > */ > error = VOP_DIRCVT(vp, buf, uap->buf, uap->count, &new_offset); > > How do you propose avoiding the extra copy for UFS where the directory > does not need converting? A copy from the kernel space buffer to the user space buffer must always take place, unless you start being tricky about process space mapping. The sneaky bit is that the VOP_READDIR doesn't ever do a copy, it only faults in the directory block in question (if it's not in core) or moves it to the top of the LRU (if it is). The VOP_DIRCVT is the one that takes the uio: the purpose of a uio in the directory case is to describe a movement of data from kernel to user space. So your second comment is correct. The side benefit of handling a directory block that contains more entries than can be returned in a user directory block falls out from the fact that we have changed the restart mechanism: now it is possible to take a single disk directory block and parse it into multiple user dirent entries in the user supplied buffer. The restart mechanism means that we can do this "safely". The next question I'd have, if I were reading this for the first time, is "why not just make the VOP_READDIR act this way itself, throwing out the old VOP_READDIR and the new VOP_DIRCVT. The answer to this is non-obvious: by decoupling the operations, you allow the copy operation to be "delayed". This is an issue of FS layering, more than anything else. If I stack several file system layers on top of each other, the underlying vnode containing the directory is the one I am interested in having translated. So the VOP_READDIR *also* takes a (struct vnode **) to be filled out by the underlying FS that answers the VOP_READDIR request. The VOP_DIRCVT acts on this vnode. For the majority of cases, this is the same vnode as the call-down interface for the top level directory being read, since most FS's are single layer (or integrated layer, like those which use ufs and have statically placed ufs_* calls in their VOP tables). But in the nullfs, lofs, or hypothetical compression or encryption layers, etc., the vp is that of the underlying layer and thus the VOP_DIRCVT associated with the vp is the correct one for the data being retrieved. > Doug Rabson, Microsoft RenderMorphics Ltd. Mail: dfr@render.com PS: Hey! You guys got bought it looks like! Terry Lambert terry@lambert.org --- Any opinions in this posting are my own and not those of my present or previous employers.
Want to link to this message? Use this URL: <https://mail-archive.FreeBSD.org/cgi/mid.cgi?199512152002.NAA04666>