From owner-freebsd-current Mon Dec 18 13:16:21 1995 Return-Path: owner-current Received: (from root@localhost) by freefall.freebsd.org (8.7.3/8.7.3) id NAA20187 for current-outgoing; Mon, 18 Dec 1995 13:16:21 -0800 (PST) Received: from minnow.render.com (render.demon.co.uk [158.152.30.118]) by freefall.freebsd.org (8.7.3/8.7.3) with SMTP id NAA20178 for ; Mon, 18 Dec 1995 13:16:03 -0800 (PST) Received: (from dfr@localhost) by minnow.render.com (8.6.9/8.6.9) id NAA24452; Mon, 18 Dec 1995 13:46:54 GMT Date: Mon, 18 Dec 1995 13:46:53 +0000 (GMT) From: Doug Rabson To: Terry Lambert cc: terry@lambert.org, current@FreeBSD.ORG Subject: Re: VOP_READIR revisited In-Reply-To: <199512152002.NAA04666@phaeton.artisoft.com> Message-ID: MIME-Version: 1.0 Content-Type: TEXT/PLAIN; charset=US-ASCII Sender: owner-current@FreeBSD.ORG Precedence: bulk On Fri, 15 Dec 1995, Terry Lambert wrote: > > > I know that. The search restart isn't an issue except in the NFS case. > > > > > > ...well, and the telldir/seekdir/readdir case, which is currently broken > > > according to existing practice, but technically correct according to > > > the documentation. It's broken *because* it doesn't use cookies or > > > some other restart mechanism. > > > > I think it is technically broken according to the documentation but the > > documentation implies behaviour which is wrong IMHO. In any case the > > abnormal usage which WINE made of this has been changed to something more > > conservative (assuming they integrate Julian's patch that is). > > I agree with this. I'd like that case to operate as it does on SunOS, > since that's as close to a "traditional" behaviour as we'll be able to > find. > > I think that by making the telldir token a 32 bit value instead of a > token will fix the problem instantly. No malloc so no need to maintain > nor free a token list. That would be nice. The current code is disgusting. > > Actually, there's no reason the current implementation can't be fixed > like this using the bit vector translation mechanism I suggested before. > Unfortunately, telldir is documented as returning a 'long', not an off_t, > so that places a 26 (+1) bit limit on the magnitued of the index, and > thus the directory size. Ugh. How embarrasing! A limitation of 134 million entries per directory :-) > > For a token conforming to the documentation, where an intervening > directory compaction could screw up the usability of the token because > of directory compaction (this limit arises because the directory block > buffer in core is a snapshot, taken atomically, of the buffer on disk), > a simple linear 32 bit offset can be used, using the off_t from the > actual fd open on the dir before the getdirentries() call, plus the > offset in the block snapshot of the in core block relative to its > start address. Then using the find-block-scan-forward-to-offset-and- > backof-on-miss approach (which would work right now in place of the > cookies, if you were willing to accept the occasional duplicate entry > in the compaction case), the token can be used to restart the search > after the dir has been reopened. > > The biggest danger here is handling the directory truncation which > 4.4BSD added to the directory compaction. > > > > Don't be too sure you will win by shaving a couple of arguments off the > > function call. All the memory involved is in the cache and writing NULL > > to a cached memory location is virtually free. > > I'm considering the additional overhead of the push of the call arguments > as well as the stack structure element assigns in the VOP_READDIR proper > (which I assume is what you are talking about?). There is also the cycles > burned on the compare/jump in the non-cookie case, and the fact that the > code generation is complicated by having another couple of values that > it will have to invalidate registers on to generate the code (there is > no way of hinting to the compiler that the cookie variables are referenced > in the least common case). The structure element assigns are stores of const values to stack memory which, I claim, is virtually free. The compare jump is 'less free' but the processor's branch prediction might help. The compiler's register usage should not be affected since that is the last basic block in the function. Not that gcc does a particularly good job in register allocation for the x86 anyway :-(. > > > I don't agree that replacing a function which reads and parses a > > directory block in one operation with two functions, one which reads a > > block and one which parses it is code simplification. > > Logical or algorithmic simplification? The two are not tightly bound. > > Basically, this is intended only to delay the disk-directory-contents > to user-snapshot-of-disk-directory-contents-in-user-format. Nothing > more. OK, I think. > > > > > So let's concentrate on the benefits: > > > > > > The primary benefit is a search restart without use of cookies. > > > > > > The secondary benefit is an elimination of the malloc in the "on disk > > > directory entry larger than the 'cannonical' directory entry" case. > > > > How is the malloc eliminated? Surely the caller will have to first > > malloc space for reading the fs-specific directory block and then parse > > that into getdirentries format. It seems to me that in the most common > > case (UFS), the caller will do *more* work, not less. > > Good question. > > It's eliminated because the buffer pointer in the directory vnode is > what is passed around to use in the translation to the user buffer. > > If I VOP_READDIR a directory block from a CDROM, for instance, the > directory block is put in core in a buffer referenced off the vnode. > > I can later use this buffer directly fro my source when translating the > entries to user space. > > This is actually one case where prevalidation of the target of a > copyout is a good idea: it avoids multiple validations in the copyout > case. So a local stack "struct dirent *" is all that's used. > > > > How do we eliminate restart overhead (cookies)? > > [ ... my stuff elided ... ] > > > [code deleted] > > > > How do you propose avoiding the extra copy for UFS where the directory > > does not need converting? > > A copy from the kernel space buffer to the user space buffer must always > take place, unless you start being tricky about process space mapping. > > The sneaky bit is that the VOP_READDIR doesn't ever do a copy, it only > faults in the directory block in question (if it's not in core) or moves > it to the top of the LRU (if it is). The VOP_DIRCVT is the one that > takes the uio: the purpose of a uio in the directory case is to describe > a movement of data from kernel to user space. So your second comment > is correct. I understand now. I actually quite like this idea now. It does make the interface cleaner. > > > The side benefit of handling a directory block that contains more entries > than can be returned in a user directory block falls out from the fact > that we have changed the restart mechanism: now it is possible to take > a single disk directory block and parse it into multiple user dirent > entries in the user supplied buffer. The restart mechanism means that > we can do this "safely". > > > The next question I'd have, if I were reading this for the first time, > is "why not just make the VOP_READDIR act this way itself, throwing out > the old VOP_READDIR and the new VOP_DIRCVT. My thoughts exactly. > > The answer to this is non-obvious: by decoupling the operations, you > allow the copy operation to be "delayed". This is an issue of FS > layering, more than anything else. If I stack several file system > layers on top of each other, the underlying vnode containing the > directory is the one I am interested in having translated. So the > VOP_READDIR *also* takes a (struct vnode **) to be filled out by the > underlying FS that answers the VOP_READDIR request. The VOP_DIRCVT > acts on this vnode. For the majority of cases, this is the same > vnode as the call-down interface for the top level directory being > read, since most FS's are single layer (or integrated layer, like > those which use ufs and have statically placed ufs_* calls in their > VOP tables). But in the nullfs, lofs, or hypothetical compression > or encryption layers, etc., the vp is that of the underlying layer > and thus the VOP_DIRCVT associated with the vp is the correct one > for the data being retrieved. Makes sense, I think. My head hurts. > > > Doug Rabson, Microsoft RenderMorphics Ltd. Mail: dfr@render.com > > PS: Hey! You guys got bought it looks like! 'Fraid so. I haven't had the mind control device implanted yet. At least, I don't think I have. Would I know? Help? -- Doug Rabson, Microsoft RenderMorphics Ltd. Mail: dfr@render.com Phone: +44 171 251 4411 FAX: +44 171 251 0939