Date: Fri, 15 Dec 1995 12:15:08 +0000 (GMT)
From: Doug Rabson <dfr@render.com>
To: Terry Lambert <terry@lambert.org>
Cc: terry@lambert.org, current@freebsd.org
Subject: Re: VOP_READIR revisited
Message-ID: <Pine.BSF.3.91.951215112604.457L-100000@minnow.render.com>
In-Reply-To: <199512141855.LAA02247@phaeton.artisoft.com>
On Thu, 14 Dec 1995, Terry Lambert wrote:

> This will occur for every LITTLE_ENDIAN (the only place FreeBSD is
> distributed for right now) machine that has been upgraded without
> newfs'ing the disk (ie: old format UFS file systems).  The value of
> mnt_maxsymlinklen is used to flag old vs. new.
>
> There is an additional translation and malloc in the ogetdirentries
> case.
>
> Both the ogetdirentries and the LITTLE_ENDIAN + old format are not
> likely to affect a lot of users.

OK, that makes sense.

> > It has nothing to do with cookies.
>
> I know that.  The search restart isn't an issue except in the NFS
> case.
>
> ...well, and the telldir/seekdir/readdir case, which is currently
> broken according to existing practice, but technically correct
> according to the documentation.  It's broken *because* it doesn't use
> cookies or some other restart mechanism.

I think it is technically broken according to the documentation, but
the documentation implies behaviour which is wrong IMHO.  In any case,
the abnormal usage which WINE made of this has been changed to
something more conservative (assuming they integrate Julian's patch,
that is).

> > Cookies are only allocated if the readdir is called from the NFS
> > server.  The getdirentries syscall doesn't supply the cookie
> > pointer, so no extra work is done for cookies.
> >
> > Why is it better to make the client perform 2 vop calls (READDIR in
> > native format, then DIRCVT into getdirentries standard form)?
>
> Probably the same reason it's "better" to make the client call
> VOP_LOOKUP, where the first thing it does is call cache_lookup, to
> get to the name cache instead of calling cache_lookup in the lookup()
> in vfs_lookup.c and save the VOP_LOOKUP call in the cache hit case.
>
> Just joking.  8-).

And I didn't think we were even talking about name caches B)

> The reason it is better is because the mallocs occur when there is
> a potential buffer size mismatch between the caller and the
> underlying FS.
>
> The actual non-NFS case I was attempting to refer to was the
> cd9660_readdir MALLOC call.  This will be true of any FS where the on
> disk structure is potentially > sizeof(struct dirent).  And indeed,
> there is a useless MALLOC in the cd9660_readdir().
>
> The NFS case itself arises because of the NFS transported dirent
> structure being a different size than the one used internally by BSD.
>
> It turns out that this will generally be the case in non-UFS derived
> directory structures.
>
> Now the call overhead is not significant.  It's more instruction
> overhead in that the block sizes are being passed around and divided
> by, instead of using a power of 2 bit offset and shifting instead of
> dividing (ie: in the numerous cases in the read path).  On average,
> it maxes at 48 clock ticks, and 12 of that is the calls to the VOP
> and the VOCALL, assuming they aren't inlined (which they are supposed
> to be according to the vnode_if.h contents).
>
> I believe we will make up many of the 48 additional cycles in
> avoiding the call overhead in passing the cookie crap around, and in
> the testing for its existence, even in the UFS case.  In the non-UFS
> case, it is an obvious win, since it buys the ability to vastly
> simplify the per FS VOP_READDIR code.  And it buys us functionality
> in the directory block with smaller than struct dirent internal
> coding, which is completely lacking right now.

Don't be too sure you will win by shaving a couple of arguments off
the function call.  All the memory involved is in the cache, and
writing NULL to a cached memory location is virtually free.

I don't agree that replacing a function which reads and parses a
directory block in one operation with two functions, one which reads a
block and one which parses it, is code simplification.
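For concreteness, the cookie test being argued over above follows
roughly the pattern below.  This is a hedged userland model, not the
actual kernel readdir code of the day: the model_readdir name, the toy
entry structure, and the argument layout are all invented for
illustration.

	#include <stdio.h>
	#include <stdlib.h>

	/* Toy directory entry: just a restart offset per entry. */
	struct entry {
		unsigned long off;
	};

	/*
	 * Model of a per-FS readdir.  Entries would be copied to the
	 * caller's buffer here; in addition, and only when the caller
	 * passes a cookies pointer (as the NFS server does), an array
	 * of per-entry restart offsets is allocated and filled in.
	 */
	static int
	model_readdir(const struct entry *ents, int n,
	    unsigned long **cookies, int *ncookies)
	{
		int i;

		/* ... copy entries out, getdirentries style ... */

		if (cookies != NULL) {
			/* NFS server path: one restart cookie per entry. */
			*cookies = malloc(n * sizeof(unsigned long));
			if (*cookies == NULL)
				return (1);
			for (i = 0; i < n; i++)
				(*cookies)[i] = ents[i].off;
			*ncookies = n;
		}
		return (0);
	}

	int
	main(void)
	{
		struct entry dir[3] = { { 12 }, { 36 }, { 64 } };
		unsigned long *cookies;
		int ncookies = 0;

		/* getdirentries-style call: NULL cookies, no extra work. */
		model_readdir(dir, 3, NULL, NULL);

		/* NFS-server-style call: cookies allocated and filled. */
		if (model_readdir(dir, 3, &cookies, &ncookies) == 0) {
			printf("%d cookies, first restart at %lu\n",
			    ncookies, cookies[0]);
			free(cookies);
		}
		return (0);
	}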
> So let's concentrate on the benefits:
>
> The primary benefit is a search restart without use of cookies.
>
> The secondary benefit is an elimination of the malloc in the "on disk
> directory entry larger than the 'canonical' directory entry" case.

How is the malloc eliminated?  Surely the caller will have to first
malloc space for reading the fs-specific directory block and then
parse that into getdirentries format.  It seems to me that in the most
common case (UFS), the caller will do *more* work, not less.

> The tertiary benefit is to support restart of the search in the case
> of blocked directory access (ala UFS directory blocks) where the on
> disk structures are in fact smaller than the 'canonical' directory
> entry.  This case is currently not handled at all at the system call
> layer, since a fully non-sparse directory of this format will
> *require* a restart mid block for each and every canonical block
> returned to the user buffer!
>
> How do we eliminate restart overhead (cookies)?
>
> It turns out that there is a minimum and maximum entry size for any
> on disk format, which we can scope as:
>
>	2^n < min <= 2^(n+1)
>	2^m < max <= 2^(m+1)
>	n <= m
>
> This gives us a range of 1 to B/min entries per block of size B.
>
> For UFS, with a directory block size of 512b and a min directory
> entry size of 12 and a max of 264, this gives us a range of 1-42
> entries per directory block: a total of 6 bits (log2(42)+1 == 6).
>
> With a VOP_DIRCVT, we can vary the number of bits on the decode
> internally to the file system type on a type-specific basis.
>
> This gives us a range of 2^(32-6) to 2^(32-1) entries per directory
> as a limitation, assuming a 32 rather than a 64 bit directory offset.
>
> With this limitation in effect, given a directory vnode and a 32 bit
> offset, we have no need of cookies and can restart the search at any
> point.  We must accept that a UFS directory is limited to 2^26
> entries instead of 2^31 for a 32 bit off_t, or 2^58 instead of 2^63
> for a 64 bit off_t.  This is still *significantly* larger than the
> max number of inodes on a 9G drive by many orders of magnitude (a 9G
> drive could have ~2^34 inodes if it had no superblock, disk slice, or
> partitioning and stored no data and no directory information: a
> pretty useless limitation case).  It is, in fact, 2^24 times larger
> than we could conceivably need on a 9G drive with 64 bit off_t in the
> limitation case.
>
> To avoid confusion: the 6 bit value is a scaled offset, which is
> scaled relative to the minimum size for an entry -- it is *not* a
> lexical offset of "directory entry in block".
>
> The scaled offset is advanced past the entry to be returned so that
> the valid offset prior to the returned index but *after* the index
> prior is the restart location.  This is consistent with the restart
> backoff mechanism inherent in the cookie restart, and is consistent
> with a restart of a readdir following a
> telldir/closedir/opendir/seekdir sequence: the current problem with
> the WINE and SAMBA code.
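To make the offset arithmetic above concrete, here is a minimal
standalone sketch of the encoding Terry describes, using the UFS
numbers (512 byte blocks, 12 byte minimum entry, 6 index bits).  The
dirpos_* helpers are invented for illustration, not proposed kernel
interfaces.

	#include <stdio.h>

	#define ENTRY_BITS	6		/* log2(42)+1 for UFS */
	#define INDEX_MASK	((1UL << ENTRY_BITS) - 1)

	/*
	 * Pack a directory position into one 32 bit offset: the high
	 * bits select the directory block, the low 6 bits hold the
	 * scaled index (byte offset within the block divided by the
	 * 12 byte minimum entry size).
	 */
	static unsigned long
	dirpos_encode(unsigned long block, unsigned long scaled_index)
	{
		return ((block << ENTRY_BITS) | (scaled_index & INDEX_MASK));
	}

	static unsigned long
	dirpos_block(unsigned long off)
	{
		return (off >> ENTRY_BITS);
	}

	static unsigned long
	dirpos_index(unsigned long off)
	{
		return (off & INDEX_MASK);
	}

	int
	main(void)
	{
		/* Entry at byte 36 of block 7: scaled index 36/12 == 3. */
		unsigned long off = dirpos_encode(7, 3);

		printf("offset %#lx -> block %lu, scaled index %lu\n",
		    off, dirpos_block(off), dirpos_index(off));
		return (0);
	}

A restart is then just: seek back to block dirpos_block(off), re-read
it, and skip forward until the scaled index catches up, with no cookie
array needed.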
So the client would do something like:

	struct uio auio;
	struct iovec aiov;
	off_t new_offset;
	caddr_t buf;

	/*
	 * Set up to read the raw directory block.
	 */
	buf = aiov.iov_base = malloc(dir_block_size);
	aiov.iov_len = dir_block_size;
	auio.uio_iov = &aiov;
	auio.uio_iovcnt = 1;
	auio.uio_rw = UIO_READ;
	auio.uio_segflg = UIO_SYSSPACE;
	auio.uio_procp = p;
	auio.uio_resid = dir_block_size;
	auio.uio_offset = (fp->f_offset >> entry_bits) << dir_block_shift;

	/*
	 * Read the block.
	 */
	error = VOP_READDIR(vp, &auio, ...);

	/*
	 * Parse the block into the user's memory, returning the new
	 * offset where reading can be restarted.  Probably use another
	 * struct uio here to allow converting into both sys and user
	 * space.
	 */
	error = VOP_DIRCVT(vp, buf, uap->buf, uap->count, &new_offset);

How do you propose avoiding the extra copy for UFS where the directory
does not need converting?

--
Doug Rabson, Microsoft RenderMorphics Ltd.	Mail:  dfr@render.com
						Phone: +44 171 251 4411
						FAX:   +44 171 251 0939