Date: Thu, 14 Dec 1995 11:55:06 -0700 (MST)
From: Terry Lambert <terry@lambert.org>
To: dfr@render.com (Doug Rabson)
Cc: terry@lambert.org, current@freebsd.org
Subject: Re: VOP_READIR revisited
Message-ID: <199512141855.LAA02247@phaeton.artisoft.com>
In-Reply-To: <Pine.BSF.3.91.951214170028.457H-100000@minnow.render.com> from "Doug Rabson" at Dec 14, 95 05:08:23 pm

> > The VOP_READDIR call implements the getdirentries (getdents if you are
> > a POSIX compliant OS) system call.  It returns the entries in a file
> > system independent structure, and uses a "cookie" mechanism to allow
> > the search to restart on non-directory block entry boundaries... this
> > is typically used for NFS single entry and repositioning operations.
> >
> >
> > The ufs_readdir version of VOP_READDIR for UFS derived file systems
> > MALLOC's cookie buffers, and in general wreaks havoc.
> >
>
> The UFS readdir mallocs space for the format conversion if
> BYTE_ORDER == LITTLE_ENDIAN && ap->a_vp->v_mount->mnt_maxsymlinklen > 0
>
> As far as I can see, this is something to do with byte swapping the
> on-disc data structure which appears to be big-endian in this case.

This will occur for every LITTLE_ENDIAN (the only place FreeBSD is
distributed for right now) machine that has been upgraded without
newfs'ing the disk (ie: old format UFS file systems).  The value of
mnt_maxsymlinklen is used to flag old vs. new.

There is an additional translation and malloc in the ogetdirentries
case.

Neither the ogetdirentries case nor the LITTLE_ENDIAN + old format case
is likely to affect a lot of users.

> It has nothing to do with cookies.

I know that.  The search restart isn't an issue except in the NFS case.

...well, and the telldir/seekdir/readdir case, which is currently
broken according to existing practice, but technically correct
according to the documentation.  It's broken *because* it doesn't use
cookies or some other restart mechanism.

> Cookies are only allocated if the readdir is called from the NFS
> server.  The getdirentries syscall doesn't supply the cookie pointer,
> so no extra work is done for cookies.
>
> Why is it better to make the client perform 2 vop calls (READDIR in
> native format, then DIRCVT into getdirentries standard form)?

Probably the same reason it's "better" to make the client call
VOP_LOOKUP, where the first thing it does is call cache_lookup to get
to the name cache, instead of calling cache_lookup in lookup() in
vfs_lookup.c and saving the VOP_LOOKUP call in the cache hit case.

Just joking.  8-).

The reason it is better is that the mallocs occur when there is a
potential buffer size mismatch between the caller and the underlying
FS.  The actual non-NFS case I was attempting to refer to was the
cd9660_readdir MALLOC call.  This will be true of any FS where the on
disk structure is potentially > sizeof(struct dirent).  And indeed,
there is a useless MALLOC in cd9660_readdir().

The NFS case itself arises because the NFS transported dirent structure
is a different size than the one used internally by BSD.  It turns out
that this will generally be the case in non-UFS derived directory
structures.

Now the call overhead is not significant.  It's more instruction
overhead that the block sizes are being passed around and divided by,
instead of using a power of 2 bit offset and shifting (ie: in the
numerous cases in the read path).  On average, it maxes at 48 clock
ticks, and 12 of that is the calls to the VOP and the VOCALL, assuming
they aren't inlined (which they are supposed to be according to the
vnode_if.h contents).

I believe we will make up many of the 48 additional cycles in avoiding
the call overhead in passing the cookie crap around, and in the testing
for its existence, even in the UFS case.

In the non-UFS case, it is an obvious win, since it buys the ability to
vastly simplify the per FS VOP_READDIR code.  And it buys us
functionality in the directory block with smaller than struct dirent
internal coding, which is completely lacking right now.

So let's concentrate on the benefits:

The primary benefit is a search restart without use of cookies.

The secondary benefit is an elimination of the malloc in the "on disk
directory entry larger than the 'canonical' directory entry" case.

The tertiary benefit is to support restart of the search in the case of
blocked directory access (ala UFS directory blocks) where the on disk
structures are in fact smaller than the 'canonical' directory entry.
This case is currently not handled at all at the system call layer,
since a fully non-sparse directory of this format will *require* a
restart mid block for each and every canonical block returned to the
user buffer!

How do we eliminate restart overhead (cookies)?

It turns out that there is a minimum and maximum entry size for any on
disk format, which we can scope as:

	2^n < min <= 2^(n+1)
	2^m < max <= 2^(m+1)
	n <= m

This gives us a range of 1 to B/min entries per block, where B is the
directory block size.  For UFS, with a directory block size of 512
bytes and a min directory entry size of 12 and a max of 264, this gives
us a range of 1-42 entries per directory block: a total of 6 bits
(log2(42)+1 == 6).

With a VOP_DIRCVT, we can vary the number of bits on the decode
internally to the file system type on a type-specific basis.

This gives us a range of 2^(32-6) to 2^(32-1) entries per directory as
a limitation, assuming a 32 rather than a 64 bit directory offset.

With this limitation in effect, given a directory vnode and a 32 bit
offset, we have no need of cookies and can restart the search at any
point.

We must accept that a UFS directory is limited to 2^26 entries instead
of 2^31 for a 32 bit off_t, or 2^58 instead of 2^63 for a 64 bit off_t.
This is still *significantly* larger than the max number of inodes on a
9G drive by many orders of magnitude (a 9G drive could have ~2^34
inodes if it had no superblock, disk slice, or partitioning and stored
no data and no directory information: a pretty useless limitation
case).  It is, in fact, 2^24 times larger than we could conceivably
need on a 9G drive with 64 bit off_t in the limitation case.

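As a rough illustration of the arithmetic above, here is a minimal C
sketch.  The names dir_max_entries_per_block, dir_slot_bits,
dir_pack_offset, dir_unpack_block and dir_unpack_slot are hypothetical
and do not exist in the FreeBSD tree.  The layout (block index in the
high bits, per-block value in the low bits) is only one assumed way to
split a 32 bit directory offset, not the VOP_DIRCVT proposal itself.

```c
#include <assert.h>
#include <stdint.h>

/* Upper bound on entries in one directory block: B / min, rounded down. */
uint32_t
dir_max_entries_per_block(uint32_t block_size, uint32_t min_entry_size)
{
	assert(min_entry_size != 0);
	return block_size / min_entry_size;
}

/* Bits needed to hold a value up to max_entries: floor(log2(v)) + 1. */
unsigned
dir_slot_bits(uint32_t max_entries)
{
	unsigned bits = 0;

	while (max_entries != 0) {
		bits++;
		max_entries >>= 1;
	}
	return bits;
}

/*
 * Assumed layout: directory block index in the high bits, per-block
 * value ("slot") in the low slot_bits bits of a 32 bit offset.
 */
uint32_t
dir_pack_offset(uint32_t block, uint32_t slot, unsigned slot_bits)
{
	assert(slot_bits > 0 && slot_bits < 32);
	assert(slot < (UINT32_C(1) << slot_bits));
	assert(block < (UINT32_C(1) << (32 - slot_bits)));
	return (block << slot_bits) | slot;
}

/* Recover the directory block index from a packed offset. */
uint32_t
dir_unpack_block(uint32_t off, unsigned slot_bits)
{
	return off >> slot_bits;
}

/* Recover the per-block value from a packed offset. */
uint32_t
dir_unpack_slot(uint32_t off, unsigned slot_bits)
{
	return off & ((UINT32_C(1) << slot_bits) - 1);
}
```

With the UFS numbers used here, dir_max_entries_per_block(512, 12)
gives 42 and dir_slot_bits(42) gives 6, which leaves 32 - 6 = 26 bits
for the block index.  Since both halves come back out with a shift and
a mask, the restart position needs no state beyond the offset itself.
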
To avoid confusion: the 6 bit value is a scaled offset, which is scaled
relative to the minimum size for an entry -- it is *not* a lexical
offset of "directory in block".  The scaled offset is advanced past the
entry to be returned so that the valid offset prior to the returned
index but *after* the index prior is the restart location.  This is
consistent with the restart backoff mechanism inherent in the cookie
restart, and is consistent with a restart of a readdir following a
telldir/closedir/opendir/seekdir sequence: the current problem with the
WINE and SAMBA code.

Regards,
Terry Lambert
terry@lambert.org
---
Any opinions in this posting are my own and not those of my present
or previous employers.