Date:      Thu, 14 Dec 1995 11:55:06 -0700 (MST)
From:      Terry Lambert <terry@lambert.org>
To:        dfr@render.com (Doug Rabson)
Cc:        terry@lambert.org, current@freebsd.org
Subject:   Re: VOP_READIR revisited
Message-ID:  <199512141855.LAA02247@phaeton.artisoft.com>
In-Reply-To: <Pine.BSF.3.91.951214170028.457H-100000@minnow.render.com> from "Doug Rabson" at Dec 14, 95 05:08:23 pm

> > The VOP_READDIR call implements the getdirentries (getdents if you are
> > a POSIX compliant OS) system call.  It returns the entries in a file
> > system independent structure, and uses a "cookie" mechanism to allow
> > the search to restart on non-directory block entry boundaries... this
> > is typically used for NFS single entry and repositioning operations.
> > 
> > 
> > The ufs_readdir version of VOP_READDIR for UFS derived file systems
> > MALLOC's cookie buffers, and in general wreaks havoc.
> 
> 
> The UFS readdir mallocs space for the format conversion if
> BYTE_ORDER == LITTLE_ENDIAN && ap->a_vp->v_mount->mnt_maxsymlinklen > 0
> 
> As far as I can see, this is something to do with byte swapping the 
> on-disc data structure which appears to be big-endian in this case.

This will occur on every LITTLE_ENDIAN machine (the only kind FreeBSD is
distributed for right now) that has been upgraded without newfs'ing the
disk (i.e., old format UFS file systems).  The value of
mnt_maxsymlinklen is used to flag old vs. new.

There is an additional translation and malloc in the ogetdirentries case.

Neither the ogetdirentries case nor the LITTLE_ENDIAN + old format case
is likely to affect a lot of users.

> It has nothing to do with cookies.

I know that.  The search restart isn't an issue except in the NFS case.

...well, and the telldir/seekdir/readdir case, which is currently broken
according to existing practice, but technically correct according to
the documentation.  It's broken *because* it doesn't use cookies or
some other restart mechanism.

> Cookies are only allocated if the readdir is called from the NFS
> server.  The getdirentries syscall doesn't supply the cookie pointer,
> so no extra work is done for cookies.
> 
> Why is it better to make the client perform 2 vop calls (READDIR in 
> native format, then DIRCVT into getdirentries standard form)?

Probably the same reason it's "better" to make the client call VOP_LOOKUP,
where the first thing it does is call cache_lookup, to get to the name
cache instead of calling cache_lookup in the lookup() in vfs_lookup.c
and save the VOP_LOOKUP call in the cache hit case.  Just joking.  8-).


The reason it is better is because the mallocs occur when there is
a potential buffer size mismatch between the caller and the underlying
FS.

The actual non-NFS case I was attempting to refer to was the cd9660_readdir
MALLOC call.  This will be true of any FS where the on disk structure is
potentially > sizeof(struct dirent).  And indeed, there is a useless
MALLOC in the cd9660_readdir().

The NFS case itself arises because of the NFS transported dirent structure
being a different size than the one used internally by BSD.

It turns out that this will generally be the case in non-UFS derived
directory structures.



Now the call overhead is not significant.  There is more instruction
overhead in the block sizes being passed around and divided by, instead
of using a power of 2 bit offset and shifting (i.e., in the numerous
cases in the read path).  The call overhead maxes out at about 48
clock ticks, and 12 of that is the calls to the VOP and the VOCALL,
assuming they aren't inlined (which they are supposed to be according
to the vnode_if.h contents).
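
To illustrate the divide vs. shift point, here is a minimal sketch.  The
helper names (bsize_to_shift, blkno_div, blkno_shift, blkoff_mask) are
made up for this example and are not taken from the kernel sources; the
only assumption is that the block size is a power of 2.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical helper: log2 of a block size that must be a power of 2. */
static unsigned
bsize_to_shift(uint32_t bsize)
{
	unsigned shift = 0;

	assert(bsize != 0 && (bsize & (bsize - 1)) == 0);
	while ((bsize >> shift) != 1)
		shift++;
	return shift;
}

/* Block number by division: what passing a raw block size around forces. */
static uint32_t
blkno_div(uint32_t off, uint32_t bsize)
{
	return off / bsize;
}

/* Block number by shift: needs only the precomputed shift count. */
static uint32_t
blkno_shift(uint32_t off, unsigned shift)
{
	return off >> shift;
}

/* Offset within the block, using a mask instead of a modulo. */
static uint32_t
blkoff_mask(uint32_t off, unsigned shift)
{
	return off & ((UINT32_C(1) << shift) - 1);
}
```

Both forms give the same block number; the shift form just needs the
shift count computed once (e.g., at mount time) rather than a divide on
every pass through the read path.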

I believe we will make up many of the 48 additional cycles in avoiding
the call overhead in passing the cookie crap around, and in the testing
for its existence, even in the UFS case.  In the non-UFS case, it is
an obvious win, since it buys the ability to vastly simplify the per FS
VOP_READDIR code.  And it buys us functionality for directory blocks
whose internal encoding is smaller than struct dirent, which is
completely lacking right now.


So let's concentrate on the benefits:

The primary benefit is a search restart without use of cookies.

The secondary benefit is an elimination of the malloc in the "on disk
directory entry larger than the 'canonical' directory entry" case.

The tertiary benefit is support for restarting the search in the
case of blocked directory access (a la UFS directory blocks) where the
on disk structures are in fact smaller than the 'canonical' directory
entry.  This case is currently not handled at all at the system call
layer, since a fully non-sparse directory of this format will *require*
a restart mid block for each and every canonical block returned to
the user buffer!


How do we eliminate restart overhead (cookies)?

It turns out that there is a minimum and maximum entry size for any on
disk format, which we can scope as:

	2^n < min <= 2^(n+1)
	2^m < max <= 2^(m+1)
	n <= m

This gives us a range of 1 to B/min entries per block, where B is the
directory block size.

For UFS, with a directory block size of 512 bytes and a min directory
entry size of 12 and a max of 264, this gives us a range of 1-42 entries
per directory block: a total of 6 bits (floor(log2(42))+1 == 6).
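
The arithmetic above can be checked with a short sketch.  The constants
are the UFS figures quoted in this message; the function names are
invented for the example.

```c
#include <assert.h>

/* UFS figures quoted in the message. */
enum { DIRBLK = 512, MIN_ENT = 12, MAX_ENT = 264 };

/* Number of entries of a given size that fit in one directory block. */
static unsigned
entries_per_block(unsigned blksize, unsigned ent_size)
{
	assert(ent_size != 0 && ent_size <= blksize);
	return blksize / ent_size;
}

/* Bits needed to represent the values 0..n: floor(log2(n)) + 1. */
static unsigned
bits_for(unsigned n)
{
	unsigned bits = 0;

	assert(n != 0);
	while (n != 0) {
		bits++;
		n >>= 1;
	}
	return bits;
}
```

With these, entries_per_block(DIRBLK, MIN_ENT) is 42,
entries_per_block(DIRBLK, MAX_ENT) is 1, and bits_for(42) is 6.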

With a VOP_DIRCVT, we can vary the number of bits on the decode internally
to the file system type on a type-specific basis.

This gives us a range of 2^(32-6) to 2^(32-1) entries per directory as
a limitation, assuming a 32 rather than a 64 bit directory offset.

With this limitation in effect, given a directory vnode and a 32 bit
offset, we have no need of cookies and can restart the search at any
point.  We must accept that a UFS directory is limited to 2^26 entries
instead of 2^31 for a 32 bit off_t, or 2^58 instead of 2^63 for a 64
bit off_t.  This is still *significantly* larger than the max number
of inodes on a 9G drive by many orders of magnitude (a 9G drive could
have ~2^34 inodes if it had no superblock, disk slice, or partitioning
and stored no data and no directory information: a pretty useless
limitation case).  It is, in fact, 2^24 times larger than we could
conceivably need on a 9G drive with 64 bit off_t in the limitation case.
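
As a sketch of what such an offset might look like for UFS: the layout
below (directory block number in the high 26 bits, 6 bit scaled offset
in the low bits) is an assumption made for illustration, as are the
names diroff_pack, diroff_blk and diroff_slot.  The message does not
fix which end of the word holds the scaled offset.

```c
#include <assert.h>
#include <stdint.h>

/* Assumed layout: [ 26 bit directory block number | 6 bit scaled slot ] */
#define SLOT_BITS	6u
#define SLOT_MASK	((UINT32_C(1) << SLOT_BITS) - 1)

/* Build a 32 bit directory offset from a block number and a scaled slot. */
static uint32_t
diroff_pack(uint32_t blk, uint32_t slot)
{
	assert(slot <= SLOT_MASK);
	assert(blk < (UINT32_C(1) << (32 - SLOT_BITS)));
	return (blk << SLOT_BITS) | slot;
}

/* Recover the directory block number from a packed offset. */
static uint32_t
diroff_blk(uint32_t off)
{
	return off >> SLOT_BITS;
}

/* Recover the scaled slot within the block from a packed offset. */
static uint32_t
diroff_slot(uint32_t off)
{
	return off & SLOT_MASK;
}
```

Because the offset alone identifies both the block and the position in
it, nothing besides the directory vnode and the offset needs to be kept
between calls.  A different FS type would pick its own SLOT_BITS.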

To avoid confusion: the 6 bit value is a scaled offset, which is scaled
relative to the minimum size for an entry -- it is *not* a lexical offset
of "directory in block".

The scaled offset is advanced past the entry to be returned so that the
valid offset prior to the returned index but *after* the index prior is
the restart location.  This is consistent with the restart backoff
mechanism inherent in the cookie restart, and is consistent with a
restart of a readdir following a telldir/closedir/opendir/seekdir
sequence: the current problem with the WINE and SAMBA code.
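
One way to read the scaled offset rule, as a rough sketch only: a
directory block is modelled here as a plain array of record lengths, the
slot is the byte offset divided by the minimum entry size, and a restart
resumes at the first entry starting at or after slot * MIN_ENT.  The
names slot_for and restart_offset and this exact rounding rule are my
assumptions for the example.  It works because every entry is at least
MIN_ENT bytes long, so no two entries can start inside the same MIN_ENT
sized window.

```c
#include <assert.h>
#include <stddef.h>

enum { MIN_ENT = 12 };	/* minimum UFS entry size quoted in the message */

/* Scaled slot to hand back for an entry starting at byte_off in the block. */
static unsigned
slot_for(unsigned byte_off)
{
	return byte_off / MIN_ENT;
}

/*
 * Walk the record lengths of one block from its start and return the
 * byte offset of the first entry that begins at or after slot * MIN_ENT.
 * Returns the total block length if there is no such entry.
 */
static unsigned
restart_offset(const unsigned *reclen, size_t nent, unsigned slot)
{
	unsigned off = 0;
	size_t i;

	for (i = 0; i < nent; i++) {
		assert(reclen[i] >= MIN_ENT);
		if (off >= slot * MIN_ENT)
			return off;
		off += reclen[i];
	}
	return off;
}
```

For a block with record lengths 12, 16, 12, 472 (entries at byte offsets
0, 12, 28 and 40), the entry at 28 gets slot 2 and the entry at 40 gets
slot 3, and restarting from either slot lands back on that same entry.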



					Regards,
					Terry Lambert
					terry@lambert.org
---
Any opinions in this posting are my own and not those of my present
or previous employers.


