Date: Tue, 29 May 2001 16:32:46 +1000
From: Peter Jeremy <peter.jeremy@alcatel.com.au>
To: hackers@FreeBSD.ORG
Subject: Re: technical comparison
Message-ID: <20010529163246.H89950@gsmx07.alcatel.com.au>
On Sun, 27 May 2001 22:50:48 -0300 (BRST), Rik van Riel <riel@conectiva.com.br> wrote:
>On Sat, 26 May 2001, Peter Wemm wrote:
>> Which is more expensive? Maintaining an on-disk hashed (or b+tree)
>> directory format for *everything* or maintaining a simple low-cost
>> format on disk with in-memory hashing for fast lookups?
>
>I bet that for modest directory sizes the cost of disk IO outweighs
>the added CPU usage by so much that you may as well take the trouble
>of using the more scalable directory format.

I'm not sure I follow this. Reading sequentially is always going to be much faster than reading randomly. For a modest directory, you run the risk that randomly accessing fewer blocks will actually take longer than just reading the entire directory sequentially.

>> For the small directory case I suspect the FFS+namecache way is more
>> cost effective. For the medium to large directory case (10,000 to
>> 100,000 entries), I suspect the FFS+namecache method isn't too shabby,
>> providing you are not starved for memory. For the insanely large
>> cases - I dont want to think about :-).
>
>The ext2 fs, which uses roughly the same directory structure as
>UFS and has a name cache which isn't limited in size, seems to
>bog down at about 10,000 directory entries.

As has been pointed out earlier, hash algorithms need a `maximum number of entries' parameter as part of their design. Beyond some point, defined by that number, lookups degenerate to (typically) O(N) as the hash chains grow. It sounds like the Linux name cache hashing algorithm was not sized to handle so many directory entries.

>Daniel Phillips is working on a hash extension to ext2; not a
>replacement of the directory format, but a way to tack a hashed
>index after the normal directory index.

I think a tree structure is better than a hash because it places no inherent limit on the directory size (though the downside is O(log N) lookups rather than close to fixed time).
It may be possible to build a tree structure around the UFS directory block structure in such a way that it would be backward compatible[1]. Of course, managing to correctly handle soft-updates write ordering for a tree re-balance is non-trivial.

One point that hasn't come out so far is that reading a UFS directory is quite easy - hence boot2 can locate a loader or kernel by name within the root filesystem, rather than needing to hardwire block numbers to load. If the directory structure does change, we need to ensure that it's still possible to (perhaps inefficiently) parse the structure in a fairly small amount of code.

>It also has the advantage of being able to keep using the
>tried&tested fsck utilities.

Whatever is done, fsck would need to be enhanced to validate the directory structure; otherwise you could wind up with files that can't be found or deleted because they aren't where the hash/tree algorithm expects them.

Suggestion for the "let's use the filesystem as a general-purpose relational database" crowd: a userland implementation of the existing directory search scheme (ignoring name caching) would be trivial (see /usr/include/ufs/ufs/dir.h and dir(5) for details). Modify postmark (or similar) to simulate the creation/deletion of files in a userland `directory' structure, and demonstrate an algorithm that is faster for the massive directory case and doesn't pessimize small directories. The effects of the name cache and datafile I/O can be ignored, since you just want to compare directory algorithms.

[1] Keep entries within each block sorted. Reserve space at the end of the block for left and right child branch pointers and other overheads, with the left branch holding entries less than the first entry in the block and the right branch entries greater than the last. The reserved space is counted in the d_reclen of the last entry (which makes it backward compatible). I haven't thought through the block splitting/merging algorithm, so this may not work.
Peter