Date: Thu, 26 Sep 2002 11:59:36 -0700 From: Terry Lambert <tlambert2@mindspring.com> To: David Malone <dwmalone@maths.tcd.ie> Cc: freebsd-current@FreeBSD.ORG Subject: Re: Journaled filesystem in CURRENT Message-ID: <3D935918.68C854E0@mindspring.com> References: <200209261930.aa30657@salmon.maths.tcd.ie>
next in thread | previous in thread | raw e-mail | index | archive | help
David Malone wrote: > > > On Thu, Sep 26, 2002 at 10:36:27AM -0700, Terry Lambert wrote: > > > > I think that what you were probably testing was directory entry > > > > layout and O(N) (linear) vs. O(log2(N)+1) search times for both > > > > non-existant entries on creates, and for any entry on lookup > > > > ( / 2 on lookup) . > > > > > > Though dirhash should eliminate most of this... > > > Everybody alsways says that, and then backs off, when they realize > > that a traversal of a mail queue of 100,000 entries, in which the > > destination is known by the contents of the file, rather than the > > file name, is involved. 8-). > > If you are searching based on contents of a file, then any directory > layout scheme will require mean N/2 probes on success and N on > failure surely? And if these probes are linear (ie. in the order > they are in the directory) then this really is O(N) both with and > without dirhash 'cos the probles will be O(1). ~O((N^4)/8)/2, actually. You linearly traverse for the queue element files, and then the queue elelement files tell you name of the queue content file, which you you have to look up. So it's a combined traversal and lookup on the same directory (in fact: a "dirhash buster", with some of the least optimal behaviour possible). There are two additional lookups, which occur to unlink the queue entry file, and the message file, so it's really (for n queue entries, which means twice that many directory entries): N = n*2 ---- \ > O(N) * O(N/2) * O(N/2) * O((N-1)/2) / ---- N = 0 (Assuming successful delivery and removal of the queue files on each element iterated). The way this is "fixed" in ext3 or most JFS implementations (both XFS and IBM's OS/2 JFS for Linux) is that the linear traversal is linear... meaning you don't restart the scan each time... and the explicit file lookup is O(log2(N+1)). N * log2(N+1)^3 is significantly smaller than (N^4)/8 (in case you were wondering about the "/8", it's because statistically, you only have to traverse 50% of the directory entries, on average, for a linear lookup that results in a "hit", but that only applies to explicit lookups, not the traversal); the result is (again for n queue entries): N = n*2 ---- \ > O(N) * O(log2(N+1)) * O(log2(N+1)) * O(log2(N)) / ---- N = 0 There are other data structures that could reduce this further than btree, FWIW, but implementing them in directories is moderately hard because of metadata ordering guarantees, and directory entry locking. Still, it's probably worth doing, if you can figure out a way to eliminate the need for directory vnode locking for modification operations (or can make them over into range-lock operations, instead). One "obvious" fix is to time-order file creations, to try and keep block locality "close" to time locality (i.e., if you are going to create 2 files (f1,f2) in some time interval [t1..t2], then you try to guarantee that the directory entry block that contains f2 is after the cone that contains f1, so that a linear progressive search from the current linear traversal location that resulted in f1 being found is likely to find f2, either immediately, or at least within the next block or two). The problem with doing this is the inability to ensure that the file you are creating does not exist... without a full traversal. This requires that there is some "cooperation" involved, so that the lookup traversal is picked up following the current offset of the linear traversal in progress. It also fails, if simultaneos traversals occur... assuming that the offset is not maintained per-process, but instead, per directory. If it's per-process, it actually works out, but only because each process in the sendmail case only has one queue run going on at a time. Note that none of this accounts for the queue entry creation of two additional files; that's a O(4*N+1), since create requires that the file not already exist (and is not helped by dirhash at all, being linear, by definition). For a btree, that's only O(2*log2(N+1)+2) for the two insertions. -- Terry To Unsubscribe: send mail to majordomo@FreeBSD.org with "unsubscribe freebsd-current" in the body of the message
Want to link to this message? Use this URL: <https://mail-archive.FreeBSD.org/cgi/mid.cgi?3D935918.68C854E0>