Date:      Sun, 26 Jan 2003 21:58:23 -0800
From:      Terry Lambert <tlambert2@mindspring.com>
To:        kientzle@acm.org
Cc:        Matthew Dillon <dillon@apollo.backplane.com>, Sean Hamilton <sh@bel.bc.ca>, hackers@FreeBSD.ORG
Subject:   Re: Random disk cache expiry
Message-ID:  <3E34CA7F.D0E0A260@mindspring.com>
References:  <000501c2c4dd$f43ed450$16e306cf@slugabed.org> <200301261931.h0QJVCp8052101@apollo.backplane.com> <3E348B51.6F4D6096@mindspring.com> <200301270142.h0R1guR3070182@apollo.backplane.com> <3E3494CC.5895492D@mindspring.com> <3E34A6BB.2090601@acm.org>

Tim Kientzle wrote:
> Cycling through large data sets is not really that uncommon.
> I do something like the following pretty regularly:
>     find /usr/src -type f | xargs grep function_name
> 
> Even scanning through a large dataset once can really hurt
> competing applications on the same machine by flushing
> their data from the cache for no gain.  I think this
> is where randomized expiration might really win, by reducing the
> penalty for disk-cache-friendly applications who are competing
> with disk-cache-unfriendly applications.

Let's set aside for a minute that this discussion does not
-- can not -- apply to directory metadata, and that the limiting
factor in your example is going to be the divorce of inode data
from vnodes in the UFS/UFS2/EXT2 ihash code (or, if there is a
low maximum vnode count, the vnodes themselves).  In either the
ihash divorce case or a vnode LRU, the association between the
cached data and the FS object is lost, which effectively loses
the cached data, no matter what: it can't be recovered, even if
it's still in memory, because nothing references it or remembers
which pages it lives in.

Let's say, instead, that we are only talking about the data
being grep'ed.

There are a million ways to "win" against such applications;
one obvious one is to establish a "cache time to live" for any
cached data, and not evict it until its time to live has passed
(effectively establishing two zones: one for cached data, and
one for transient data that would be cached, if you had the RAM
for it).
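
Something like the following is all I mean by the two zones; the
names and structures are made up for illustration (this is not how
the actual VM code is organized):

    /*
     * Minimal sketch of the TTL idea (hypothetical names, not the
     * actual FreeBSD VM structures): a page that has not outlived
     * its protection window is simply not an eviction candidate,
     * so a one-shot sequential scan can't push the working set out.
     */
    #include <stdbool.h>
    #include <stddef.h>
    #include <time.h>

    struct cached_page {
            time_t  cp_last_use;    /* last reference time */
            time_t  cp_ttl;         /* seconds the entry is protected */
            bool    cp_valid;
    };

    /* Evictable only once the protection window has passed. */
    static bool
    page_evictable(const struct cached_page *cp, time_t now)
    {
            return (cp->cp_valid && now - cp->cp_last_use >= cp->cp_ttl);
    }

    /*
     * Prefer pages past their TTL; if every page is still protected,
     * the caller puts the new data in a small "transient" zone
     * instead of displacing the protected working set.
     */
    static struct cached_page *
    choose_victim(struct cached_page *pages, size_t n, time_t now)
    {
            for (size_t i = 0; i < n; i++)
                    if (page_evictable(&pages[i], now))
                            return (&pages[i]);
            return (NULL);          /* nothing expired */
    }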


This randomized drop approach is bogus.  What you are
effectively hoping for is that the pages will not be evicted
before they are hit again because your shotgun pellets did not
happen to shoot them down.

I can prove mathematically (or you can read it out of Knuth's
"Seminumerical Algorithms" and "Sorting and Searching" volumes)
that even with a "perfect" random number generator, as soon as
the data set grows to 115% or more of the cache size, the pages
are all but guaranteed to be evicted before they are
rereferenced, given sequential access.
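
If you'd rather not take the math on faith, a throwaway simulation
along these lines lets you measure it for whatever cache/data-set
ratio you like (hypothetical code, not tied to any kernel): it
cycles sequentially over a data set through a fixed number of cache
slots, evicting a uniformly random resident page on every miss, and
reports the aggregate hit rate.

    /*
     * Cyclic (sequential, repeated) access over "dataset" pages
     * through "cache_slots" slots with uniformly random replacement.
     * Prints the aggregate hit rate so the behavior can be checked
     * for any ratio of data set size to cache size.
     */
    #include <stdio.h>
    #include <stdlib.h>

    int
    main(int argc, char *argv[])
    {
            int cache_slots = argc > 1 ? atoi(argv[1]) : 100000;
            int dataset = argc > 2 ? atoi(argv[2]) : 115000;
            int passes = 20;
            int *slot = malloc(sizeof(int) * cache_slots);  /* slot -> page */
            int *where = malloc(sizeof(int) * dataset);     /* page -> slot */
            long hits = 0, refs = 0;

            if (slot == NULL || where == NULL)
                    return (1);
            for (int i = 0; i < cache_slots; i++)
                    slot[i] = -1;
            for (int p = 0; p < dataset; p++)
                    where[p] = -1;

            for (int pass = 0; pass < passes; pass++) {
                    for (int p = 0; p < dataset; p++) {
                            refs++;
                            if (where[p] != -1) {   /* still resident: hit */
                                    hits++;
                                    continue;
                            }
                            /* Miss: evict a uniformly random slot. */
                            int victim = rand() % cache_slots;
                            if (slot[victim] != -1)
                                    where[slot[victim]] = -1;
                            slot[victim] = p;
                            where[p] = victim;
                    }
            }
            printf("hit rate: %.2f%%\n", 100.0 * hits / refs);
            return (0);
    }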


> There's an extensive literature on randomized algorithms.
> Although I'm certainly no expert, I understand that such
> algorithms work very well in exactly this sort of application,
> since they "usually" avoid worst-case behavior under a broad
> variety of inputs.  The current cache is, in essence,
> tuned specifically to work badly on a system where applications
> are scanning through large amounts of data.  No matter what
> deterministic caching algorithm you use, you're choosing
> to behave badly under some situation.

As I stated before, this is frequently used to "cheat" benchmarks
that intentionally evict cache content in order to defeat
performance improvements from caching (i.e. benchmarks designed
to "cheat" caching).

The "magic" part here, though is that the working set size ends
up being only fractionally larger than the cache size.  For
example, the "Polygraph" benchmark intentionally "busts" the cache
on intermediate servers, to make a political statement:

	http://www.web-polygraph.org/docs/workloads/

Network Appliance and other vendors use a random page replacement
policy that scores them much better numbers on this test, simply
by doing random replacement instead of LRU or some other
weight-based replacement, which the test software is capable of
detecting during the "run up" phase of the test, where the
software attempts to determine the "peak rate" following cache
overflow.

While this gives "good numbers" for the benchmark, it is unlikely
to be representative of real performance under a real workload.
Likewise, since their intent is to find the "knee" in the
performance curve, populate the cache beyond it, and then do the
testing there (thus "proving" that all connections "should be"
end-to-end, rather than permitting the existence of proxies), it
is unlikely to be representative of a real-world cache overflow
condition.  In the real world, the overflow is unlikely to be
limited to "101% of cache capacity"; it will not stop there.
Going 15% over planned capacity will cause random page
replacement to fail to produce cache hits a significant fraction
of the time, and going to an 85% overage will practically
guarantee an almost 100% failure rate, since cyclical access with
random replacement is statistically more likely, in aggregate, to
replace the pages which have been resident longest (the
probability is iterative and additive: it's effectively a
permutation).
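
The "iterative" part can be written down directly: if every miss
evicts a uniformly random one of C resident pages, the chance that
a particular page is still resident after k misses is (1 - 1/C)^k,
which shrinks geometrically the longer a page has to wait for its
next reference.  A toy calculation of that survival curve (the slot
count and eviction counts below are arbitrary illustrations):

    /* Survival probability of one page after k random evictions. */
    #include <math.h>
    #include <stdio.h>

    int
    main(void)
    {
            double slots = 100000.0;        /* C: resident pages */

            for (int k = 25000; k <= 200000; k += 25000)
                    printf("after %6d evictions: %5.1f%% chance of survival\n",
                        k, 100.0 * pow(1.0 - 1.0 / slots, k));
            return (0);
    }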


> Personally, I think there's a lot of merit to _trying_

I think that if this is implemented by someone who doesn't want
to believe Donald Knuth (8-)), they should at least be willing to
test it with a data set three times the available cache size (one
crude way to do that is sketched below), so that they can see
that the algorithm does not result in any performance
improvement.


> randomized disk cache expiry and seeing how it works in practice.

It should work very poorly, according to the math.
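
For what it's worth, one crude way to run the three-times-the-cache
test suggested above (entirely hypothetical harness; the file and
pass count are whatever you pick): read the same large file
sequentially several times and time each pass.  If the replacement
policy is retaining anything useful, the later passes get cheaper;
if it is not, every pass runs at raw disk speed.

    /*
     * Crude harness for the "three times the cache size" test: read
     * the same large file sequentially several times and time each
     * pass.  Nothing here depends on any particular replacement
     * policy.
     */
    #include <fcntl.h>
    #include <stdio.h>
    #include <stdlib.h>
    #include <time.h>
    #include <unistd.h>

    int
    main(int argc, char *argv[])
    {
            char buf[65536];
            struct timespec t0, t1;
            ssize_t n;
            int fd, pass, passes;

            if (argc < 2) {
                    fprintf(stderr, "usage: %s bigfile [passes]\n", argv[0]);
                    return (1);
            }
            passes = argc > 2 ? atoi(argv[2]) : 5;
            if ((fd = open(argv[1], O_RDONLY)) == -1) {
                    perror(argv[1]);
                    return (1);
            }
            for (pass = 0; pass < passes; pass++) {
                    lseek(fd, 0, SEEK_SET);
                    clock_gettime(CLOCK_MONOTONIC, &t0);
                    while ((n = read(fd, buf, sizeof(buf))) > 0)
                            ;       /* just drag the data through the cache */
                    clock_gettime(CLOCK_MONOTONIC, &t1);
                    printf("pass %d: %.2f seconds\n", pass,
                        (t1.tv_sec - t0.tv_sec) +
                        (t1.tv_nsec - t0.tv_nsec) / 1e9);
            }
            close(fd);
            return (0);
    }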

Another approach to "get what you want", for your specific
example, would be preferential caching of metadata vis-a-vis
user data: by caching metadata preferentially, you avoid a
significant number of waits on directory and inode data, each of
which represents a system call and, as a result, a user space
program stall.
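
A minimal sketch of that preference (hypothetical structures, not
the actual buffer cache code): keep metadata and file data on
separate reclaim lists, and only start stealing directory/inode
buffers once there is no file-data buffer left to give up.

    #include <stddef.h>
    #include <sys/queue.h>

    enum buf_kind { BUF_METADATA, BUF_FILEDATA };

    struct cbuf {
            enum buf_kind           b_kind;
            TAILQ_ENTRY(cbuf)       b_list;
            /* ... buffer contents elided ... */
    };

    TAILQ_HEAD(cbuf_list, cbuf);

    static struct cbuf_list data_lru = TAILQ_HEAD_INITIALIZER(data_lru);
    static struct cbuf_list meta_lru = TAILQ_HEAD_INITIALIZER(meta_lru);

    /* Evict file data first; touch metadata only as a last resort. */
    static struct cbuf *
    choose_victim(void)
    {
            struct cbuf *bp;

            if ((bp = TAILQ_FIRST(&data_lru)) != NULL)
                    return (bp);
            return (TAILQ_FIRST(&meta_lru));
    }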


> (I would also observe here that 5.0 now has a fast, high-quality
> source of randomness that seems ideal for exactly such
> applications.)  I don't believe that it would _prevent_ applications
> from using optimizations such as those that Terry suggests,
> while possibly providing reasonable performance under a
> broader range of scenarios than are currently supported.
> 
> Sounds like a good idea to me.

As I said above: if you want to do it, do it.  Then benchmark it
under load conditions that aren't fictitiously close to "just
above 100% cache fill".

If you do this, it would also be useful to benchmark the "hit rate".

On the other hand, if the intent of this is to manufacture
benchmarks on which FreeBSD can look much better than other
OS's, there's a lot of room for that already, and it would be
very easy to do, if you were so inspired.

Personally, my take on that is that it's tantamount to lying;
it's one of the reasons most of us, when referring to these
things, call them "microbenchmarks" instead of just "benchmarks":
they measure things, and such measurements can be made out of
context -- if they are not inherently out of context already
(e.g. you could use "getpid" as your "null system call", then
modify the libc stub to cache the result and zap it to -1 after
fork so that it would be recached, and then you would "get good
numbers" for the "null system call" -- numbers which were not
really measuring what they were supposed to be measuring).
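
Written out, the getpid() cheat looks like this (illustration
only, not actual libc code):

    #include <sys/types.h>
    #include <unistd.h>

    static pid_t cached_pid = -1;

    /* "Null system call" that only enters the kernel on a miss. */
    pid_t
    cheating_getpid(void)
    {
            if (cached_pid == -1)
                    cached_pid = getpid();
            return (cached_pid);
    }

    /* fork() wrapper that zaps the cache so the child re-fetches. */
    pid_t
    cheating_fork(void)
    {
            pid_t pid = fork();

            if (pid == 0)
                    cached_pid = -1;
            return (pid);
    }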

"There are lies, damn lies, and statistics" ... and benchmarks.

-- Terry




