From owner-freebsd-hackers Sun Jan 26 22: 6:18 2003 Delivered-To: freebsd-hackers@freebsd.org Received: from mx1.FreeBSD.org (mx1.freebsd.org [216.136.204.125]) by hub.freebsd.org (Postfix) with ESMTP id EA61737B401 for ; Sun, 26 Jan 2003 22:06:14 -0800 (PST) Received: from puffin.mail.pas.earthlink.net (puffin.mail.pas.earthlink.net [207.217.120.139]) by mx1.FreeBSD.org (Postfix) with ESMTP id 41B8F43F13 for ; Sun, 26 Jan 2003 22:06:14 -0800 (PST) (envelope-from tlambert2@mindspring.com) Received: from pool0003.cvx22-bradley.dialup.earthlink.net ([209.179.198.3] helo=mindspring.com) by puffin.mail.pas.earthlink.net with asmtp (SSLv3:RC4-MD5:128) (Exim 3.33 #1) id 18d2PC-0005Ur-00; Sun, 26 Jan 2003 22:06:03 -0800 Message-ID: <3E34CA7F.D0E0A260@mindspring.com> Date: Sun, 26 Jan 2003 21:58:23 -0800 From: Terry Lambert X-Mailer: Mozilla 4.79 [en] (Win98; U) X-Accept-Language: en MIME-Version: 1.0 To: kientzle@acm.org Cc: Matthew Dillon , Sean Hamilton , hackers@FreeBSD.ORG Subject: Re: Random disk cache expiry References: <000501c2c4dd$f43ed450$16e306cf@slugabed.org> <200301261931.h0QJVCp8052101@apollo.backplane.com> <3E348B51.6F4D6096@mindspring.com> <200301270142.h0R1guR3070182@apollo.backplane.com> <3E3494CC.5895492D@mindspring.com> <3E34A6BB.2090601@acm.org> Content-Type: text/plain; charset=us-ascii Content-Transfer-Encoding: 7bit X-ELNK-Trace: b1a02af9316fbb217a47c185c03b154d40683398e744b8a4508c9def3780424d915691749bd60981667c3043c0873f7e350badd9bab72f9c350badd9bab72f9c Sender: owner-freebsd-hackers@FreeBSD.ORG Precedence: bulk List-ID: List-Archive: (Web Archive) List-Help: (List Instructions) List-Subscribe: List-Unsubscribe: X-Loop: FreeBSD.ORG Tim Kientzle wrote: > Cycling through large data sets is not really that uncommon. > I do something like the following pretty regularly: > find /usr/src -type f | xargs grep function_name > > Even scanning through a large dataset once can really hurt > competing applications on the same machine by flushing > their data from the cache for no gain. I think this > is where randomized expiration might really win, by reducing the > penalty for disk-cache-friendly applications who are competing > with disk-cache-unfriendly applications. Let's set aside for a minute that this discussion does not -- can not -- apply to directory metadata, and that the limiting factor in your example is going to be the divorce of inode data from vnodes in the UFS/UFS2/EXT2 ihash code, or, if there is a low maximum vnode count, the vnodes themselves, and in any divorce case or vnode LRU, the association between the cached data and the FS object will be lost -- effectively losing the cached data, no matter what, since it can't be recovered, even if it's in memory, because nothing references it or remembers which pages it's in. Let's say, instead, that we are only talking about the data being grep'ed. There are a million ways to "win" against such applications; one obvious one is to establish a "cache time to live" for any cached data, and not evict if it has not passed its time to live (effectively establishing two zones, one for cached data, and one for transient data which would be cached, if you had the RAM for it). Doing this randomized drop approach is bogus. What you are effectively hoping for is that the pages will not be evicted before they are hit again, because your shotgun pellets will not shoot them down. I can prove mathematically (or you can read it out of Knuth's Seminumerical Algorithms: Sorting and Searching) that even with a "perfect" random number generator, as soon as the cache size is 115% or less of the data set size, you are almost certainly guaranteed that the pages will be evicted before they are rereferenced, given sequential access. > There's an extensive literature on randomized algorithms. > Although I'm certainly no expert, I understand that such > algorithms work very well in exactly this sort of application, > since they "usually" avoid worst-case behavior under a broad > variety of inputs. The current cache is, in essence, > tuned specifically to work badly on a system where applications > are scanning through large amounts of data. No matter what > deterministic caching algorithm you use, you're choosing > to behave badly under some situation. As I stated before, this is frequently used to "cheat" benchmarks that are intended to evict cache content to defeat performance improvements from caching (bechmarks intended to "cheat" caching). The "magic" part here, though is that the working set size ends up being only fractionally larger than the cache size. For example, the "Polygraph" benchmark intentionally "busts" the cache on intermediate servers, to make a political statement: http://www.web-polygraph.org/docs/workloads/ Network Appliance and other vendors have a random page replacement policy that scores them much better numbers on this test, simply by doing random replacement instead of LRU or some other weight based replacement which the test software is capable of detecting during the "run up" phase of the test, where th software attempts to determine the "peak rate" following cache overflow. While this gives "good numbers" for the benchmark, in a real workload, this is unlikely to be representative of real performance. Likewuse, since their intent is to find the "knee" in the performance curve, and populate the cache beyond that, then do testing there (thus "proving" that all connections "should be" end-to-end, rather than permitting the existance of proxies), this is unlikely to be representative of a real-world cache overflow condition, since in the real world, the overflow is unlikely to be limited to "101% of cache capacity", but will instead not stop there; going above 15% of planned capacity will result in significant failure of random page replacement to result in cache hits; likewise, going to 85% overage will practically guarantee an almost 100% failure rate, as cyclical access with random replacement is statistically more likely, in aggregate, to replace the pages which are there longer (the probability is iterative and additive: it's effectively a permutation). > Personally, I think there's a lot of merit to _trying_ I think that if this is implemented by someone who doesn't want to believe Donald Knuth (8-)), they should at least be willing to test with a data set of 3 times the available cache size, so that they can see that the algorithm does not result in any performance improvement. > randomized disk cache expiry and seeing how it works in practice. It should work very poorly, according to the math. Another approach to "get what you want" would be preferential caching of metadata vis-a-vis user data, for your specific example: by caching metadata preferentially, you avoid a significant number of data waits on directory and inode data, each of which represents a system call and a user space program stall, as a result. > (I would also observe here that 5.0 now has a fast, high-quality > source of randomness that seems ideal for exactly such > applications.) I don't believe that it would _prevent_ applications > from using optimizations such as those that Terry suggests, > while possibly providing reasonable performance under a > broader range of scenarios than are currently supported. > > Sounds like a good idea to me. As I said above: if you want to do it, do it. Then benchmark it under load conditions that aren't fictitiously close to "just above 100% cache fill". If you do this, it would also be useful to benchmark the "hit rate". On the other hand, if the intent of this is manufacture benchmarks on which FreeBSD can look much better than other OS's, there's a lot of space for that already, making it very easy to do, if you were inspired to do that. Personally, my take on that is that it's tantamount to lying; it's one of the reasons most of us, when referring to these things, call them "microbenchmarks" instead of just "benchmarks": they measure things, and such measurements can be made out of context -- if they are not inherently out of context (i.e. you could use "getpid" as you "null system call", and then modify the libc stub to cache the result, and zap it to -1 after fork, so that it would be recached, and the you would "get good numbers" for the "null system call" -- which were not really measuring what they were supposed to be). "There are lies, damn lies, and statistics" ... and benchmarks. -- Terry To Unsubscribe: send mail to majordomo@FreeBSD.org with "unsubscribe freebsd-hackers" in the body of the message