From owner-freebsd-arch  Tue Apr 10 23:54:35 2001
Delivered-To: freebsd-arch@freebsd.org
Received: from earth.backplane.com (earth-nat-cw.backplane.com [208.161.114.67])
          by hub.freebsd.org (Postfix) with ESMTP id 5FE7F37B422
          for ; Tue, 10 Apr 2001 23:54:28 -0700 (PDT)
          (envelope-from dillon@earth.backplane.com)
Received: (from dillon@localhost)
          by earth.backplane.com (8.11.2/8.11.2) id f3B6rCT98951;
          Tue, 10 Apr 2001 23:53:12 -0700 (PDT)
          (envelope-from dillon)
Date: Tue, 10 Apr 2001 23:53:12 -0700 (PDT)
From: Matt Dillon
Message-Id: <200104110653.f3B6rCT98951@earth.backplane.com>
To: Andrew Heybey
Cc: Peter Jeremy, freebsd-arch@FreeBSD.ORG
Subject: Re: mmap(2) vs read(2)/write(2)
References: <20010411095233.P66243@gsmx07.alcatel.com.au>
          <200104110234.f3B2Ysj97756@earth.backplane.com>
          <85d7akqf9h.fsf@stiegl.niksun.com>
Sender: owner-freebsd-arch@FreeBSD.ORG
Precedence: bulk
X-Loop: FreeBSD.ORG

:I discovered this a while ago myself.  In my experiment I did
:madvise(..., MADV_SEQUENTIAL) rather than MADV_WILLNEED.  Would
:doing MADV_SEQUENTIAL in addition to MADV_WILLNEED be useful?

As of 4.1 the VM heuristic does a really excellent job of figuring out
your access pattern, so you do not need to lock it in with madvise().
Also as of 4.1 or so the VM fault patterns are tracked on a per-process
basis (in the vm_map_entry), independent of accesses made by other
processes and also independent of VFS operations like lseek(), read(),
and write().  And, since it's done in the vm_map_entry, the fault
patterns are regionalized within each mmap'd block.  So the VM system's
heuristic will not get confused if several processes are accessing the
same file in different ways, and it can also calculate the heuristic on
the different mmap'd areas (data, bss, text, shared libraries, multiple
mmap()'s that you make) independently.  So MADV_WILLNEED (and perhaps
DONTNEED) is really all you need to be optimal.

:Another thing that I noticed is that if the data are not already in
:the cache, then mmap() will read from disk every time (even if the
:file will fit in memory) while read() will leave the data in the
:cache.  So when reading a file that will fit in memory, the fastest was
:read the first time followed by mmap for subsequent passes.  This was
:on 3.2, however, maybe things have changed since then?
:
:andrew

4.x definitely caches the data read in through page faults.  3.x should
have too, though perhaps not quite as well.  We've done a bunch of work
in 4.x to try to prevent cache blow-aways, which may be what you are
seeing.  A cache blow-away is where you have a normal system with a
bunch of cached data and then go in and blow it away by, say, grepping
through a 1G file.  Obviously for that case you do not want the scan of
the 1G file to blow away all the wonderfully cached data you already
have!  Just accessing a piece of data once is not enough to cache it
over data that might already be in the cache.  Example:

    ./rf -m test2
    ./rf -m test2
    ./rf -m test2
    cksum 0
    read 33554432 bytes in 0.270 seconds, 118.310 MB/sec cpu 0.273 sec
    ns1:/home/dillon> ./rf -f test1
    cksum 0
    read 1073741824 bytes in 43.381 seconds, 23.605 MB/sec cpu 11.228 sec
    ns1:/home/dillon> ./rf -m test2
    cksum 0
    read 33554432 bytes in 0.271 seconds, 118.288 MB/sec cpu 0.265 sec

Remember, test1 is the huge file and test2 is the small file.  We force
test2 into the cache more permanently by repeatedly accessing it.  We
then sequentially read test1.  But note that when we read test2 again it
still gets 118MB/sec ... the read of the 1G test1 file did *NOT* blow
away the system's cache of the test2 data.
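(The rf program used above isn't included in this message.  Purely as
an illustration, and only a sketch rather than the real thing, an
rf-like scanner might look like the code below, assuming -f means "scan
the file with read(2)" and -m means "scan it through an mmap(2)
mapping", with a trivial one-byte-per-page checksum and a wall-clock
timer.  The madvise(MADV_WILLNEED) call is just the optional hint
discussed earlier, not necessarily something the real rf does.)

    /*
     * rf.c -- hypothetical stand-in for the rf test program (NOT the
     * original).
     *
     *   rf -f file    scan the file sequentially with read(2)
     *   rf -m file    scan the file sequentially through an mmap(2) mapping
     *
     * Prints a trivial checksum plus the byte count and elapsed time,
     * in the same spirit as the transcripts above.
     */
    #include <sys/types.h>
    #include <sys/mman.h>
    #include <sys/stat.h>
    #include <sys/time.h>

    #include <err.h>
    #include <fcntl.h>
    #include <stdint.h>
    #include <stdio.h>
    #include <string.h>
    #include <unistd.h>

    int
    main(int argc, char **argv)
    {
        struct stat st;
        struct timeval tv1, tv2;
        double elapsed;
        uint32_t cksum = 0;
        off_t total = 0;
        off_t i;
        int fd;

        if (argc != 3 ||
            (strcmp(argv[1], "-f") != 0 && strcmp(argv[1], "-m") != 0))
                errx(1, "usage: rf -f|-m file");
        if ((fd = open(argv[2], O_RDONLY)) < 0)
                err(1, "%s", argv[2]);
        if (fstat(fd, &st) < 0)
                err(1, "fstat");

        gettimeofday(&tv1, NULL);

        if (strcmp(argv[1], "-m") == 0) {
                /* Map the whole file and touch one byte per page. */
                char *base;

                base = mmap(NULL, (size_t)st.st_size, PROT_READ,
                    MAP_SHARED, fd, 0);
                if (base == MAP_FAILED)
                        err(1, "mmap");
                /* Optional hint; as noted above, the heuristic usually
                 * figures the access pattern out on its own. */
                (void)madvise(base, (size_t)st.st_size, MADV_WILLNEED);
                for (i = 0; i < st.st_size; i += 4096)
                        cksum += (uint8_t)base[i];
                total = st.st_size;
                munmap(base, (size_t)st.st_size);
        } else {
                /* Scan the file through the buffer cache with read(2). */
                static char buf[65536];
                ssize_t n, j;

                while ((n = read(fd, buf, sizeof(buf))) > 0) {
                        for (j = 0; j < n; j += 4096)
                                cksum += (uint8_t)buf[j];
                        total += n;
                }
                if (n < 0)
                        err(1, "read");
        }

        gettimeofday(&tv2, NULL);
        elapsed = (tv2.tv_sec - tv1.tv_sec) +
            (tv2.tv_usec - tv1.tv_usec) / 1e6;
        if (elapsed <= 0.0)
                elapsed = 0.001;
        printf("cksum %u\n", cksum);
        printf("read %jd bytes in %.3f seconds, %.3f MB/sec\n",
            (intmax_t)total, elapsed, total / (1024.0 * 1024.0) / elapsed);
        close(fd);
        return (0);
    }

Something like 'cc -O -o rf rf.c', a 1G test1 and a 32MB test2 would
reproduce the general shape of the transcripts, though the exact
numbers obviously depend on the machine.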
Here's another example.  If you blow away the cache by reading test1
through an mmap, then try to read test2 through an mmap a couple of
times:

    ns1:/home/dillon> ./rf -m test1
    cksum 0
    read 1073741824 bytes in 48.717 seconds, 21.019 MB/sec cpu 11.962 sec
    ns1:/home/dillon> ./rf -m test2
    cksum 0
    read 33554432 bytes in 0.945 seconds, 33.873 MB/sec cpu 0.329 sec
    ns1:/home/dillon> ./rf -m test2
    cksum 0
    read 33554432 bytes in 0.898 seconds, 35.636 MB/sec cpu 0.290 sec
    ns1:/home/dillon> ./rf -m test2
    cksum 0
    read 33554432 bytes in 0.418 seconds, 76.566 MB/sec cpu 0.272 sec
    ns1:/home/dillon> ./rf -m test2
    cksum 0
    read 33554432 bytes in 0.271 seconds, 118.153 MB/sec cpu 0.272 sec
    ns1:/home/dillon> ./rf -m test2
    cksum 0
    read 33554432 bytes in 0.271 seconds, 118.243 MB/sec cpu 0.272 sec

Notice that test2 is not being 100% cached in the first pass.  test2 in
this case is 32MB of data, and the system is not willing to throw away
32MB of data cached prior to accessing test2.  But after a couple of
scans of test2 the system figures out that you really do want to cache
all 32MB.

Now, unfortunately, the blow-away prevention algorithm has not yet been
refined, so it has different effects on the read() versus mmap/read
methods of scanning a file.  It works for both, but read() goes through
the buffer cache, and since backing pages for the buffer cache are
wired, read() winds up with a small edge over mmap() in regard to page
priority.

Blow-away is handled through a combination of several algorithms.  The
VM page queue's native priority assignment algorithm gives newly cached
pages a 'neutral' priority rather than a high priority, which gives
them room to go up or down in priority (Rik is very familiar with
this).  This does the bulk of the work.  There are two other algorithms
involved, however.  First, the sequential heuristic attempts to depress
the priority of pages behind the read at the same time it attempts to
read pages ahead of the read.  Second, the VM system has a little
algorithm to avoid silly-recycling syndrome.  This occurs when all the
pages in the system are at a higher priority and you wind up instantly
(too quickly) recycling the pages you just read in due to their neutral
priority.  The solution is not to blindly depress the priority of pages
behind the read but to instead give a small percentage of them a higher
priority so they stick around longer.

If you were to repeat the above test using ./rf -f test2 you would
notice that it caches the whole file right off the bat, whereas
./rf -m test2 did not cache the whole (32MB) file right off the bat.
This is an example of the differences that still exist between VFS ops
and MMAP ops.  It's good enough to prevent cache blow-aways, but it
isn't yet optimal or generic across the different access methods.

Ya never thought caching could be this complicated, eh?

						-Matt

To Unsubscribe: send mail to majordomo@FreeBSD.org
with "unsubscribe freebsd-arch" in the body of the message