Date: Tue, 10 Apr 2001 23:53:12 -0700 (PDT)
From: Matt Dillon <dillon@earth.backplane.com>
To: Andrew Heybey <ath@niksun.com>
Cc: Peter Jeremy <peter.jeremy@alcatel.com.au>, freebsd-arch@FreeBSD.ORG
Subject: Re: mmap(2) vs read(2)/write(2)
Message-ID: <200104110653.f3B6rCT98951@earth.backplane.com>
References: <20010411095233.P66243@gsmx07.alcatel.com.au> <200104110234.f3B2Ysj97756@earth.backplane.com> <85d7akqf9h.fsf@stiegl.niksun.com>
:I discovered this a while ago myself. In my experiment I did
:madvise(..., MADV_SEQUENTIAL) rather than MADV_WILLNEED. Would
:doing MADV_SEQUENTIAL in addition to MADV_WILLNEED be useful?
As of 4.1 the VM heuristic does a really excellent job figuring out
your access pattern, so you do not need to lock it in with an
madvise(). Also as of 4.1 or so the VM fault patterns are tracked
on a per-process basis (in the vm_map_entry), independent of accesses
made by other processes and also independent of VFS operations like
lseek(), read(), and write(). And, since it's done in the vm_map_entry,
the fault patterns are regionalized within each mmap'd block. So the
VM system's heuristic will not get confused if several processes
are accessing the same file in different ways and can also calculate the
heuristic on the different mmap'd areas (data, bss, text, shared libraries,
multiple mmap()'s that you make) independently.
So MADV_WILLNEED (and perhaps DONTNEED) is really all you need to be
optimal.
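For illustration, here is a minimal sketch of the userland side of this
(it is only a sketch, not the source of any tool mentioned in this mail,
and the file handling is arbitrary):

    #include <sys/mman.h>
    #include <sys/stat.h>
    #include <fcntl.h>
    #include <stdio.h>
    #include <stdlib.h>
    #include <unistd.h>

    int
    main(int argc, char **argv)
    {
        struct stat st;
        unsigned char *base;
        unsigned long sum = 0;
        off_t i;
        int fd;

        if (argc < 2 || (fd = open(argv[1], O_RDONLY)) < 0 ||
            fstat(fd, &st) < 0) {
            perror("open/fstat");
            exit(1);
        }
        base = mmap(NULL, st.st_size, PROT_READ, MAP_SHARED, fd, 0);
        if (base == MAP_FAILED) {
            perror("mmap");
            exit(1);
        }
        /*
         * Hint the VM system.  MADV_WILLNEED asks it to start bringing
         * the pages in now.  MADV_SEQUENTIAL would lock in a sequential
         * pattern, but as described above the 4.1+ heuristic normally
         * figures that out on its own.
         */
        madvise(base, st.st_size, MADV_WILLNEED);

        for (i = 0; i < st.st_size; i += 4096)  /* touch each page */
            sum += base[i];
        printf("%lu\n", sum);
        munmap(base, st.st_size);
        close(fd);
        return (0);
    }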
:Another thing that I noticed is that if the data are not already in
:the cache, then mmap() will read from disk every time (even if the
:file will fit in memory) while read() will leave the data in the
:cache. So when reading a file that will fit in memory, the fastest was
:to read() the first time followed by mmap() for subsequent passes. This was
:on 3.2, however, maybe things have changed since then?
:
:andrew
4.x definitely caches the data read in through page faults. 3.x should
have too, though perhaps not quite as well.
We've done a bunch of work in 4.x to try to prevent cache blow-aways,
which may be what you are seeing. A cache blow-away is where you have a
normal system with a bunch of cached data and then go in and blow it away
by, say, grepping through a 1G file. Obviously for that case you do not
want the scan of the 1G file to blow away all the wonderfully cached data
you already have! Just accessing a piece of data once is not
enough to cache it over data that might already be in the cache. Example:
ns1:/home/dillon> ./rf -m test2
ns1:/home/dillon> ./rf -m test2
ns1:/home/dillon> ./rf -m test2
cksum 0 read 33554432 bytes in 0.270 seconds, 118.310 MB/sec cpu 0.273 sec
ns1:/home/dillon> ./rf -f test1
cksum 0 read 1073741824 bytes in 43.381 seconds, 23.605 MB/sec cpu 11.228 sec
ns1:/home/dillon> ./rf -m test2
cksum 0 read 33554432 bytes in 0.271 seconds, 118.288 MB/sec cpu 0.265 sec
Remember, test1 is the huge file, test2 is the small file. We force
test2 into the cache more permanently by repeatedly accessing it. We
then sequentially read test1. But note that when we read test2 again,
it still gets 118MB/sec ... the read of the 1G test1 file did
*NOT* blow away the system's cache of the test2 data.
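(The rf source isn't included in this mail; roughly, it does something
like the sketch below, where -m is taken to mean "scan via mmap" and -f
"scan via read()" -- that mapping is an assumption from context, and the
timing/MB/sec bookkeeping is omitted.)

    #include <sys/mman.h>
    #include <sys/stat.h>
    #include <fcntl.h>
    #include <stdio.h>
    #include <stdlib.h>
    #include <string.h>
    #include <unistd.h>

    int
    main(int argc, char **argv)
    {
        struct stat st;
        unsigned long cksum = 0;
        const char *file;
        int fd, use_mmap;

        if (argc < 2) {
            fprintf(stderr, "usage: rf [-m|-f] file\n");
            exit(1);
        }
        use_mmap = (argc > 2 && strcmp(argv[1], "-m") == 0);
        file = argv[argc - 1];
        if ((fd = open(file, O_RDONLY)) < 0 || fstat(fd, &st) < 0) {
            perror(file);
            exit(1);
        }
        if (use_mmap) {
            /* mmap method: fault the pages in and touch them directly. */
            unsigned char *base = mmap(NULL, st.st_size, PROT_READ,
                MAP_SHARED, fd, 0);
            off_t i;

            if (base == MAP_FAILED) {
                perror("mmap");
                exit(1);
            }
            for (i = 0; i < st.st_size; ++i)
                cksum += base[i];
            munmap(base, st.st_size);
        } else {
            /* read method: copy the data out through the buffer cache. */
            static unsigned char buf[65536];
            ssize_t n, i;

            while ((n = read(fd, buf, sizeof(buf))) > 0)
                for (i = 0; i < n; ++i)
                    cksum += buf[i];
        }
        printf("cksum %lu read %lld bytes\n", cksum, (long long)st.st_size);
        close(fd);
        return (0);
    }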
Here's another example. If you blow away the cache by reading test1
through an mmap, then try to read test2 through an mmap a couple of
times:
ns1:/home/dillon> ./rf -m test1
cksum 0 read 1073741824 bytes in 48.717 seconds, 21.019 MB/sec cpu 11.962 sec
ns1:/home/dillon> ./rf -m test2
cksum 0 read 33554432 bytes in 0.945 seconds, 33.873 MB/sec cpu 0.329 sec
ns1:/home/dillon> ./rf -m test2
cksum 0 read 33554432 bytes in 0.898 seconds, 35.636 MB/sec cpu 0.290 sec
ns1:/home/dillon> ./rf -m test2
cksum 0 read 33554432 bytes in 0.418 seconds, 76.566 MB/sec cpu 0.272 sec
ns1:/home/dillon> ./rf -m test2
cksum 0 read 33554432 bytes in 0.271 seconds, 118.153 MB/sec cpu 0.272 sec
ns1:/home/dillon> ./rf -m test2
cksum 0 read 33554432 bytes in 0.271 seconds, 118.243 MB/sec cpu 0.272 sec
Notice that test2 is not being 100% cached in the first pass. test2 in
this case is 32MB of data. The system is not willing to throw away
32MB of data cached prior to accessing test2. But after a couple of
scans of test2 the system figures out that you really do want to cache
all 32MB.
Now, unfortunately, the blow-away prevention algorithm has not yet been
refined, so it has different effects on the read() versus mmap methods
of scanning a file. It works for both, but read() goes through the
buffer cache, and since the backing pages for the buffer cache are wired,
read() winds up with a small edge over mmap() with regard to page priority.
Blow-away is handled through a combination of several algorithms. First,
the VM page queue's native priority assignment algorithm gives newly
cached pages a 'neutral' priority rather than a high priority, which gives
them room to go up or down in priority (Rik is very familiar with this).
This is the bulwark. There are two other algorithms involved, however.
First, the sequential heuristic attempts to depress the priority of
pages behind the read at the same time it attempts to read pages ahead
of the read. Second, the VM system has a little algorithm to avoid
silly-recycling syndrome. This occurs when all the pages in the system
are at a higher priority and you wind up instantly (too quickly)
recycling the pages you just read in due to their neutral priority.
The solution is to not blindly depress the priority of pages behind
the read but to instead give a small percentage of them a higher priority
so they stick around longer.
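If you wanted to approximate that behind-the-read depression from userland
yourself -- say, for a one-shot scan of a huge mmap'd file -- it would look
something like the sketch below. This is only an application-level analog
of what the kernel does internally, and the window size is arbitrary:

    #include <sys/mman.h>
    #include <sys/stat.h>
    #include <fcntl.h>
    #include <stdio.h>
    #include <stdlib.h>
    #include <unistd.h>

    #define WINDOW  (8 * 1024 * 1024)   /* arbitrary chunk size */

    int
    main(int argc, char **argv)
    {
        struct stat st;
        unsigned char *base;
        unsigned long cksum = 0;
        off_t off, end, i;
        int fd;

        if (argc < 2 || (fd = open(argv[1], O_RDONLY)) < 0 ||
            fstat(fd, &st) < 0) {
            perror("open/fstat");
            exit(1);
        }
        base = mmap(NULL, st.st_size, PROT_READ, MAP_SHARED, fd, 0);
        if (base == MAP_FAILED) {
            perror("mmap");
            exit(1);
        }
        madvise(base, st.st_size, MADV_SEQUENTIAL);

        for (off = 0; off < st.st_size; off += WINDOW) {
            end = off + WINDOW;
            if (end > st.st_size)
                end = st.st_size;
            for (i = off; i < end; ++i)
                cksum += base[i];
            /*
             * Done with this window; tell the VM system it can recycle
             * these pages so the scan doesn't push other cached data out.
             */
            madvise(base + off, end - off, MADV_DONTNEED);
        }
        printf("cksum %lu\n", cksum);
        munmap(base, st.st_size);
        close(fd);
        return (0);
    }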
If you were to repeat the above test using ./rf -f test2 you would notice
that it caches the whole file right off the bat, whereas ./rf -m test2
did not cache the whole (32MB) file right off the bat. This is an
example of the differences that still exist between VFS ops and MMAP
ops. It's good enough to prevent cache blow-aways, but isn't yet
optimal or generic across the different access methods.
Ya never thought caching could be this complicated, eh?
-Matt