From owner-freebsd-arch  Tue Apr 10 23:54:35 2001
Delivered-To: freebsd-arch@freebsd.org
Received: from earth.backplane.com (earth-nat-cw.backplane.com [208.161.114.67])
          by hub.freebsd.org (Postfix) with ESMTP id 5FE7F37B422
          for ; Tue, 10 Apr 2001 23:54:28 -0700 (PDT)
          (envelope-from dillon@earth.backplane.com)
Received: (from dillon@localhost)
          by earth.backplane.com (8.11.2/8.11.2) id f3B6rCT98951;
          Tue, 10 Apr 2001 23:53:12 -0700 (PDT)
          (envelope-from dillon)
Date: Tue, 10 Apr 2001 23:53:12 -0700 (PDT)
From: Matt Dillon
Message-Id: <200104110653.f3B6rCT98951@earth.backplane.com>
To: Andrew Heybey
Cc: Peter Jeremy, freebsd-arch@FreeBSD.ORG
Subject: Re: mmap(2) vs read(2)/write(2)
References: <20010411095233.P66243@gsmx07.alcatel.com.au>
          <200104110234.f3B2Ysj97756@earth.backplane.com>
          <85d7akqf9h.fsf@stiegl.niksun.com>
Sender: owner-freebsd-arch@FreeBSD.ORG
Precedence: bulk
X-Loop: FreeBSD.ORG

:I discovered this a while ago myself.  In my experiment I did
:madvise(..., MADV_SEQUENTIAL) rather than MADV_WILLNEED.  Would
:doing MADV_SEQUENTIAL in addition to MADV_WILLNEED be useful?

As of 4.1 the VM heuristic does a really excellent job of figuring out
your access pattern, so you do not need to lock it in with madvise().
Also as of 4.1 or so the VM fault patterns are tracked on a per-process
basis (in the vm_map_entry), independent of accesses made by other
processes and also independent of VFS operations like lseek(), read(),
and write().  And, since it's done in the vm_map_entry, the fault
patterns are regionalized within each mmap'd block.  So the VM system's
heuristic will not get confused if several processes are accessing the
same file in different ways, and it can also calculate the heuristic on
the different mmap'd areas (data, bss, text, shared libraries, multiple
mmap()'s that you make) independently.  So MADV_WILLNEED (and perhaps
DONTNEED) is really all you need to be optimal.

:Another thing that I noticed is that if the data are not already in
:the cache, then mmap() will read from disk every time (even if the
:file will fit in memory) while read() will leave the data in the
:cache.  So when reading a file that will fit in memory, the fastest was
:read the first time followed by mmap for subsequent passes.  This was
:on 3.2, however, maybe things have changed since then?
:
:andrew

4.x definitely caches the data read in through page faults.  3.x should
have too, though perhaps not quite as well.  We've done a bunch of work
in 4.x to try to prevent cache blow-aways, which may be what you are
seeing.  A cache blow-away is where you have a normal system with a
bunch of cached data and then go in and blow it away by, say, grepping
through a 1G file.  Obviously for that case you do not want the scan of
the 1G file to blow away all the wonderfully cached data you already
have!  Just accessing a piece of data once is not enough to cache it
over data that might already be in the cache.  Example:

    ./rf -m test2
    ./rf -m test2
    ./rf -m test2
    cksum 0
    read 33554432 bytes in 0.270 seconds, 118.310 MB/sec cpu 0.273 sec
    ns1:/home/dillon> ./rf -f test1
    cksum 0
    read 1073741824 bytes in 43.381 seconds, 23.605 MB/sec cpu 11.228 sec
    ns1:/home/dillon> ./rf -m test2
    cksum 0
    read 33554432 bytes in 0.271 seconds, 118.288 MB/sec cpu 0.265 sec

Remember, test1 is the huge file and test2 is the small file.  We force
test2 into the cache more permanently by repeatedly accessing it.  We
then sequentially read test1.  But note that when we read test2 again it
still gets 118MB/sec ... the read of the 1G test1 file did *NOT* blow
away the system's cache of the test2 data.
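(The rf program used above isn't included in this message.  Purely as
an illustration, and only a sketch rather than the real thing, an
rf-like scanner might look like the code below, assuming -f means "scan
the file with read(2)" and -m means "scan it through an mmap(2)
mapping", with a trivial one-byte-per-page checksum and a wall-clock
timer.  The madvise(MADV_WILLNEED) call is just the optional hint
discussed earlier, not necessarily something the real rf does.)

    /*
     * rf.c -- hypothetical stand-in for the rf test program (NOT the
     * original).
     *
     *   rf -f file    scan the file sequentially with read(2)
     *   rf -m file    scan the file sequentially through an mmap(2) mapping
     *
     * Prints a trivial checksum plus the byte count and elapsed time,
     * in the same spirit as the transcripts above.
     */
    #include <sys/types.h>
    #include <sys/mman.h>
    #include <sys/stat.h>
    #include <sys/time.h>

    #include <err.h>
    #include <fcntl.h>
    #include <stdint.h>
    #include <stdio.h>
    #include <string.h>
    #include <unistd.h>

    int
    main(int argc, char **argv)
    {
        struct stat st;
        struct timeval tv1, tv2;
        double elapsed;
        uint32_t cksum = 0;
        off_t total = 0;
        off_t i;
        int fd;

        if (argc != 3 ||
            (strcmp(argv[1], "-f") != 0 && strcmp(argv[1], "-m") != 0))
                errx(1, "usage: rf -f|-m file");
        if ((fd = open(argv[2], O_RDONLY)) < 0)
                err(1, "%s", argv[2]);
        if (fstat(fd, &st) < 0)
                err(1, "fstat");

        gettimeofday(&tv1, NULL);

        if (strcmp(argv[1], "-m") == 0) {
                /* Map the whole file and touch one byte per page. */
                char *base;

                base = mmap(NULL, (size_t)st.st_size, PROT_READ,
                    MAP_SHARED, fd, 0);
                if (base == MAP_FAILED)
                        err(1, "mmap");
                /* Optional hint; as noted above, the heuristic usually
                 * figures the access pattern out on its own. */
                (void)madvise(base, (size_t)st.st_size, MADV_WILLNEED);
                for (i = 0; i < st.st_size; i += 4096)
                        cksum += (uint8_t)base[i];
                total = st.st_size;
                munmap(base, (size_t)st.st_size);
        } else {
                /* Scan the file through the buffer cache with read(2). */
                static char buf[65536];
                ssize_t n, j;

                while ((n = read(fd, buf, sizeof(buf))) > 0) {
                        for (j = 0; j < n; j += 4096)
                                cksum += (uint8_t)buf[j];
                        total += n;
                }
                if (n < 0)
                        err(1, "read");
        }

        gettimeofday(&tv2, NULL);
        elapsed = (tv2.tv_sec - tv1.tv_sec) +
            (tv2.tv_usec - tv1.tv_usec) / 1e6;
        if (elapsed <= 0.0)
                elapsed = 0.001;
        printf("cksum %u\n", cksum);
        printf("read %jd bytes in %.3f seconds, %.3f MB/sec\n",
            (intmax_t)total, elapsed, total / (1024.0 * 1024.0) / elapsed);
        close(fd);
        return (0);
    }

Something like 'cc -O -o rf rf.c', a 1G test1 and a 32MB test2 would
reproduce the general shape of the transcripts, though the exact
numbers obviously depend on the machine.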
Here's another example.  If you blow away the cache by reading test1
through an mmap, then try to read test2 through an mmap a couple of
times:

    ns1:/home/dillon> ./rf -m test1
    cksum 0
    read 1073741824 bytes in 48.717 seconds, 21.019 MB/sec cpu 11.962 sec
    ns1:/home/dillon> ./rf -m test2
    cksum 0
    read 33554432 bytes in 0.945 seconds, 33.873 MB/sec cpu 0.329 sec
    ns1:/home/dillon> ./rf -m test2
    cksum 0
    read 33554432 bytes in 0.898 seconds, 35.636 MB/sec cpu 0.290 sec
    ns1:/home/dillon> ./rf -m test2
    cksum 0
    read 33554432 bytes in 0.418 seconds, 76.566 MB/sec cpu 0.272 sec
    ns1:/home/dillon> ./rf -m test2
    cksum 0
    read 33554432 bytes in 0.271 seconds, 118.153 MB/sec cpu 0.272 sec
    ns1:/home/dillon> ./rf -m test2
    cksum 0
    read 33554432 bytes in 0.271 seconds, 118.243 MB/sec cpu 0.272 sec

Notice that test2 is not being 100% cached in the first pass.  test2 in
this case is 32MB of data, and the system is not willing to throw away
32MB of data cached prior to accessing test2.  But after a couple of
scans of test2 the system figures out that you really do want to cache
all 32MB.

Now, unfortunately, the blow-away prevention algorithm has not yet been
refined, so it has different effects on the read() versus mmap/read
methods of scanning a file.  It works for both, but read() goes through
the buffer cache, and since backing pages for the buffer cache are
wired, read() winds up with a small edge over mmap() in regard to page
priority.

Blow-away is handled through a combination of several algorithms.  The
VM page queue's native priority assignment algorithm gives newly cached
pages a 'neutral' priority rather than a high priority, which gives
them room to go up or down in priority (Rik is very familiar with
this).  This does the bulk of the work.  There are two other algorithms
involved, however.  First, the sequential heuristic attempts to depress
the priority of pages behind the read at the same time it attempts to
read pages ahead of the read.  Second, the VM system has a little
algorithm to avoid silly-recycling syndrome.  This occurs when all the
pages in the system are at a higher priority and you wind up instantly
(too quickly) recycling the pages you just read in due to their neutral
priority.  The solution is not to blindly depress the priority of pages
behind the read but to instead give a small percentage of them a higher
priority so they stick around longer.

If you were to repeat the above test using ./rf -f test2 you would
notice that it caches the whole file right off the bat, whereas
./rf -m test2 did not cache the whole (32MB) file right off the bat.
This is an example of the differences that still exist between VFS ops
and MMAP ops.  It's good enough to prevent cache blow-aways, but it
isn't yet optimal or generic across the different access methods.

Ya never thought caching could be this complicated, eh?

						-Matt

To Unsubscribe: send mail to majordomo@FreeBSD.org
with "unsubscribe freebsd-arch" in the body of the message