Date: Mon, 7 May 2001 17:56:01 -0700 (PDT)
From: Matt Dillon <dillon@earth.backplane.com>
To: Rik van Riel <riel@conectiva.com.br>
Cc: <arch@freebsd.org>, <linux-mm@kvack.org>, <sfkaplan@cs.amherst.edu>
Subject: Re: on load control / process swapping
Message-ID: <200105080056.f480u1Q71866@earth.backplane.com>
References: <Pine.LNX.4.33.0105071956180.18102-100000@duckman.distro.conectiva>

:>     to be moved out of that queue for a minimum period of time based
:>     on page aging.  See line 500 or so of vm_pageout.c (in -stable).
:>
:>     Thus when a process wakes up and pages a bunch of pages in, those
:>     pages are guaranteed to stay in-core for a period of time no matter
:>     what level of memory stress is occurring.
:
:I don't see anything limiting the speed at which the active list
:is scanned over and over again. OTOH, you are right that a failure
:to deactivate enough pages will trigger the swapout code .....
:
:This sure is a subtle interaction ;)
    Look at the loop at line 1362 of vm_pageout.c.  Note that it enforces
    a HZ/2 tsleep (2 scans per second) if the pageout daemon is unable
    to clean sufficient pages in two loops.  Nobody wakes up the tsleep
    during that 1/2 second because vm_pages_needed has not been cleared
    yet.  This is what limits the page queue scan rate.
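
    A toy model of that rate limit, in Python rather than kernel C.  It
    is only a sketch: the function name, the HZ value, and the assumption
    that a scan takes no time are all invented for illustration and are
    not taken from vm_pageout.c.

```python
HZ = 100  # clock ticks per second; illustrative value only


def scans_during_shortage(ticks, hz=HZ):
    """Count page queue scans during `ticks` ticks of sustained shortage.

    Assumptions of this toy model: a scan takes no time, no scan ever
    cleans enough pages, and nothing wakes the daemon early.
    """
    now = 0
    scans = 0
    failed = 0
    while now < ticks:
        if failed >= 2:
            # Two passes have already failed: sleep half a second first.
            now += hz // 2
            if now >= ticks:
                break
        scans += 1   # one pass over the page queues
        failed += 1  # sustained shortage: the pass did not clean enough
    return scans
```

    Under these assumptions, ten seconds of sustained shortage
    (scans_during_shortage(10 * HZ)) yields 21 scans: two back-to-back
    passes, then one pass every half second.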
:>     When a process is swapped out, the process is removed from the run
:>     queue and the P_INMEM flag is cleared.  The process is only woken up
:>     when faultin() is called (vm_glue.c line 312).  faultin() is only
:>     called from the scheduler() (line 340 of vm_glue.c) and the scheduler
:>     only runs when the VM system indicates a minimum number of free pages
:>     are available (vm_page_count_min()), which you can adjust with
:>     the vm.v_free_min sysctl (usually represents 1-9 megabytes, depending
:>     on how much memory the system has).
:
:But ... is this a good enough indication that the processes
:currently resident have enough memory available to make any
:progress ?
    Yes.  Consider detecting the difference between a large process accessing
    its pages randomly, and a small process accessing a relatively small
    set of pages over and over again.  Now consider what happens when the
    system gets overloaded.  The small process will be able to access its
    pages enough that they will get page priority over the larger process.
    The larger process, due to the more random accesses (or simply the fact
    that it is accessing a larger set of pages) will tend to stall more on
    pagein I/O which has the side effect of reducing the large process's
    access rate on all of its pages.  The result:  small processes get more
    priority just by being small.
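
    The effect can be reproduced in a small simulation.  This is a
    sketch under loud assumptions: it uses plain LRU replacement as a
    stand-in for act_count aging, a fixed stall per page fault, and a
    strided walk in place of truly random access.  All names and
    parameter values are hypothetical.

```python
def simulate(ticks, frames=16, small_pages=4, large_pages=64, fault_cost=5):
    """Toy two-process paging model; returns {name: (faults, accesses)}."""
    last_use = {}  # resident page -> tick of its most recent access
    procs = {
        # "step" picks the access pattern: 1 = tight loop, 7 = scattered walk
        "small": {"pages": small_pages, "step": 1},
        "large": {"pages": large_pages, "step": 7},
    }
    for p in procs.values():
        p.update(i=0, ready_at=0, faults=0, accesses=0)

    for now in range(ticks):
        for name, p in procs.items():
            if now < p["ready_at"]:
                continue  # still blocked waiting for pagein I/O
            page = (name, (p["i"] * p["step"]) % p["pages"])
            p["i"] += 1
            p["accesses"] += 1
            if page not in last_use:
                p["faults"] += 1
                p["ready_at"] = now + fault_cost  # stall models pagein latency
                if len(last_use) >= frames:
                    # Memory is full: evict the least recently used page.
                    victim = min(last_use, key=last_use.get)
                    del last_use[victim]
            last_use[page] = now
    return {name: (p["faults"], p["accesses"]) for name, p in procs.items()}
```

    With the defaults, simulate(1000) reports 4 faults in 984 accesses
    for the small process (its cold-start faults only) and 200 faults in
    200 accesses for the large one.  Nothing in the model looks at
    process size; the small process keeps its pages simply because it
    touches them often enough.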
:Especially if all the currently resident processes are waiting
:in page faults, won't that make it easier for the system to find
:pages to swap out, etc... ?
:
:One thing I _am_ wondering though: the pageout and the pagein
:thresholds are different. Can't this lead to problems where we
:always hit both the pageout threshold -and- the pagein threshold
:and the system thrashes swapping processes in and out ?
    The system will not page out a page it has just paged in due to the
    center-of-the-road initialization of act_count (the page aging).
    My experience at BEST was that pagein and pageout activity occurred
    simultaneously, but that had no detrimental effect on the system.
    You have to treat the pagein and pageout operations independently
    because, in fact, they are only weakly related to each other.  The
    only optimization you make, to reduce thrashing, is to not allow a
    just-paged-in page to immediately turn around and be paged out.

    I could probably make this work even better by setting the vm_page_t's
    act_count to its max value when paging in from swap.  I'll think about
    doing that.
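
    A minimal sketch of why the starting act_count matters.  The
    constants below are illustrative assumptions, not the kernel's
    actual values, and the function is a hypothetical helper: it counts
    how many pageout scans an untouched page survives before its count
    decays to zero.

```python
ACT_MAX = 64              # illustrative ceiling (assumption)
ACT_INIT = ACT_MAX // 2   # "center-of-the-road" starting point (assumption)
ACT_DECLINE = 1           # illustrative decay applied per scan (assumption)


def scans_until_deactivation(act_count, decline=ACT_DECLINE):
    """Number of scans an unreferenced page survives on the active queue."""
    scans = 0
    while act_count > 0:
        act_count = max(0, act_count - decline)
        scans += 1
    return scans
```

    With these made-up values a page paged in at ACT_INIT survives 32
    scans and one paged in at ACT_MAX survives 64.  At the two scans per
    second quoted earlier that is roughly 16 versus 32 seconds of
    guaranteed residency.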

    The pagein and pageout rates have nothing to do with thrashing, per se,
    and should never be arbitrarily limited.  Consider the difference
    between a system that is paging heavily and a system with only two small
    processes (like cp's) competing for disk I/O.  Insofar as I/O goes,
    there is no difference.  You can have a perfectly running system with
    high pagein and pageout rates.  It's only when the paging I/O starts
    to eat into pages that are in active use that thrashing begins to occur.

    Think of a hotdog being eaten from both ends by two lovers.  Memory
    pressure (active VM pages) eats away at one end, pageout I/O eats away
    at the other.  You don't get fireworks until they meet.
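
    The same picture as arithmetic, in a deliberately simplified sketch.
    The function and its parameters are hypothetical; it assumes memory
    splits cleanly into "active" and "cold" pages, which the real VM
    does not do so neatly.

```python
def pages_stolen_from_active(total_pages, active_pages, pageout_demand):
    """Pages that pageout would have to take from the active working set.

    Zero means the two ends have not met: pageout is satisfied entirely
    from cold pages, however high the paging rate is.
    """
    cold_pages = max(0, total_pages - active_pages)
    return max(0, pageout_demand - cold_pages)
```

    For example, with 1000 pages of memory and a pageout demand of 500
    pages, an active set of 300 pages gives 0 (no thrashing), while an
    active set of 800 pages gives 300 pages taken from active use.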
:>     ago that attempting to rate limit paging does not actually solve the
:>     thrashing problem, it actually makes it worse... So they solved the
:>     problem another way (see my answers for #1 and #2).  It isn't the
:>     paging operations themselves that cause thrashing.
:
:Agreed on all points ... I'm just wondering how well 1) and 2)
:still work after all the changes that were made to the VM in
:the last few years.  They sure are subtle ...
    The algorithms mostly stayed the same.  Much of the work was to remove
    artificial limitations that were reducing performance (due to the
    existence of greater amounts of memory, faster disks, and so forth...).
    I also spent a good deal of time removing 'restart' cases from the code
    that were wasting a lot of cpu in certain cases.  What few restart
    cases remain just don't occur all that often.  And I've done other
    things like extending the heuristics we already use for read()/write()
    to the VM system and changing heuristic variables into per-vm-map
    elements rather than sharing them with read/write within the vnode.  Etc.
:>     Small processes can contribute to thrashing as easily as large
:>     processes can under extreme memory pressure... for example,
:>     take an overloaded shell machine.  *ALL* processes are 'small'
:>     processes in that case, or most of them are, and in great numbers
:>     they can be the cause.  So no test that specifically checks the
:>     size of the process can be used to give it any sort of priority.
:
:There's a test related to 2) though ... A small process needs
:to be in memory less time than a big process in order to make
:progress, so it can be swapped out earlier.
    Not necessarily.  It depends on whether the small process is cpu-bound
    or interactive.  A cpu-bound small process should be allowed to run
    and not be swapped out.  An interactive small process can be safely
    swapped if idle for a period of time, because it can be swapped back
    in very quickly.  It should not be swapped if it isn't idle (someone is
    typing, for example), because that would just waste disk I/O paging out
    and then paging right back in.  You never want to swap out a small
    process gratuitously simply because it is small.
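
    Written as a predicate, the rule might look like the sketch below.
    The Proc record, its fields, and the idle threshold are hypothetical
    and chosen only for illustration; they do not correspond to kernel
    structures or tunables.

```python
from dataclasses import dataclass


@dataclass
class Proc:
    cpu_bound: bool       # True if the process is actively computing
    idle_seconds: float   # time since the process last ran or got input


IDLE_THRESHOLD = 10.0     # hypothetical idle period, in seconds


def may_swap_out(proc, idle_threshold=IDLE_THRESHOLD):
    """Swap-out candidate test that deliberately ignores process size."""
    if proc.cpu_bound:
        return False      # let it run; swapping it would only add I/O
    return proc.idle_seconds >= idle_threshold
```

    Note that size is not an input at all: an idle interactive process
    qualifies, while a cpu-bound one or one with someone typing at it
    does not, no matter how small it is.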
:It can also be swapped back in earlier, giving small processes
:shorter "time slices" for swapping than what large processes
:have.  I'm not quite sure how much this would matter, though...
    Both swapin and swapout activities are demand paged, but will be
    clustered if possible.  I don't think there would be any point
    trying to conditionalize the algorithm based on the size of the
    process.  The size has its own indirect positive effects which I
    think are sufficient.
:Interesting, FreeBSD indeed _does_ seem to have all of the things in
:place (though the interactions between the various parts seem to be
:carefully hidden ;)).
:
:They indeed should work for lots of scenarios, but things like the
:subtlety of some of the code and the fact that the swapin and
:swapout thresholds are fairly unrelated look a bit worrying...
:
:regards,
:
:Rik
    I don't think it's possible to write a nice neat thrash-handling
    algorithm.  It's a bunch of algorithms all working together, all
    closely tied to the VM page cache.  Each taken alone is fairly easy
    to describe and understand.  All of them together result in complex
    interactions that are very easy to break if you make a mistake.  It
    usually takes me a couple of tries to get a solution to a problem in
    place without breaking something else (performance-wise) in the
    process.  For example, I fubar'd heavy load performance for a month
    in FreeBSD-4.2 when I 'fixed' the pageout scan laundering algorithm.
						-Matt
