Date: Mon, 7 May 2001 17:56:01 -0700 (PDT)
From: Matt Dillon <dillon@earth.backplane.com>
To: Rik van Riel <riel@conectiva.com.br>
Cc: <arch@freebsd.org>, <linux-mm@kvack.org>, <sfkaplan@cs.amherst.edu>
Subject: Re: on load control / process swapping
Message-ID: <200105080056.f480u1Q71866@earth.backplane.com>
References: <Pine.LNX.4.33.0105071956180.18102-100000@duckman.distro.conectiva>
:> to be moved out of that queue for a minimum period of time based
:> on page aging. See line 500 or so of vm_pageout.c (in -stable).
:>
:> Thus when a process wakes up and pages a bunch of pages in, those
:> pages are guaranteed to stay in-core for a period of time no matter
:> what level of memory stress is occurring.
:
:I don't see anything limiting the speed at which the active list
:is scanned over and over again. OTOH, you are right that a failure
:to deactivate enough pages will trigger the swapout code .....
:
:This sure is a subtle interaction ;)
Look at the loop at line 1362 of vm_pageout.c. Note that it enforces
a HZ/2 tsleep (2 scans per second) if the pageout daemon is unable
to clean sufficient pages in two loops. The tsleep is not woken up
by anyone while waiting that 1/2 second because vm_pages_needed has
not been cleared yet. This is what is limiting the page queue scan.
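To make the rate limit concrete, here is a small standalone C sketch of
the shape of that loop. It is not the kernel code: tsleep() and
vm_pageout_scan() are replaced by stubs, the constants are made up, and
only the pass-counting and the hz/2 pause follow the description above.

    /*
     * Condensed, user-space sketch of the pageout daemon's rate limit.
     * Names mirror vm_pageout.c, but the helpers are stubs so the shape
     * of the logic can be compiled and run standalone.
     */
    #include <stdio.h>
    #include <unistd.h>

    static int vm_pages_needed = 1;     /* set while there is a page shortage */
    static int hz = 100;                /* ticks per second */

    /* Stub for tsleep(): just sleep for 'timo' ticks. */
    static void tsleep_stub(const char *wmesg, int timo)
    {
            (void)wmesg;
            if (timo > 0)
                    usleep(timo * (1000000 / hz));
    }

    /* Stub for vm_pageout_scan(): returns 0 once enough pages were freed. */
    static int vm_pageout_scan_stub(int iteration)
    {
            return iteration < 4 ? -1 : 0;
    }

    int main(void)
    {
            int pass = 0;

            for (int i = 0; vm_pages_needed; i++) {
                    if (pass > 1) {
                            /*
                             * Two scans in a row failed to free enough pages.
                             * Sleep hz/2 ticks; nothing wakes us early because
                             * vm_pages_needed is still set, so the queues are
                             * scanned at most about twice per second.
                             */
                            printf("scan %d: enforcing hz/2 pause\n", i);
                            tsleep_stub("psleep", hz / 2);
                    }
                    if (vm_pageout_scan_stub(i) == 0) {
                            vm_pages_needed = 0;    /* shortage relieved */
                            pass = 0;
                            printf("scan %d: freed enough pages\n", i);
                    } else {
                            pass++;
                    }
            }
            return 0;
    }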
:> When a process is swapped out, the process is removed from the run
:> queue and the P_INMEM flag is cleared. The process is only woken up
:> when faultin() is called (vm_glue.c line 312). faultin() is only
:> called from the scheduler() (line 340 of vm_glue.c) and the scheduler
:> only runs when the VM system indicates a minimum number of free pages
:> are available (vm_page_count_min()), which you can adjust with
:> the vm.v_free_min sysctl (usually represents 1-9 megabytes, depending
:> on how much memory the system has).
:
:But ... is this a good enough indication that the processes
:currently resident have enough memory available to make any
:progress ?
Yes. Consider detecting the difference between a large process accessing
its pages randomly, and a small process accessing a relatively small
set of pages over and over again. Now consider what happens when the
system gets overloaded. The small process will be able to access its
pages enough that they will get page priority over the larger process.
The larger process, due to its more random accesses (or simply the fact
that it is accessing a larger set of pages), will tend to stall more on
pagein I/O, which has the side effect of reducing the large process's
access rate on all of its pages. The result: small processes get more
priority just by being small.
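The mechanism quoted above (scheduler() gating faultin() on free pages)
can be pictured with the following standalone sketch. It is not the
vm_glue.c code: the thresholds are invented and every function is a stub.

    /*
     * User-space sketch of the swapin gating described above: a swapped
     * process is only faulted back in when the free page count is above
     * the minimum.  Names echo vm_glue.c but everything here is a stub.
     */
    #include <stdio.h>
    #include <stdbool.h>

    static int v_free_count = 100;      /* current free pages (made up)      */
    static int v_free_min   = 512;      /* vm.v_free_min threshold (made up) */

    /* Stand-in for vm_page_count_min(): true while free memory is below the minimum. */
    static bool vm_page_count_min(void)
    {
            return v_free_count < v_free_min;
    }

    /* Stand-in for faultin(): bring a swapped-out process back into memory. */
    static void faultin(int pid)
    {
            printf("faultin: swapping pid %d back in\n", pid);
    }

    /*
     * Stand-in for scheduler(): pick a swapped process and bring it in,
     * but only when the VM system reports enough free pages.
     */
    static void scheduler(int swapped_pid)
    {
            if (vm_page_count_min()) {
                    printf("scheduler: below vm.v_free_min, leaving pid %d swapped out\n",
                        swapped_pid);
                    return;
            }
            faultin(swapped_pid);
    }

    int main(void)
    {
            scheduler(42);              /* blocked: free pages below the minimum */
            v_free_count = 1024;        /* memory has been freed up */
            scheduler(42);              /* now the process is faulted back in */
            return 0;
    }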
:Especially if all the currently resident processes are waiting
:in page faults, won't that make it easier for the system to find
:pages to swap out, etc... ?
:
:One thing I _am_ wondering though: the pageout and the pagein
:thresholds are different. Can't this lead to problems where we
:always hit both the pageout threshold -and- the pagein threshold
:and the system thrashes swapping processes in and out ?
The system will not page out a page it has just paged in due to the
center-of-the-road initialization of act_count (the page aging).
My experience at BEST was that both pagein and pageout activity
occurred simultaneously, but the fact had no detrimental effect on
the system. You have to treat the pagein and pageout operations
independently because, in fact, they are only weakly related to each
other. The only optimization you make, to reduce thrashing, is to
not allow a just-paged-in page to immediately turn around and be paged
out.
I could probably make this work even better by setting the vm_page_t's
act_count to its max value when paging in from swap. I'll think about
doing that.
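The act_count aging that provides this protection can be illustrated with
a small standalone sketch. The constants and helpers below are stand-ins
for the ones in vm_page.h and vm_pageout.c, not the real definitions, and
the swap-in change floated above appears only as a comment.

    /*
     * Sketch of act_count aging: a freshly paged-in page starts with a
     * center-of-the-road act_count, so the pageout scan must pass over it
     * several times before it can be deactivated (and later paged out).
     */
    #include <stdio.h>
    #include <stdbool.h>

    #define ACT_DECLINE     1       /* aged off each unreferenced scan pass */
    #define ACT_ADVANCE     3       /* credited when the page was touched   */
    #define ACT_INIT        5       /* center-of-the-road starting value    */
    #define ACT_MAX         64      /* ceiling                              */

    struct vm_page_sketch {
            int     act_count;
    };

    /* Called when a page is brought in: give it the default protection. */
    static void page_brought_in(struct vm_page_sketch *m)
    {
            m->act_count = ACT_INIT;
            /*
             * The change floated in the mail would use ACT_MAX here when
             * the page comes in from swap, protecting a just-swapped-in
             * working set for even longer.
             */
    }

    /* One pass of the active-queue scan over a single page. */
    static bool scan_pass(struct vm_page_sketch *m, bool referenced)
    {
            if (referenced) {
                    m->act_count += ACT_ADVANCE;
                    if (m->act_count > ACT_MAX)
                            m->act_count = ACT_MAX;
                    return false;                   /* stays active */
            }
            m->act_count -= ACT_DECLINE;
            return m->act_count <= 0;               /* deactivate when aged out */
    }

    int main(void)
    {
            struct vm_page_sketch m;
            int passes = 1;

            page_brought_in(&m);
            /* Even if never referenced again, the page survives several passes. */
            while (!scan_pass(&m, false))
                    passes++;
            printf("page deactivated after %d unreferenced passes\n", passes);
            return 0;
    }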
The pagein and pageout rates have nothing to do with thrashing, per se,
and should never be arbitrarily limited. Consider the difference
between a system that is paging heavily and a system with only two small
processes (like cp's) competing for disk I/O. Insofar as I/O goes,
there is no difference. You can have a perfectly running system with
high pagein and pageout rates. It's only when the paging I/O starts
to eat into pages that are in active use that thrashing begins to occur.
Think of a hotdog being eaten from both ends by two lovers. Memory
pressure (active VM pages) eats away at one end, pageout I/O eats away
at the other. You don't get fireworks until they meet.
:> ago that attempting to rate limit paging does not actually solve the
:> thrashing problem, it actually makes it worse... So they solved the
:> problem another way (see my answers for #1 and #2). It isn't the
:> paging operations themselves that cause thrashing.
:
:Agreed on all points ... I'm just wondering how well 1) and 2)
:still work after all the changes that were made to the VM in
:the last few years. They sure are subtle ...
The algorithms mostly stayed the same. Much of the work was to remove
artificial limitations that were reducing performance (due to the
existence of greater amounts of memory, faster disks, and so forth...).
I also spent a good deal of time removing 'restart' cases from the code
that were causing a lot of cpu wastage in certain cases. What few
restart cases remain just don't occur all that often. And I've done
other things like extend the heuristics we already use for read()/write()
to the VM system and change heuristic variables into per-vm-map elements
rather than sharing them with read/write within the vnode. Etc.
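The per-vm-map heuristic can be pictured as something like the following
standalone sketch: track the last faulting offset per mapping so a
sequential pattern widens fault-time read-ahead, instead of sharing one
counter per vnode with read()/write(). The structure and field names
here are invented for illustration; they are not the kernel's.

    /*
     * Illustrative per-mapping sequential-access heuristic.  A real
     * implementation would hang this state off the vm_map_entry; here it
     * is just a small struct so the idea can be run standalone.
     */
    #include <stdio.h>

    #define PAGE_SIZE       4096
    #define RA_MAX_PAGES    16

    struct map_entry_sketch {
            long    last_fault_off;     /* offset of the previous fault       */
            int     ra_pages;           /* current read-ahead window in pages */
    };

    /* Decide how many pages to read ahead for a fault at 'off'. */
    static int fault_readahead(struct map_entry_sketch *e, long off)
    {
            if (off == e->last_fault_off + PAGE_SIZE) {
                    /* Sequential pattern: widen the window, up to a cap. */
                    if (e->ra_pages < RA_MAX_PAGES)
                            e->ra_pages *= 2;
            } else {
                    /* Random access: collapse back to plain demand paging. */
                    e->ra_pages = 1;
            }
            e->last_fault_off = off;
            return e->ra_pages;
    }

    int main(void)
    {
            struct map_entry_sketch e = { .last_fault_off = -1, .ra_pages = 1 };

            for (long off = 0; off < 8 * PAGE_SIZE; off += PAGE_SIZE)
                    printf("fault at %ld -> read ahead %d pages\n",
                        off, fault_readahead(&e, off));
            return 0;
    }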
:> Small process can contribute to thrashing as easily as large
:> processes can under extreme memory pressure... for example,
:> take an overloaded shell machine. *ALL* processes are 'small'
:> processes in that case, or most of them are, and in great numbers
:> they can be the cause. So no test that specifically checks the
:> size of the process can be used to give it any sort of priority.
:
:There's a test related to 2) though ... A small process needs
:to be in memory less time than a big process in order to make
:progress, so it can be swapped out earlier.
Not necessarily. It depends on whether the small process is cpu-bound
or interactive. A cpu-bound small process should be allowed to run
and not swapped out. An interactive small process can be safely
swapped if idle for a period of time, because it can be swapped back
in very quickly. It should not be swapped if it isn't idle (someone is
typing, for example), because that would just waste disk I/O paging out
and then paging right back in. You never want to swap out a small
process gratuitously simply because it is small.
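That policy, selecting swapout candidates by idle time rather than by
size, can be sketched as below. The field and threshold names are
illustrative, not taken from vm_glue.c.

    /*
     * Sketch of the swapout policy argued for above: a process becomes a
     * swapout candidate because it has been idle long enough, never merely
     * because it is small.
     */
    #include <stdio.h>
    #include <stdbool.h>

    #define SWAP_IDLE_SECS  10          /* illustrative idle threshold */

    struct proc_sketch {
            const char      *name;
            bool            running;    /* on the run queue (cpu-bound)         */
            int             idle_secs;  /* seconds spent sleeping               */
            long            rss_pages;  /* resident size, deliberately ignored  */
    };

    static bool swapout_candidate(const struct proc_sketch *p)
    {
            if (p->running)
                    return false;       /* cpu-bound: leave it alone             */
            if (p->idle_secs < SWAP_IDLE_SECS)
                    return false;       /* recently active (someone typing, say) */
            return true;                /* idle long enough; size is irrelevant  */
    }

    int main(void)
    {
            struct proc_sketch procs[] = {
                    { "small-cpu-hog",   true,  0,   200 },
                    { "idle-shell",      false, 600, 150 },
                    { "busy-editor",     false, 2,   300 },
                    { "big-idle-daemon", false, 900, 50000 },
            };

            for (unsigned i = 0; i < sizeof(procs) / sizeof(procs[0]); i++)
                    printf("%-16s swapout candidate: %s\n", procs[i].name,
                        swapout_candidate(&procs[i]) ? "yes" : "no");
            return 0;
    }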
:It can also be swapped back in earlier, giving small processes
:shorter "time slices" for swapping than what large processes
:have. I'm not quite sure how much this would matter, though...
Both swapin and swapout activities are demand paged, but will be
clustered if possible. I don't think there would be any point
trying to conditionalize the algorithm based on the size of the
process. The size has its own indirect positive effects which I
think are sufficient.
:Interesting, FreeBSD indeed _does_ seem to have all of the things in
:place (though the interactions between the various parts seem to be
:carefully hidden ;)).
:
:They indeed should work for lots of scenarios, but things like the
:subtlety of some of the code and the fact that the swapin and
:swapout thresholds are fairly unrelated look a bit worrying...
:
:regards,
:
:Rik
I don't think it's possible to write a nice neat thrash-handling
algorithm. It's a bunch of algorithms all working together, all
closely tied to the VM page cache. Each taken alone is fairly easy
to describe and understand. All of them together result in complex
interactions that are very easy to break if you make a mistake. It
usually takes me a couple of tries to get a solution to a problem in
place without breaking something else (performance-wise) in the
process. For example, I fubar'd heavy load performance for a month
in FreeBSD-4.2 when I 'fixed' the pageout scan laundering algorithm.
-Matt