Date:      Mon, 7 May 2001 15:50:20 -0700 (PDT)
From:      Matt Dillon <dillon@earth.backplane.com>
To:        Rik van Riel <riel@conectiva.com.br>
Cc:        arch@freebsd.org, linux-mm@kvack.org, sfkaplan@cs.amherst.edu
Subject:   Re: on load control / process swapping
Message-ID:  <200105072250.f47MoKe68863@earth.backplane.com>
References:   <Pine.LNX.4.21.0105061924160.582-100000@imladris.rielhome.conectiva>

:In short, the process suspension / wake up code only does
:load control in the sense that system load is reduced, but
:absolutely no effort is made to ensure that individual
:programs can run without thrashing. This, of course, kind of
:defeats the purpose of doing load control in the first place.
:
:
:To see this situation in some more detail, let's first look
:at how the current process suspension code has evolved over
:time.  Early paging Unixes, including earlier BSDs, had a
:rate-limited clock algorithm for the pageout code, where
:the VM subsystem would only scan (and page) memory out at
:a rate of fastscan pages per second.
:
:Whenever the paging system wasn't able to keep up, free
:memory would get below a certain threshold and memory load
:control (in the form of process suspension) kicked in.  As
:soon as free memory (averaged over a few seconds) got over
:this threshold, processes get swapped in again.  Because of
:the exact "speed limit" for the paging code, this would give
:a slow rotation of memory-resident processes at a paging rate
:well below the thrashing threshold.
:
:More modern Unixes, like FreeBSD, NetBSD or Linux, however,
:don't have the artificial speed limit on pageout.  This means
:the pageout code can go on freeing memory until well beyond
:the thrashing point of the system.  It also means that the
:amount of free memory is no longer any indication of whether
:the system is thrashing or not.
:
:Add to that the fact that the classical load control in BSD
:resumes a suspended task whenever the system is above the
:(now not very meaningful) free memory threshold, regardless
:of whether the resident tasks have had the opportunity to
:make any progress ... which of course only encourages more
:thrashing instead of letting the system work itself out of
:the overload situation.
:
:
:Any solution will have to address the following points:
:
:1) allow the resident processes to stay resident long
:   enough to make progress

    This is accomplished as a side effect of the way the page queues
    are handled.  A page placed in the active queue is not allowed
    to be moved out of that queue for a minimum period of time based
    on page aging.  See line 500 or so of vm_pageout.c (in -stable).

    Thus when a process wakes up and pages a bunch of pages in, those
    pages are guaranteed to stay in-core for a period of time no matter
    what level of memory stress is occurring.
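
    As a rough illustration of the aging idea (a toy model, not the
    code in vm_pageout.c): give each active page an activity counter,
    decay it on every scan in which the page was not referenced, and
    only let the page leave the active queue once the counter reaches
    zero.  The Page class, scan_active(), and the constant values are
    assumptions made up for this sketch.

```python
from dataclasses import dataclass

# Illustrative values only; assumptions of this sketch.
ACT_INIT = 5      # starting activity count for a newly activated page
ACT_DECLINE = 1   # decay per scan for an unreferenced page
ACT_ADVANCE = 3   # boost per scan for a referenced page
ACT_MAX = 64      # ceiling on the activity count


@dataclass
class Page:
    name: str
    act_count: int = ACT_INIT
    referenced: bool = False


def scan_active(active, inactive):
    """One pass over the active queue; mutates both lists in place."""
    still_active = []
    for page in active:
        if page.referenced:
            # Recently used: boost the counter and clear the reference bit.
            page.act_count = min(page.act_count + ACT_ADVANCE, ACT_MAX)
            page.referenced = False
            still_active.append(page)
        else:
            # Not used since the last scan: decay, deactivate only at zero.
            page.act_count = max(page.act_count - ACT_DECLINE, 0)
            if page.act_count == 0:
                inactive.append(page)
            else:
                still_active.append(page)
    active[:] = still_active


def scans_until_deactivated(page):
    """Count how many scans it takes before the page leaves the active queue."""
    active, inactive = [page], []
    scans = 0
    while active:
        scan_active(active, inactive)
        scans += 1
    return scans
```

    In this model a freshly activated page survives several scans even
    if it is never touched again, which is the minimum residency
    described above.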

:2) make sure the resident processes aren't thrashing,
:   that is, don't let new processes back in memory if
:   none of the currently resident processes is "ready"
:   to be suspended

    When a process is swapped out, the process is removed from the run
    queue and the P_INMEM flag is cleared.  The process is only woken up
    when faultin() is called (vm_glue.c line 312).  faultin() is only
    called from the scheduler() (line 340 of vm_glue.c) and the scheduler
    only runs when the VM system indicates a minimum number of free pages
    are available (vm_page_count_min()), which you can adjust with
    the vm.v_free_min sysctl (usually represents 1-9 megabytes, depending
    on how much memory the system has).

    So what occurs is that the system comes under extreme memory pressure
    and starts to swapout blocked processes.  This reduces memory pressure
    over time.  When memory pressure is sufficiently reduced, the scheduler
    wakes up a swapped-out process (one at a time).
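
    The gate described above can be sketched roughly as follows.  This
    is a simplified model, not the code in vm_glue.c: scheduler_tick()
    and its parameters are hypothetical names, and picking the process
    from the front of a queue is an assumption of the sketch.

```python
from collections import deque


def scheduler_tick(free_pages, free_min, swapped_out, resident):
    """Swap in at most one process, and only when memory is above the floor.

    free_pages  -- pages currently available (free + cache in this model)
    free_min    -- the floor, standing in for vm.v_free_min
    swapped_out -- deque of swapped-out processes (assumed ordering)
    resident    -- list of processes currently in memory

    Returns the process brought in, or None if nothing was swapped in.
    """
    if free_pages < free_min or not swapped_out:
        # Still under pressure (or nothing to do): leave everyone swapped out.
        return None
    proc = swapped_out.popleft()
    resident.append(proc)
    return proc
```

    Calling it repeatedly brings processes back one per call, and only
    on calls where free_pages is at or above free_min.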

    There might be some fine tuning that we can do here, such as trying to
    choose a better process to swap out (right now it's priority based, which
    isn't the best way to do it).

:3) have a mechanism to detect thrashing in a VM
:   subsystem which isn't rate-limited  (hard?)

    In FreeBSD, rate-limiting is a function of a lightly loaded system.
    We rate-limit page laundering (pageouts).  However, if the rate-limited
    laundering is not sufficient to reach our free + cache page targets,
    we take another laundering loop and this time do not limit it at all.

    Thus under heavy memory pressure, no real rate limiting occurs.  The
    system will happily pagein and pageout megabytes/sec.  The reason we
    do this is because David Greenman and John Dyson found a long time
    ago that attempting to rate limit paging does not actually solve the
    thrashing problem; it makes it worse.  So they solved the
    problem another way (see my answers for #1 and #2).  It isn't the
    paging operations themselves that cause thrashing.
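
    A minimal sketch of the two-pass laundering described above, under
    stated assumptions: launder() is a hypothetical function, the
    shortage is a single page count rather than separate free and cache
    targets, and max_launder stands for whatever cap the first pass
    uses.

```python
def launder(dirty_pages, shortage, max_launder):
    """Return how many pages get laundered to cover a page shortage.

    Pass 1 is capped at max_launder.  If that does not cover the
    shortage, pass 2 runs with no cap and is limited only by how many
    dirty pages are left.
    """
    # Pass 1: rate-limited laundering.
    laundered = min(dirty_pages, shortage, max_launder)
    remaining = shortage - laundered
    if remaining > 0:
        # Pass 2: targets not met, so launder without the cap.
        laundered += min(dirty_pages - laundered, remaining)
    return laundered
```

    With a small shortage only the capped pass runs; with a large one
    the cap effectively disappears, which matches the "no real rate
    limiting under heavy pressure" behaviour.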

:and, for extra brownie points:
:4) fairness, small processes can be paged in and out
:   faster, so we can suspend&resume them faster; this
:   has the side effect of leaving the proverbial root
:   shell more usable

    Small processes can contribute to thrashing as easily as large
    processes can under extreme memory pressure... for example,
    take an overloaded shell machine.  *ALL* processes are 'small'
    processes in that case, or most of them are, and in great numbers
    they can be the cause.  So no test that specifically checks the
    size of the process can be used to give it any sort of priority.

    Additionally, *idle* small processes are also great contributors
    to the VM subsystem with regard to clearing out idle pages.  For
    example, on a heavily loaded shell machine more than 80% of the
    'small processes' have been idle for long periods of time and it
    is exactly our ability to page them out that allows us to extend
    the machine's operational life and move the thrashing threshold
    farther away.  The last thing we want to do is make a 'fix' that
    prevents us from paging out idle small processes.  It would kill
    the machine.

:5) make sure already resident processes cannot create
:   a situation that'll keep the swapped out tasks out
:   of memory forever ... but don't kill performance either,
:   since bad performance means we cannot get out of the
:   bad situation we're in

    When the system starts swapping processes out, it continues to swap
    them out until memory pressure goes down.  With memory pressure down
    processes are swapped back in again one at a time, typically in FIFO
    order.  So this situation will generally not occur.
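
    The swap-out side can be modelled in a few lines.  This is a toy
    sketch, not kernel code: swap_out_until_relieved() is a hypothetical
    name, and taking victims from the front of the resident list is an
    assumption (the real choice is priority based, as noted earlier).

```python
from collections import deque


def swap_out_until_relieved(resident, free_pages, free_target):
    """Swap out resident processes until free_pages reaches free_target.

    resident is a list of (name, pages) tuples and is mutated in place.
    Returns (swapped_out, free_pages), where swapped_out is a deque in
    swap-out order, so popleft() later yields FIFO swap-in order.
    """
    swapped_out = deque()
    while free_pages < free_target and resident:
        name, pages = resident.pop(0)
        swapped_out.append((name, pages))
        free_pages += pages   # the victim's pages become available
    return swapped_out, free_pages
```

    Because the queue is drained from the front, the process that has
    been out longest is the first one back, so none of them is left
    out indefinitely once pressure drops.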

    Basically we have all the algorithms in place to deal with thrashing.
    I'm sure that there are a few places where we can optimize things...
    for example, we can certainly tune the swapout algorithm itself.

						-Matt

:regards,
:
:Rik
:--
:Virtual memory is like a game you can't win;
