Date:      Fri, 22 Jan 1999 18:17:54 -0500 (EST)
From:      "John S. Dyson" <dyson@iquest.net>
To:        dillon@apollo.backplane.com (Matthew Dillon)
Cc:        dyson@iquest.net, hackers@FreeBSD.ORG
Subject:   Re: Error in vm_fault change
Message-ID:  <199901222317.SAA36795@y.dyson.net>
In-Reply-To: <199901221953.LAA56414@apollo.backplane.com> from Matthew Dillon at "Jan 22, 99 11:53:25 am"

Matthew Dillon said:
>
>     Basically what it comes down to is that I do not think it is appropriate
>     for there to be hacks all around the kernel to arbitrarily block processes
>     in low memory situations.  At the very worst, those same 'blockages' could
>     be implemented in one place - the memory allocator, and nowhere else.  But
>     we can do much better.
>
That isn't a hack.

> 
>     I like your RLIMIT_RSS code, but it isn't enough and I think it is
>     implemented in the wrong place.
>
I didn't guarantee that it was clean, for sure.

> 
>     We can play with the scheduling and enforcement algorithms much more 
>     easily this way, too.  What do you say ?
> 
But don't take away the capabilities that are already in the code.  The
algorithms already work well, and removing things with the mistaken notion
that something is being "fixed" isn't a good thing (please refer to the
comment that you added with the #if 0.)

vm_page_alloc isn't likely the right place to put the policy either.  Note
that vm_page_alloc doesn't assume any kind of process context (or shouldn't).
Even though I put the object rundown in vm_page_free, that should also not
be there.  (The best way to do that would be a light weight kernel thread, but
my kernel supports those things very inexpensively, unlike the BSD kernels.)

If you want to put a layer between fault and vm_page_alloc, that *might* make
sense.  However, vm_fault is the only normally "approved" place that pages
are read in and put into the address space (at least on a dynamic basis.)  The
prefaulting code does this also, but it is much more static and definitely more machine
dependent.  IMO, prefaulting should never cause competition for resources,
so shouldn't be applicable to this discussion.
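
To make the layering idea concrete, here is a rough user-space sketch; the
names (fault_alloc_page, page_alloc, fault_policy_check, struct proc_vm) are
invented for illustration and this is not the actual FreeBSD code.  The point
is simply that the raw allocator stays policy-free and the per-process check
lives in the fault path:

#include <stdio.h>
#include <stdlib.h>

#define PAGE_SIZE 4096

/* Hypothetical per-process bookkeeping; not a real kernel structure. */
struct proc_vm {
    size_t rss_pages;      /* pages currently resident */
    size_t rss_limit;      /* soft RSS limit in pages */
};

/* Low-level allocator: knows nothing about processes or policy. */
static void *page_alloc(void)
{
    return malloc(PAGE_SIZE);
}

/*
 * Policy layer, conceptually part of the fault path: decide whether the
 * faulting process may grow its resident set before touching the allocator.
 */
static int fault_policy_check(struct proc_vm *p)
{
    if (p->rss_pages >= p->rss_limit) {
        /* A real kernel would trim or block here; we just report it. */
        return 0;
    }
    return 1;
}

/* What the fault handler would call instead of the raw allocator. */
static void *fault_alloc_page(struct proc_vm *p)
{
    if (!fault_policy_check(p))
        return NULL;
    void *pg = page_alloc();
    if (pg != NULL)
        p->rss_pages++;
    return pg;
}

int main(void)
{
    struct proc_vm p = { .rss_pages = 0, .rss_limit = 4 };
    void *pages[6] = { 0 };

    for (int i = 0; i < 6; i++) {
        pages[i] = fault_alloc_page(&p);
        printf("fault %d: %s (rss=%zu)\n", i,
               pages[i] ? "allocated" : "blocked by policy", p.rss_pages);
    }
    for (int i = 0; i < 6; i++)
        free(pages[i]);
    return 0;
}

Everything the allocator does stays mechanical; the caller that has process
context makes the policy decision.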

Again, you CAN put a layer in between vm_fault and vm_page_alloc/lookup --
however the only place where the policy makes sense is in the fault code (or a
routine called by it.)  vm_page_alloc is called by low level kernel services
that should not have policy associated with them.

Note that I created the fault status block -- which can allow for layering in
the fault code itself.  One purpose of it was to allow for splitting the code
cheaply (the other was to guarantee small offsets relative to a base pointer,
shrinking the code further), and also to get rid of the UGLY macros.  The
fault code is much smaller than it used to be (and even smaller than UVM.)  It
might be a good thing to split that up -- but maybe not.  Also, the fault
status block might allow for handling faults through continuations -- but the
rest of the VM code doesn't know how to deal with process stacks appearing and
disappearing :-(.
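
As a rough illustration of why a status block makes splitting cheap, here is
a user-space sketch.  The struct fields and helper names below are invented
and far simpler than the real fault code; the idea is just that all fault
state travels in one structure, so the handler decomposes into small helpers
that each take a single pointer (and the compiler addresses the fields with
small offsets from one base register):

#include <stddef.h>
#include <stdio.h>

/* Invented fields, not the actual structure used by vm_fault.c. */
struct fault_state {
    void   *map;          /* address space being faulted on */
    void   *object;       /* backing object for the address */
    size_t  offset;       /* offset of the page within the object */
    int     fault_type;   /* read/write/execute */
    void   *page;         /* page found or allocated */
};

static char dummy_object, dummy_page;   /* stand-ins for real VM objects */

/* Each phase of the fault becomes a small function taking the block. */
static int fault_lookup(struct fault_state *fs)
{
    /* ... find the backing object and offset for the faulting address ... */
    fs->object = &dummy_object;
    return 0;
}

static int fault_get_page(struct fault_state *fs)
{
    /* ... find the page in the object, or allocate and page it in ... */
    fs->page = &dummy_page;
    return 0;
}

static void fault_enter(struct fault_state *fs)
{
    /* ... enter the page into the pmap for the faulting address ... */
    (void)fs;
}

int handle_fault(void *map, size_t offset, int fault_type)
{
    struct fault_state fs = {
        .map = map, .offset = offset, .fault_type = fault_type,
    };
    if (fault_lookup(&fs) != 0 || fault_get_page(&fs) != 0)
        return -1;
    fault_enter(&fs);
    return 0;
}

int main(void)
{
    printf("fault handled: %d\n", handle_fault(&dummy_object, 0, 0));
    return 0;
}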

Given the VM code (and the way that it should be), the fault code is the ONLY
place where process pages are/should be directly created (in normal global
memory conditions.)  I suggest that adding a layer between fault and
alloc/lookup is probably redundant, because the place for the code is
already there!!!  

The only thing that the code in vm_pageout should do is generally to manage
global conditions (I think that soft-RSS limiting is okay there also -- but
likely not optimal.)  It is probably best to do the local soft-RSS limiting
also where the pages are allocated and managed (in vm_fault.)  One (the
original) reason for putting the soft-RSS limiting in vm_pageout is that
RSS limiting is generally undesirable unless memory conditions are low.
However, the same algorithm can be invoked from vm_fault (in fact, vm_fault
can call the *exact* same code -- the correct routine in vm_pageout.)
Please don't go in the direction of doing local trimming as a primary page
management method -- then that starts to regress to broken NT stuff.
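
A toy sketch of the shared-routine idea follows; all names and numbers here
are invented, and the real code obviously works on vm_page queues rather
than counters.  One trim routine, conceptually living in the pageout code,
is cheap to call from the fault path because it does nothing unless memory
is actually short:

#include <stdio.h>

/* Hypothetical globals standing in for the kernel's page accounting. */
static unsigned free_pages  = 132;
static unsigned free_target = 128;   /* trim only below this threshold */

struct proc_rss {
    unsigned rss_pages;
    unsigned rss_soft_limit;
};

/*
 * One soft-RSS trim routine, shared by the fault path and the pageout
 * daemon.  With plenty of free memory it returns immediately.
 */
static void rss_soft_trim(struct proc_rss *p)
{
    if (free_pages >= free_target)
        return;                      /* plenty of memory: RSS limits idle */
    while (p->rss_pages > p->rss_soft_limit) {
        p->rss_pages--;              /* a real kernel would deactivate a page */
        free_pages++;
    }
}

/* The fault path calls the same routine after growing the resident set. */
static void fault_grew_rss(struct proc_rss *p)
{
    p->rss_pages++;
    free_pages--;
    rss_soft_trim(p);
}

int main(void)
{
    struct proc_rss p = { .rss_pages = 90, .rss_soft_limit = 64 };
    for (int i = 0; i < 6; i++) {
        fault_grew_rss(&p);
        printf("after fault %d: rss=%u free=%u\n", i, p.rss_pages, free_pages);
    }
    return 0;
}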

Frankly, the version of the hard-RSS limiting that I am running performs much
better than the earlier versions did, because I took some of the advice
that I gave you and implemented those capabilities.

Again, the code that you removed with '#if 0' is still operative in my kernel,
and the trimming works correctly.  Looking at the code that had been removed,
isn't it silly to have done so, when a more direct and correct method really
solves the problem? :-).

One thing about the VM code that is critical for the system to run correctly,
is that you must wake up the pageout daemon BEFORE you run out of memory.
Waiting until the system runs out of memory reduces the chance that the
pageout daemon can run in parallel with user processes (in the sense of
overlapping I/O completion, or on SMP machines).  Of course,
that is also predicated upon allowing the swap_pager to block :-).  If the
pageout code just blasts away clean pages after trying to launder a small
number of dirty pages, your system will also suffer by having to re-read
the clean pages that were aggressively freed.  Reading pages from swap is pretty
quick, but undesirable when they might have been kept with a more carefully
considered pageout policy.
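
Here is a small user-space simulation of the early-wakeup point, with a
pthread standing in for the pageout daemon; the names and thresholds are
invented, and nothing here is the kernel's actual wakeup path.  The only
thing it demonstrates is that waking the daemon at a low-water mark, rather
than at zero, lets reclamation overlap with the allocator's progress:

#include <pthread.h>
#include <stdio.h>
#include <unistd.h>

/* Hypothetical page accounting; names are illustrative, not kernel symbols. */
static pthread_mutex_t vm_lock = PTHREAD_MUTEX_INITIALIZER;
static pthread_cond_t  pageout_cv = PTHREAD_COND_INITIALIZER;
static int free_pages  = 64;
static int free_target = 32;   /* wake the daemon when we dip below this */
static int done        = 0;

/* Simulated pageout daemon: sleeps until woken, then reclaims a batch. */
static void *pageout_daemon(void *arg)
{
    (void)arg;
    pthread_mutex_lock(&vm_lock);
    while (!done) {
        while (free_pages >= free_target && !done)
            pthread_cond_wait(&pageout_cv, &vm_lock);
        if (done)
            break;
        /* "Launder" a batch of pages while allocators keep running. */
        free_pages += 16;
        printf("pageout: reclaimed 16 pages, free=%d\n", free_pages);
    }
    pthread_mutex_unlock(&vm_lock);
    return NULL;
}

/* Allocation path: wake the daemon *before* free_pages reaches zero. */
static void alloc_page(void)
{
    pthread_mutex_lock(&vm_lock);
    free_pages--;
    if (free_pages < free_target)
        pthread_cond_signal(&pageout_cv);   /* early wakeup, not at zero */
    pthread_mutex_unlock(&vm_lock);
}

int main(void)
{
    pthread_t td;
    pthread_create(&td, NULL, pageout_daemon, NULL);

    for (int i = 0; i < 100; i++) {
        alloc_page();
        usleep(1000);               /* the allocator keeps making progress */
    }

    pthread_mutex_lock(&vm_lock);
    done = 1;
    pthread_cond_signal(&pageout_cv);
    pthread_mutex_unlock(&vm_lock);
    pthread_join(td, NULL);

    printf("final free=%d\n", free_pages);
    return 0;
}

Because the wakeup happens a few pages early, the allocator never has to
stall waiting for the daemon to catch up from an empty free list.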

It only takes waking up the pageout daemon a few pages early for the
system to be able to take proper advantage of its behavior.  On large
systems, those "few" pages might number a few hundred, but who cares?
Mistakes made haphazardly by the pageout daemon cost a lot, because with
broken algorithms the daemon requires more I/O, even if it looks like there
is more free memory.  I/O and latency are the parameters that need to be
optimized; who cares about any notion of "free" memory when it doesn't mean
anything to the user -- due to excessive paging?

Note that it isn't usually pageouts that make processes (or systems) run
slowly; it is the lack of memory for page allocation, or the latency of
reading needed pages back from the disk.  Since you can only write out so
many pages per second, since the pageout daemon policy is to be woken up
BEFORE the system is out of memory, and since the cache queue (which is
fairly large, given proper policy) can be used to fulfill memory needs,
having the pageout daemon go "nuts" by recklessly freeing or caching pages
isn't desirable behavior.
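
A toy model of how the cache queue buffers allocation follows (again,
invented structures, not the kernel's vm_page queues): allocation prefers
the free list and only reclaims a clean cached page when the free list is
empty, which is the point at which that page's contents are finally lost:

#include <stdio.h>
#include <stdlib.h>

struct page {
    int          id;
    struct page *next;
};

static struct page *free_list;    /* truly free pages */
static struct page *cache_queue;  /* clean pages, contents still valid */

static void push(struct page **q, struct page *p) { p->next = *q; *q = p; }

static struct page *pop(struct page **q)
{
    struct page *p = *q;
    if (p != NULL)
        *q = p->next;
    return p;
}

/*
 * Allocation prefers the free list; only when it is empty is a clean page
 * reclaimed from the cache queue.  Until that moment a cached page could
 * have been reactivated cheaply, which is what makes the queue a buffer
 * against an over-eager pageout daemon.
 */
static struct page *alloc_page(void)
{
    struct page *p = pop(&free_list);
    if (p == NULL) {
        p = pop(&cache_queue);
        if (p != NULL)
            printf("reusing cached page %d (contents discarded)\n", p->id);
    }
    return p;
}

int main(void)
{
    /* Seed: one free page, two clean pages parked on the cache queue. */
    for (int i = 0; i < 3; i++) {
        struct page *p = malloc(sizeof(*p));
        p->id = i;
        push(i == 0 ? &free_list : &cache_queue, p);
    }

    for (int i = 0; i < 4; i++) {
        struct page *p = alloc_page();
        printf("alloc %d -> %s\n", i, p ? "got a page" : "nothing left");
        free(p);    /* free(NULL) is harmless */
    }
    return 0;
}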

The cache queue does buffer that errant behavior, but it is very unwise to
forget that buffering the pageout daemon's errant behavior is all it does.  I
really hope that isn't what is happening now.

-- 
John                  | Never try to teach a pig to sing,
dyson@iquest.net      | it makes one look stupid
jdyson@nc.com         | and it irritates the pig.
