Date: Fri, 22 Jan 1999 18:17:54 -0500 (EST)
From: "John S. Dyson" <dyson@iquest.net>
To: dillon@apollo.backplane.com (Matthew Dillon)
Cc: dyson@iquest.net, hackers@FreeBSD.ORG
Subject: Re: Error in vm_fault change
Message-ID: <199901222317.SAA36795@y.dyson.net>
In-Reply-To: <199901221953.LAA56414@apollo.backplane.com> from Matthew Dillon at "Jan 22, 99 11:53:25 am"
Matthew Dillon said:
> Basically what it comes down to is that I do not think it is appropriate
> for there to be hacks all around the kernel to arbitrarily block processes
> in low memory situations.  At the very worst, those same 'blockages' could
> be implemented in one place - the memory allocator, and nowhere else.  But
> we can do much better.

That isn't a hack.

> I like your RLIMIT_RSS code, but it isn't enough and I think it is
> implemented in the wrong place.

Didn't guarantee that was clean, for sure.

> We can play with the scheduling and enforcement algorithms much more
> easily this way, too.  What do you say?

But don't take away the capabilities that are already in the code.  The
algorithms already work well, and removing things with the mistaken notion
that something is being "fixed" isn't a good thing (please refer to the
comment that you added with the #if 0.)

vm_page_alloc isn't likely the right place to put the policy either.  Note
that vm_page_alloc doesn't assume any kind of process context (or shouldn't.)
Even though I put the object rundown in vm_page_free, that should also not
be there.  (The best way to do that would be a lightweight kernel thread,
but my kernel supports those things very inexpensively, unlike the BSD
kernels.)

If you want to put a layer between fault and vm_page_alloc, that *might*
make sense.  However, vm_fault is the only normally "approved" place where
pages are read in and put into the address space (at least on a dynamic
basis.)  The prefaulting does also, but that is much more static and
definitely more machine dependent.  IMO, prefaulting should never cause
competition for resources, so it shouldn't be applicable to this discussion.

Again, you CAN put a layer in between vm_fault and vm_page_alloc/lookup --
however the only place where the policy makes sense is in the fault code
(or a routine called by it.)  vm_page_alloc is called by low-level kernel
services that should not have policy associated with them.

Note that I created the fault status block -- which can allow for layering
in the fault code itself.  One purpose of it was to allow for splitting the
code cheaply (the other was to guarantee small offsets relative to a base
pointer, shrinking the code further), and also to get rid of the UGLY
macros.  The fault code is much smaller than it used to be (and even
smaller than UVM.)  It might be a good thing to split that up -- but maybe
not.  Also, the fault status block might allow for handling faults through
continuations -- but the rest of the VM code doesn't know how to deal with
process stacks appearing and disappearing :-(.

Given the VM code (and the way that it should be), the fault code is the
ONLY place where process pages are/should be directly created (in normal
global memory conditions.)  I suggest that adding a layer between fault and
alloc/lookup is probably redundant, because the place for the code is
already there!!!

The only thing that the code in vm_pageout should do is generally to manage
global conditions (I think that soft-RSS limiting is okay there also -- but
likely not optimal.)  It is probably best to do the local soft-RSS limiting
also where the pages are allocated and managed (in vm_fault.)  One (the
original) reason for putting the soft-RSS limiting in vm_pageout is that
RSS limiting is generally undesirable unless memory conditions are low.
However, the same algorithm can be put into vm_fault (in fact, the *exact*
same code is called from vm_fault into the correct routine in vm_pageout.)
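As a rough illustration of where that policy split would live, here is a
self-contained toy model in user-space C.  The names (fakeproc, page_alloc,
trim_rss, fault_one_page, free_target) are all made up for the example and
are not the actual FreeBSD VM interfaces; the point is only that the
allocator stays policy-free while the fault path decides when to trim a
resident set, and only when global memory is tight.

    /*
     * Toy model (user-space C, invented names) of keeping policy out of
     * the page allocator: the allocator only hands out pages, and the
     * per-process soft-RSS decision is made in the fault path, applied
     * only when global memory is actually tight.
     */
    #include <stdio.h>

    struct fakeproc {
        const char   *name;
        unsigned long rss;        /* resident pages */
        unsigned long rss_limit;  /* soft RSS limit, in pages */
    };

    static unsigned long free_pages;              /* global free pool */
    static const unsigned long free_target = 64;  /* "memory is low" point */

    /* Allocator: knows nothing about processes or policy. */
    static int
    page_alloc(void)
    {
        if (free_pages == 0)
            return -1;
        free_pages--;
        return 0;
    }

    /* Stand-in for asking the pageout side to trim a resident set. */
    static void
    trim_rss(struct fakeproc *p, unsigned long excess)
    {
        printf("trimming %lu pages from %s\n", excess, p->name);
        p->rss -= excess;
        free_pages += excess;
    }

    /* Fault path: the one place the per-process policy is applied. */
    static int
    fault_one_page(struct fakeproc *p)
    {
        if (p->rss > p->rss_limit && free_pages < free_target)
            trim_rss(p, p->rss - p->rss_limit);
        if (page_alloc() != 0)
            return -1;
        p->rss++;
        return 0;
    }

    int
    main(void)
    {
        struct fakeproc p = { "hog", 200, 128 };

        free_pages = 32;          /* simulate a low-memory condition */
        if (fault_one_page(&p) == 0)
            printf("%s: %lu resident, %lu free\n", p.name, p.rss, free_pages);
        return 0;
    }

Run standalone, this trims the over-limit process back to its soft limit
before the allocation proceeds, while page_alloc() itself never has to know
about RSS limits.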
Please don't go in the direction of doing local trimming as a primary page
management method -- then that starts to regress to broken NT stuff.
Frankly, the version of the hard-RSS limiting that I am running performs
much better than earlier versions did, but I took some of the advice that I
gave you and implemented those capabilities.  Again, the code that you
removed with '#if 0' is still operative in my kernel, and the trimming
works correctly.  Looking at the code that had been removed, isn't it silly
to have done so, when a more direct and correct method really solves the
problem? :-).

One thing about the VM code that is critical for the system to run
correctly is that you must wake up the pageout daemon BEFORE you run out of
memory (see the sketch below.)  Waiting until the system runs out of memory
reduces the chance that the pageout daemon can run in parallel (in the
sense of waiting for I/O completion, or in SMP applications) with user
processes.  Of course, that is also predicated upon allowing the swap_pager
to block :-).

If the pageout code just blasts away clean pages after trying to launder a
small number of dirty pages, your system will also suffer, by having to
re-read those aggressively freed clean pages.  Reading pages from swap is
pretty quick, but undesirable when they might have been kept with a more
carefully considered pageout policy.

It only takes waking up the pageout daemon a few pages early for the system
to be able to take proper advantage of its behavior.  On large systems,
those "few" pages might number a few hundred, but who cares?  Allowing the
pageout daemon to make mistakes haphazardly costs a lot, because with
broken algorithms the daemon requires more I/O, even if it looks like there
is more free memory.  I/O and latency are the parameters that need to be
optimized; who cares about any notion of "free" memory when it doesn't mean
anything to the user -- due to excessive paging?

Note that it isn't usually pageouts that make processes (systems) run slow,
it is the lack of memory for page allocation, or the latency of reading
needed pages from the disk.  Since you can only write out so many pages per
second, since the pageout daemon policy is to be woken up BEFORE the system
is out of memory, and since the cache queue (which is fairly large, given
proper policy) can be used to fulfill memory needs, there is no need for
the pageout daemon to go "nuts" by recklessly freeing or caching pages.
The cache queue does buffer that errant behavior, but forgetting that it
only buffers the errant behavior of the pageout daemon is very unwise.  I
really hope that isn't what is happening now.

--
John             | Never try to teach a pig to sing,
dyson@iquest.net | it makes one look stupid
jdyson@nc.com    | and it irritates the pig.
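As a companion to the early-wakeup point above, a similar toy model (again
user-space C with invented names, not the real vm_pageout interface) shows
the allocator waking the pageout daemon while a handful of pages are still
free, rather than after the pool has run dry:

    /*
     * Toy model (user-space C, invented names) of waking the pageout
     * daemon a few pages BEFORE the free pool is empty, so that its
     * laundering can overlap with the processes still allocating.
     */
    #include <stdio.h>

    static unsigned long free_pages = 100;
    static const unsigned long wakeup_margin = 8;  /* "a few pages early" */
    static int pageout_awake;

    static void
    pageout_wakeup(void)
    {
        if (!pageout_awake) {
            pageout_awake = 1;
            printf("pageout daemon woken with %lu pages still free\n",
                   free_pages);
        }
    }

    /* Allocation path: the wakeup happens before the pool runs dry. */
    static int
    alloc_page(void)
    {
        if (free_pages <= wakeup_margin)
            pageout_wakeup();
        if (free_pages == 0)
            return -1;            /* this is where a caller would block */
        free_pages--;
        return 0;
    }

    int
    main(void)
    {
        while (alloc_page() == 0)
            ;                     /* consume pages until the pool is empty */
        printf("allocator blocked at %lu free pages\n", free_pages);
        return 0;
    }

With a margin of eight pages the wakeup lands while memory is still
available, which is the overlap being described; with a margin of zero, the
daemon would only start once allocations already have to block.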