From owner-freebsd-hackers Sat Sep 29 4:57: 0 2001
Delivered-To: freebsd-hackers@freebsd.org
Received: from milliways.chance.ru (milliways.chance.ru [195.190.107.35]) by hub.freebsd.org (Postfix) with ESMTP id CE4D437B40E for ; Sat, 29 Sep 2001 04:56:49 -0700 (PDT)
Received: from do-labs.spb.ru (ppp-5.chance.ru [195.190.107.8]) by milliways.chance.ru (8.9.0/8.9.0) with SMTP id PAA16216 for ; Sat, 29 Sep 2001 15:56:40 +0400 (MSD)
Received: (qmail 652 invoked by uid 1000); 29 Sep 2001 15:59:41 -0000
Date: Sat, 29 Sep 2001 15:59:41 +0000
From: Vladimir Dozen
To: hackers@FreeBSD.org
Subject: VM: dynamic swap remapping (patch)
Message-ID: <20010929155941.A291@eix.do-labs.spb.ru>
Mime-Version: 1.0
Content-Type: text/plain; charset=us-ascii
Content-Disposition: inline
User-Agent: Mutt/1.2.4i
Sender: owner-freebsd-hackers@FreeBSD.ORG
Precedence: bulk
List-ID:
List-Archive: (Web Archive)
List-Help: (List Instructions)
List-Subscribe:
List-Unsubscribe:
X-Loop: FreeBSD.ORG

ehlo.

(Sorry for the long pre-history; I believe it is necessary.)

My current employer develops large CORBA-based data mining servers. They usually run under HP-UX but, following the current fashion of building processing farms, I was assigned to build a version for free unices. The initial platform was Linux, and the build itself went smoothly, but very soon we hit a problem: we use pthreads; to be more precise, we use a thread-per-client model. This means that at any one time we may be computing anywhere from one to a few tens of client sessions. Each session may eat as much as 1G of address space, and even more (actually, there are no limits except hardware ones).

The problem was how Linux (and FreeBSD, as we soon discovered) treats an out-of-memory (OOM) situation.
Under HP-UX memory is precommitted (i.e., swap is reserved for every allocated page), so as soon as we get into OOM, malloc() or operator new() returns NULL or throws an exception. That gives us the opportunity to unwind the stack, tell the client we cannot perform his request right now and, most importantly, continue executing other clients' requests. Linux and FreeBSD simply killed our whole process, and we had no chance to learn that we were out of memory! All the data of all our clients (some of them had been processing for days) was lost. :(((((

Very unfriendly and, what may be more important, this kind of interaction (the absence of it, really) between OS and application reduces the chances of porting really large applications to FreeBSD, because no one can trust an OS that can simply trash user data with no warning. It seems to me that the OS must use every chance to continue executing an application instead of killing it. I do think it is the Right Way.

I have written a patch that modifies the behaviour (have I spelled this word right? ;) of the VM when we are out of memory. Instead of killing the largest process, we remap parts of its address space onto temporary files (exactly as HP-UX does when swapping into a directory is turned on). Of course, we cannot do this when we are absolutely out of swap, so we do it a bit earlier, when the swap daemon finds that free swap has dropped to nswap_lowat.

I called this patch OOM Keeper, as opposed to the OOM Killer used in Linux (yah, I prefer BSD).

Here is the general algorithm:

1. The swap daemon finds vm_swap_size < nswap_lowat; it calls vm_oomkeeper_swap_almost_full();
2. vm_oomkeeper_swap_almost_full() searches for the process having the largest vm_object of type OBJT_SWAP, and sends it a signal (proposed name: SIGXMEM).
3. The process gets the signal and calls a special syscall (proposed name: remap).
4. (We are in the kernel again; this time curproc is our big process, in vm_oomkeeper_process.)
   While free swap blocks are below nswap_hiwat, we do the following:
   a) find the largest object of type OBJT_SWAP in the current process
   b) create a temporary file and unlink() it
   c) save the first 1M of the object into the file
   d) cut the first 1M of the map (this is where we get free swap blocks back)
   e) mmap the file onto the place where the data was before.

If any of the above fails, the old killproc() will trigger, so the system will still be able to drop buggy processes.

Note: the process now has a chance to do something in an OOM situation. It can simply ignore the signal, and it will be killed soon. It can call remap(), and it will be remapped onto files; this will slow things down, but will allow processing to continue. It can free some space (e.g., by unmapping anonymous mmaps). Finally, it can save its current data and terminate, if none of the above is acceptable.

Note also that ulimits and quotas remain in effect, since the files are created under the process's credentials.

This patch was tested on my home PC with 64M RAM and 64M swap; I was able to run processes with _committed_ address space of up to 512M in various scenarios: large malloc then commit, small incremental mallocs with immediate commit, random commit, parallel runs of two or three such memory eaters, etc. No doubt it requires additional testing.

The patch lives entirely in a separate file, vm_oomkeeper.c, and it requires only a single intrusion point in the current code: one added line in swap_pager.c:swp_sizechk(). But to implement it fully, I have to add a new signal and a new syscall to the system. I do not want to go that far until I know whether my patch is acceptable to the FreeBSD team.

To make it fully controllable, it would also be useful to be able to set nswap_{hi,lo}wat via the sysctl interface. In any case, when using OOMK these two should be raised by about 4 to 8 times (from 400K to 2-4M).

It would also be valuable if the default action for SIGXMEM were not SIG_IGN but a call to remap(). This requires patching libc.
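
Steps b) to e) of the algorithm can be approximated in user space, which may help when reasoning about what the proposed remap() would do to a region. The following is only a sketch under stated assumptions, not the code from vm_oomkeeper.c: the function name remap_chunk_to_file(), the "oomk.XXXXXX" file name, and the fallback to /tmp are hypothetical, and the directory lookup follows the REMAPDIR environment variable idea proposed in this post. The kernel patch works on vm_objects of type OBJT_SWAP, whereas this sketch works on a page-aligned address range.

```c
#define _DEFAULT_SOURCE   /* glibc: expose mkstemp() and MAP_ANONYMOUS in strict modes */
#include <assert.h>
#include <errno.h>
#include <stdint.h>
#include <stdio.h>
#include <stdlib.h>
#include <sys/mman.h>
#include <sys/types.h>
#include <unistd.h>

#define CHUNK (1024 * 1024)   /* 1M, the step size described in the post */

/*
 * Hypothetical user-space analogue of steps b) to e): move `len` bytes of
 * anonymous memory at `addr` onto an unlinked temporary file.
 * Returns 0 on success; on failure returns -1 and leaves the data in place.
 */
static int remap_chunk_to_file(void *addr, size_t len)
{
    long page = sysconf(_SC_PAGESIZE);
    /* MAP_FIXED requires a page-aligned address. */
    assert(page > 0 && (uintptr_t)addr % (uintptr_t)page == 0);

    /* Assumption: REMAPDIR selects the directory, /tmp otherwise. */
    const char *dir = getenv("REMAPDIR");
    char path[1024];
    int n = snprintf(path, sizeof(path), "%s/oomk.XXXXXX", dir ? dir : "/tmp");
    if (n < 0 || (size_t)n >= sizeof(path))
        return -1;

    int fd = mkstemp(path);           /* b) create a temporary file ... */
    if (fd < 0)
        return -1;
    unlink(path);                     /* ... and unlink() it immediately */

    /* c) save the chunk into the file, handling short writes */
    const char *p = addr;
    size_t left = len;
    while (left > 0) {
        ssize_t w = write(fd, p, left);
        if (w < 0) {
            if (errno == EINTR)
                continue;
            close(fd);
            return -1;
        }
        p += w;
        left -= (size_t)w;
    }

    /*
     * d) + e) MAP_FIXED replaces the old anonymous pages with the file
     * mapping in one call, so the old pages (and their swap) are released.
     */
    void *m = mmap(addr, len, PROT_READ | PROT_WRITE,
                   MAP_SHARED | MAP_FIXED, fd, 0);
    close(fd);                        /* the mapping keeps the file alive */
    return m == MAP_FAILED ? -1 : 0;
}
```

A caller would pass a page-aligned region, for example one obtained from an anonymous mmap(), and keep using the same pointer afterwards. MAP_SHARED is used here so that dirty pages are written back to the file rather than to swap; since the file is already unlinked, its space is reclaimed once the mapping goes away.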

A special environment variable ($REMAPDIR) might be used to set the location of the temporary files.

I can send vm_oomkeeper.c on request (it is 12K long, and I do not want to post it to the mailing list without permission).

Comments?

--
dozen @ home

To Unsubscribe: send mail to majordomo@FreeBSD.org
with "unsubscribe freebsd-hackers" in the body of the message