From owner-freebsd-hackers  Sat Sep 29  4:57: 0 2001
Delivered-To: freebsd-hackers@freebsd.org
Received: from milliways.chance.ru (milliways.chance.ru [195.190.107.35])
	by hub.freebsd.org (Postfix) with ESMTP id CE4D437B40E
	for <hackers@FreeBSD.org>; Sat, 29 Sep 2001 04:56:49 -0700 (PDT)
Received: from do-labs.spb.ru (ppp-5.chance.ru [195.190.107.8])
	by milliways.chance.ru (8.9.0/8.9.0) with SMTP id PAA16216
	for <hackers@FreeBSD.org>; Sat, 29 Sep 2001 15:56:40 +0400 (MSD)
Received: (qmail 652 invoked by uid 1000); 29 Sep 2001 15:59:41 -0000
Date: Sat, 29 Sep 2001 15:59:41 +0000
From: Vladimir Dozen <vladimir-dozen@mail.ru>
To: hackers@FreeBSD.org
Subject: VM: dynamic swap remapping (patch)
Message-ID: <20010929155941.A291@eix.do-labs.spb.ru>
Mime-Version: 1.0
Content-Type: text/plain; charset=us-ascii
Content-Disposition: inline
User-Agent: Mutt/1.2.4i
Sender: owner-freebsd-hackers@FreeBSD.ORG
Precedence: bulk
X-Loop: FreeBSD.ORG

ehlo.

  (Sorry for the long pre-history; I believe it is necessary.)

  My current employer develops large CORBA-based data mining servers.
  They usually run under HP-UX, but, following the current fashion
  for building processing farms, I was tasked with building a version
  for free Unices. The initial platform was Linux, and the build itself
  went smoothly, but very soon we hit a problem: we use pthreads; to be
  more precise, we use a thread-per-client model. This means that at any
  given time we may be computing anywhere from one to a few tens of
  client sessions. Each session may eat as much as 1G of address space,
  or even more (actually, there are no limits except hardware ones).

  The problem was how Linux (and FreeBSD, as we soon discovered) treats
  an out-of-memory (OOM) situation.
  
  Under HP-UX memory is precommitted (i.e., swap is reserved for every
  allocated page), so as soon as we hit OOM, malloc() returns NULL or
  operator new throws an exception. That gives us the opportunity to
  unwind the stack, tell the client we cannot perform his request right
  now and, most importantly, continue executing other clients' requests.
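
  The per-request recovery described above can be sketched in C as
  follows. This is only an illustration, not code from our servers: the
  names struct session, session_begin() and session_end() are made up
  for this example. Note that it only helps on a system that really
  reports allocation failure; on a system that overcommits, malloc()
  may succeed and the process may still be killed later.

```c
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical per-client working set (names invented for this sketch). */
struct session {
    char  *work;
    size_t size;
};

/*
 * Try to allocate the working set for one client session.
 * Returns 0 on success, -1 if memory is not available. On failure only
 * this session is refused; the caller can report the error to the
 * client while other sessions keep running.
 */
int session_begin(struct session *s, size_t bytes)
{
    s->work = malloc(bytes);
    if (s->work == NULL) {
        s->size = 0;
        fprintf(stderr, "session: cannot allocate %zu bytes, request refused\n",
                bytes);
        return -1;
    }
    memset(s->work, 0, bytes);  /* touch the pages so they are really used */
    s->size = bytes;
    return 0;
}

/* Release the working set of a session. */
void session_end(struct session *s)
{
    free(s->work);
    s->work = NULL;
    s->size = 0;
}
```

  A thread serving one client would call session_begin(), and on -1
  unwind and send an error reply instead of taking the whole process
  down.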

  Linux and FreeBSD simply killed our whole process, and we had no
  chance to learn that we were out of memory! All the data of all our
  clients (some of whom had been processing for days) was lost. :(((((

  This is very unfriendly and, more importantly, this kind of
  interaction (the absence of it, really) between OS and application
  reduces the chances of porting really large applications to FreeBSD,
  because no one can trust an OS that may simply trash user data with
  no warning.

  It seems to me that the OS must use every chance to continue executing
  an application instead of killing it. I do think it is the Right Way.

  I have written a patch that modifies the behaviour (have I spelled
  this word right? ;) of the VM when we are out of memory. Instead of
  killing the largest process, we remap parts of its address space onto
  temporary files (exactly as HP-UX does when swapping into a directory
  is turned on). Of course, we cannot do this when we are absolutely out
  of swap, so we do it a bit earlier, when the swap daemon finds that
  the number of free swap pages has dropped to nswap_lowat.

  I called this patch OOM Keeper, as opposed to the OOM Killer used in
  Linux (yah, I prefer BSD).

  Here is the general algorithm:

  1. The swap daemon finds vm_swap_size < nswap_lowat; it calls
     vm_oomkeeper_swap_almost_full();
  2. vm_oomkeeper_swap_almost_full() searches for the process that has
     the largest vm_object of type OBJT_SWAP, and sends it a signal
     (proposed name: SIGXMEM).
  3. The process gets the signal and calls a special syscall (proposed
     name: remap).
  4. (We are in the kernel again, this time curproc is our big process,
      in vm_oomkeeper_process.)
     While the number of free swap blocks is lower than nswap_hiwat, we
     do the following:
       a) find the largest object of type OBJT_SWAP in the current process
       b) create a temporary file and unlink() it
       c) save the first 1M of the object into the file
       d) cut the first 1M of the map (here we can get free swap blocks)
       e) mmap the file onto the place where the data was before.

  If any of the above fails, the old killproc() will trigger, so the
  system will still be able to drop buggy processes.
  
  Note: the process now has a chance to do something in an OOM
  situation. It can simply ignore the signal, and it will be killed
  soon. It can call remap(), and it will be remapped onto files -- this
  will slow things down, but will allow processing to continue. It can
  free some space (e.g., by unmapping an anonymous mmap). Finally, it
  can save its current data and terminate, if none of the above is
  acceptable.
  
  Note also that ulimits and quotas remain in effect, since the files
  are created under the process's credentials.
  
  This patch was tested on my home PC with 64M RAM and 64M swap; I was
  able to run processes with _committed_ address space up to 512M
  in various scenarios: a large malloc then commit, small incremental
  mallocs with immediate commit, random commit, parallel runs of
  two or three such memory eaters, etc. No doubt, it requires
  additional testing.

  The patch is contained entirely in a separate file -- vm_oomkeeper.c,
  and it requires only a single intrusion point in the current code --
  adding a single line to swap_pager.c:swp_sizechk().

  But, to fully implement it, I have to add a new signal and a new
  syscall to the system. I do not want to go that far until I know
  whether my patch is acceptable to the FreeBSD team.

  To make it fully controllable, it would also be useful to be able to
  set nswap_{hi,lo}wat via the sysctl interface. In any case, when using
  OOMK these two should be raised about 4 to 8 times (from 400K to 2-4M).
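
  The role of the two watermarks can be illustrated with a small C
  model. It is a simplification, not the kernel logic: the function
  chunks_to_remap() is invented for the example, sizes are taken in
  bytes, and it optimistically assumes that every remapped 1M chunk
  gives back a full 1M of swap.

```c
#include <stddef.h>

#define REMAP_CHUNK (1024UL * 1024UL)   /* 1M per step, as in the algorithm */

/*
 * Model of the low/high watermark hysteresis: nothing happens while
 * free swap is at or above `lowat`; once it drops below, chunks are
 * remapped until free swap reaches `hiwat`.
 * Returns the number of 1M chunks that would be remapped.
 */
size_t chunks_to_remap(size_t free_swap, size_t lowat, size_t hiwat)
{
    size_t chunks = 0;

    if (free_swap >= lowat)
        return 0;                   /* swap daemon does not trigger */

    while (free_swap < hiwat) {
        free_swap += REMAP_CHUNK;   /* assumed gain per remapped chunk */
        chunks++;
    }
    return chunks;
}
```

  With lowat = 2M and hiwat = 4M, a process caught at 1M of free swap
  would have three chunks remapped before the loop stops. The gap
  between the two values is what keeps the mechanism from firing again
  right away.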

  It would also be valuable if the default action for SIGXMEM were not
  SIG_IGN but a call to remap(). This requires patching libc. A special
  environment variable ($REMAPDIR) might be used to set the location of
  the temporary files.
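
  Picking the directory from $REMAPDIR could be as simple as the C
  sketch below. The function name remap_dir() and the fallback to /tmp
  are my assumptions for the example; nothing of this exists in libc.

```c
#include <stdlib.h>

/*
 * Return the directory for remap files: $REMAPDIR if it is set and
 * non-empty, otherwise "/tmp" (the fallback is an assumption).
 */
const char *remap_dir(void)
{
    const char *dir = getenv("REMAPDIR");

    if (dir != NULL && dir[0] != '\0')
        return dir;
    return "/tmp";
}
```

  An administrator could then point big processes at a roomy
  filesystem, e.g. by starting them with REMAPDIR=/var/tmp.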
  
  I can send vm_oomkeeper.c on request (it is 12K long, and I do not
  want to post it to the mailing list without permission).

  Comments?
  
-- 
dozen @ home
