From owner-freebsd-hackers  Tue Oct 24 22:36:53 1995
Return-Path: owner-hackers
Received: (from root@localhost)
          by freefall.freebsd.org (8.6.12/8.6.6) id WAA28895
          for hackers-outgoing; Tue, 24 Oct 1995 22:36:53 -0700
Received: from ref.tfs.com (ref.tfs.com [140.145.254.251])
          by freefall.freebsd.org (8.6.12/8.6.6) with ESMTP id WAA28890
          for <hackers@FreeBSD.ORG>; Tue, 24 Oct 1995 22:36:51 -0700
Received: (from julian@localhost) by ref.tfs.com (8.6.12/8.6.12) id WAA17892; Tue, 24 Oct 1995 22:36:23 -0700
From: Julian Elischer <julian@ref.tfs.com>
Message-Id: <199510250536.WAA17892@ref.tfs.com>
Subject: Re: process migration
To: rcarter@geli.com (Russell L. Carter)
Date: Tue, 24 Oct 1995 22:36:22 -0700 (PDT)
Cc: terry@lambert.org, bugs@ns1.win.net, hackers@FreeBSD.ORG
In-Reply-To: <199510250500.WAA04762@geli.clusternet> from "Russell L. Carter" at Oct 24, 95 10:00:38 pm
X-Mailer: ELM [version 2.4 PL24]
Content-Type: text
Content-Length: 1923      
Sender: owner-hackers@FreeBSD.ORG
Precedence: bulk

> 
> > > 
> > > 
> > > There's also other problems:
> > > 
> > > 1)	File as swap store.  The executable file is acting as its own
> > > 	swap store; this means you must reopen the file (which means
> > > 	you need its name) and reestablish the flags on the vnode to
> > > 	prevent writes to it.
> > write the entire process space, including non-resident pages..
> > (implies that shared programs become static)
> > > 
> > > 2)	Memory overcommit.  There very well may not be enough swap
> > > 	to checkpoint the program.
> > put it out to a file.......
> 
> If overcommitted ignore it.  Too bad.
> 
> > > 
> > > 3)	Shared libraries.  The shared library mappings must be
> > > 	restored, probably separately.
> > static.. quite possibly this might be used in a specialist environment
> > (such as what Russell is working on), where shared libs might not be
> > required in any case
> 
> Righto.  Cray Research machines have been checkpointing fine for 10 years.  Of 
> course, they only swap and don't page (or didn't use to, I haven't played
> with the SPARC stuff).  Everything is statically linked.  Primitive
> model, works fine with the bulk of *their* workload.
> 
> Would work fine with my model too, as long as it just applied to user apps.
> 
> A great deal of effort is expended to protect users from themselves, but
> if they need checkpointing, the users often are very savvy about getting
> themselves on the boat.  That includes app developers too.
> 
> Note: this is for jobs that run a minimum of several days, sometimes
> weeks.
I could imagine it in the form of a signal that asks the process to
pack itself up..
the system supplies the tools to do so; the process just has to know which
things aren't savable.. then it does a foo() call, and when that returns it's
a week later.. and it resumes everything it had stopped..

not so good for random processes, but good for specialist stuff
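
To make the idea concrete, here is a rough sketch in C of what the
cooperative version might look like. Everything in it is an assumption
for illustration: the choice of SIGUSR1, the struct job_state layout, and
the names save_state(), restore_state() and run_job() are all hypothetical,
and it only saves state the application explicitly declares savable
(no open files, no mappings, no shared libraries).

```c
#ifndef _POSIX_C_SOURCE
#define _POSIX_C_SOURCE 200809L   /* for sigaction() under strict -std= modes */
#endif
#include <assert.h>
#include <signal.h>
#include <stdio.h>
#include <string.h>

/* Hypothetical application state: only what the app declares savable. */
struct job_state {
    long   iteration;
    double accumulator;
};

/* Set by the signal handler, polled by the main loop. */
static volatile sig_atomic_t checkpoint_requested = 0;

static void on_checkpoint_signal(int signo)
{
    (void)signo;
    checkpoint_requested = 1;     /* only async-signal-safe work here */
}

/* Ask to be told when to pack up. SIGUSR1 is an arbitrary choice. */
int install_checkpoint_handler(void)
{
    struct sigaction sa;
    memset(&sa, 0, sizeof(sa));
    sa.sa_handler = on_checkpoint_signal;
    sigemptyset(&sa.sa_mask);
    return sigaction(SIGUSR1, &sa, NULL);
}

/* Write the savable state to a file. Returns 0 on success, -1 on error. */
int save_state(const char *path, const struct job_state *st)
{
    FILE *fp = fopen(path, "wb");
    if (fp == NULL)
        return -1;
    size_t n = fwrite(st, sizeof(*st), 1, fp);
    if (fclose(fp) != 0 || n != 1)
        return -1;
    return 0;
}

/* Read the state back, possibly "a week later". Returns 0 or -1. */
int restore_state(const char *path, struct job_state *st)
{
    FILE *fp = fopen(path, "rb");
    if (fp == NULL)
        return -1;
    size_t n = fread(st, sizeof(*st), 1, fp);
    fclose(fp);
    return n == 1 ? 0 : -1;
}

/*
 * Run the job until max_iter. Returns 0 when finished, 1 if it stopped
 * early and saved a checkpoint, -1 if the checkpoint could not be written.
 */
int run_job(struct job_state *st, long max_iter, const char *path)
{
    assert(st != NULL);
    while (st->iteration < max_iter) {
        if (checkpoint_requested) {
            checkpoint_requested = 0;
            return save_state(path, st) == 0 ? 1 : -1;
        }
        st->accumulator += (double)st->iteration;  /* stand-in for real work */
        st->iteration++;
    }
    return 0;
}
```

On restart the program would call restore_state() before run_job() and
carry on from the saved iteration. The raw struct dump is only valid on
the same architecture and build, which is probably acceptable for the
specialist case but not for anything general.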

> 
> Regards,
> Russell
> 
> 
>