From owner-freebsd-hackers  Tue Oct 24 22:04:40 1995
Message-Id: <199510250500.WAA04762@geli.clusternet>
X-Authentication-Warning: geli.clusternet: Host localhost didn't use HELO protocol
X-Mailer: exmh version 1.6.4 10/10/95
To: Julian Elischer <julian@ref.tfs.com>
cc: terry@lambert.org (Terry Lambert), bugs@ns1.win.net, hackers@FreeBSD.ORG
Subject: Re: process migration 
In-reply-to: Your message of "Tue, 24 Oct 1995 21:34:47 PDT."
             <199510250434.VAA16988@ref.tfs.com> 
Mime-Version: 1.0
Content-Type: text/plain; charset=us-ascii
Date: Tue, 24 Oct 1995 22:00:38 -0700
From: "Russell L. Carter" <rcarter@geli.com>
Sender: owner-hackers@FreeBSD.ORG
Precedence: bulk

> > There's also other problems:
> > 
> > 1)	File as swap store.  The executable file is acting as its own
> > 	swap store; this means you must reopen the file (which means
> > 	you need its name) and reestablish the flags on the vnode to
> > 	prevent writes to it.
> Write the entire process space, including non-resident pages
> (implies that shared programs become static).
> > 
> > 2)	Memory overcommit.  There very well may not be enough swap
> > 	to checkpoint the program.
> put it out to a file.......

If it's overcommitted, ignore it.  Too bad.

> > 
> > 3)	Shared libraries.  The shared library mappings must be
> > 	restored, probably separately.
> static.. quite possibly this might be used in a specialist environment
> (such as what Russell is working on), where shared libs might not be
> required in any case.

Righto.  Cray Research machines have been checkpointing fine for 10 years.
Of course, they only swap and don't page (or didn't use to; I haven't played
with the SPARC stuff).  Everything is statically linked.  It's a primitive
model, but it works fine with the bulk of *their* workload.

It would work fine with my model too, as long as it applied only to user apps.

A great deal of effort is expended to protect users from themselves, but
users who need checkpointing are often very savvy about getting themselves
on the boat.  That includes app developers too.

Note: this is for jobs that run a minimum of several days, sometimes
weeks.
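
For illustration only, here is a minimal sketch of what user-driven
checkpointing of a long-running job can look like when it is done at the
application level.  This is not the kernel-level process checkpointing
discussed in this thread (no address space, vnode, or shared library state
is saved); it only shows the "put it out to a file" idea for a job that
knows its own state.  Python is used for brevity, and the names
save_checkpoint, load_checkpoint, and run_job, as well as the toy summing
job, are hypothetical.

```python
import json
import os
import tempfile


def save_checkpoint(path, state):
    """Write state to path atomically: temp file, fsync, then rename."""
    directory = os.path.dirname(os.path.abspath(path))
    fd, tmp = tempfile.mkstemp(dir=directory, prefix=".ckpt-")
    try:
        with os.fdopen(fd, "w") as f:
            json.dump(state, f)
            f.flush()
            os.fsync(f.fileno())
        # The rename replaces the old checkpoint in one step, so a crash
        # leaves either the old file or the new one, never a partial file.
        os.replace(tmp, path)
    except BaseException:
        os.unlink(tmp)
        raise


def load_checkpoint(path, default):
    """Return the saved state, or default if no checkpoint exists yet."""
    try:
        with open(path) as f:
            return json.load(f)
    except FileNotFoundError:
        return default


def run_job(path, total_steps, checkpoint_every, stop_after=None):
    """Hypothetical resumable job: sums the integers 0..total_steps-1."""
    state = load_checkpoint(path, {"step": 0, "acc": 0})
    done_this_run = 0
    while state["step"] < total_steps:
        if stop_after is not None and done_this_run >= stop_after:
            return state  # simulate the machine going down mid-job
        state["acc"] += state["step"]
        state["step"] += 1
        done_this_run += 1
        if state["step"] % checkpoint_every == 0:
            save_checkpoint(path, state)
    save_checkpoint(path, state)
    return state
```

If the job is interrupted, running run_job again with the same path picks
up from the last saved step; at most checkpoint_every steps of work are
redone.  For jobs that run for days, that interval is the knob the user
tunes against the cost of writing the state out.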

Regards,
Russell