From owner-freebsd-hackers  Wed Mar 19 07:40:52 1997
Return-Path: <owner-hackers>
Received: (from root@localhost)
          by freefall.freebsd.org (8.8.5/8.8.5) id HAA26501
          for hackers-outgoing; Wed, 19 Mar 1997 07:40:52 -0800 (PST)
Received: from usr04.primenet.com (root@usr04.primenet.com [206.165.5.104])
          by freefall.freebsd.org (8.8.5/8.8.5) with ESMTP id HAA26496
          for <hackers@freebsd.org>; Wed, 19 Mar 1997 07:40:47 -0800 (PST)
Received: from primenet.com (root@mailhost02.primenet.com [206.165.5.53])
	by usr04.primenet.com (8.8.5/8.8.5) with ESMTP id IAA08929;
	Wed, 19 Mar 1997 08:40:35 -0700 (MST)
Received: from conceptual.com (consys.com [207.218.17.187])
	by primenet.com (8.8.5/8.8.5) with ESMTP id IAA25413;
	Wed, 19 Mar 1997 08:40:25 -0700 (MST)
Received: from conceptual.com (localhost [127.0.0.1]) by conceptual.com (8.8.5/8.6.9) with ESMTP id IAA26553; Wed, 19 Mar 1997 08:40:17 -0700 (MST)
Message-Id: <199703191540.IAA26553@conceptual.com>
X-Mailer: exmh version 2.0gamma 1/27/96
To: Mike Pritchard <mpp@freefall.freebsd.org>
cc: jkh@time.cdrom.com (Jordan K. Hubbard), hackers@freebsd.org
Subject: Re: dup3() - I've thought it over and decided... 
In-reply-to: Your message of "Wed, 19 Mar 1997 05:57:58 PST."
             <199703191357.FAA22301@freefall.freebsd.org> 
Mime-Version: 1.0
Content-Type: text/plain; charset=us-ascii
Date: Wed, 19 Mar 1997 08:40:17 -0700
From: "Russell L. Carter" <rcarter@consys.com>
Sender: owner-hackers@freebsd.org
X-Loop: FreeBSD.org
Precedence: bulk


> 
> As for Cray's implementation, yes, it allows you to create a complete
> snapshot of the process, process group, or session.  At this point you
> could either kill the proc/pgrp/session for later restart, or allow
> it to keep running and only use the snapshot in case of a system crash.
> I was involved in some work on this that allowed you to checkpoint the 
> process on one machine and then restart it on another for load leveling 
> purposes.
> 
> It was used mainly for checkpoint/restart of long running batch
> jobs submitted via NQS, but it was usable with interactive jobs
> to a degree.  There was on-going work for better interactive
> support when I left Cray (see below).

There are some other interesting things you can do with this if you have it.
Fault tolerant ORBs, for instance.  If you've got a mission-critical,
long-running app that is simple enough, you can periodically checkpoint to
reliable storage and restart on another compatible system with a minimum of
fuss should your first platform fail for any of a myriad of reasons.  Deep
Pockets that have things that sustain damage are funding stuff like this
right now :-)
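
To make the "periodically checkpoint to reliable storage" idea concrete, here
is a minimal application-level sketch in C.  It is not how Cray's kernel-level
facility works; it only shows an app saving its own state and reloading it
later.  The struct app_state and the checkpoint_save()/checkpoint_load() names
are made up for illustration, and dumping the raw struct assumes the restart
host has the same architecture and compiler layout (a "compatible system").

```c
#include <assert.h>
#include <stdio.h>
#include <unistd.h>

/* Hypothetical application state; a real app would serialize whatever
 * it needs in order to resume. */
struct app_state {
    unsigned long iteration;
    double        partial_sum;
};

/* Write the state to "<path>.tmp", force it to disk, then rename it over
 * <path>.  rename() is atomic on POSIX filesystems, so a crash in the
 * middle of a save leaves the previous checkpoint intact.
 * Returns 0 on success, -1 on failure. */
int checkpoint_save(const char *path, const struct app_state *st)
{
    char tmp[1024];
    FILE *fp;
    int n;

    assert(path != NULL && st != NULL);

    n = snprintf(tmp, sizeof tmp, "%s.tmp", path);
    if (n < 0 || (size_t)n >= sizeof tmp)
        return -1;                      /* path too long */

    if ((fp = fopen(tmp, "wb")) == NULL)
        return -1;

    if (fwrite(st, sizeof *st, 1, fp) != 1 ||
        fflush(fp) != 0 ||
        fsync(fileno(fp)) != 0) {
        fclose(fp);
        unlink(tmp);
        return -1;
    }
    if (fclose(fp) != 0 || rename(tmp, path) != 0) {
        unlink(tmp);
        return -1;
    }
    return 0;
}

/* Load the most recent checkpoint.
 * Returns 0 on success, -1 if no usable checkpoint exists. */
int checkpoint_load(const char *path, struct app_state *st)
{
    FILE *fp;
    int ok;

    assert(path != NULL && st != NULL);

    if ((fp = fopen(path, "rb")) == NULL)
        return -1;
    ok = (fread(st, sizeof *st, 1, fp) == 1);
    fclose(fp);
    return ok ? 0 : -1;
}
```

At startup the app would call checkpoint_load() and resume from the saved
iteration if it succeeds, then call checkpoint_save() every so often from its
main loop.  Putting the checkpoint file on storage that the standby machine
can also reach is what makes the restart-elsewhere part possible.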

I've spent part of the last month looking somewhat superficially into the
issues.  For SGIs there's something called Hibernator that sorta works.
Cray does appear to be the current state of the art.

Couple checkpointing/process migration with a queuing system like Codine that
understands distributed environments like ORBs, PVM, MPI, etc.,
and you have the potential for a pretty fault tolerant, distributed computing 
resource based mainly on off-the-shelf hardware.

That's for long-running apps; ISPs are a different problem.

-- 
Russell L. Carter

Voice:(520) 636-2600 FAX:(520) 636-2888          rcarter@consys.com
Conceptual Systems & Software,  P.O. Box 1129 Chino Valley AZ 86323
"Before sitting down, always look for ferrets."