From owner-freebsd-hackers Wed Mar 19 07:40:52 1997
Return-Path:
Received: (from root@localhost) by freefall.freebsd.org (8.8.5/8.8.5) id HAA26501 for hackers-outgoing; Wed, 19 Mar 1997 07:40:52 -0800 (PST)
Received: from usr04.primenet.com (root@usr04.primenet.com [206.165.5.104]) by freefall.freebsd.org (8.8.5/8.8.5) with ESMTP id HAA26496 for ; Wed, 19 Mar 1997 07:40:47 -0800 (PST)
Received: from primenet.com (root@mailhost02.primenet.com [206.165.5.53]) by usr04.primenet.com (8.8.5/8.8.5) with ESMTP id IAA08929; Wed, 19 Mar 1997 08:40:35 -0700 (MST)
Received: from conceptual.com (consys.com [207.218.17.187]) by primenet.com (8.8.5/8.8.5) with ESMTP id IAA25413; Wed, 19 Mar 1997 08:40:25 -0700 (MST)
Received: from conceptual.com (localhost [127.0.0.1]) by conceptual.com (8.8.5/8.6.9) with ESMTP id IAA26553; Wed, 19 Mar 1997 08:40:17 -0700 (MST)
Message-Id: <199703191540.IAA26553@conceptual.com>
X-Mailer: exmh version 2.0gamma 1/27/96
To: Mike Pritchard
cc: jkh@time.cdrom.com (Jordan K. Hubbard), hackers@freebsd.org
Subject: Re: dup3() - I've thought it over and decided...
In-reply-to: Your message of "Wed, 19 Mar 1997 05:57:58 PST." <199703191357.FAA22301@freefall.freebsd.org>
Mime-Version: 1.0
Content-Type: text/plain; charset=us-ascii
Date: Wed, 19 Mar 1997 08:40:17 -0700
From: "Russell L. Carter"
Sender: owner-hackers@freebsd.org
X-Loop: FreeBSD.org
Precedence: bulk

>
> As for Cray's implementation, yes, it allows you to create a complete
> snapshot of the process, process group, or session. At this point you
> could either kill the proc/pgrp/session for later restart, or allow
> it to keep running and only use the snapshot in case of a system crash.
> I was involved in some work on this that allowed you to checkpoint the
> process on one machine and then restart it on another for load leveling
> purposes.
>
> It was used mainly for checkpoint/restart of long running batch
> jobs submitted via NQS, but it was usable with interactive jobs
> to a degree.
> There was on-going work for better interactive
> support when I left Cray (see below).

There are some other interesting things you can do with this if you
have it. Fault-tolerant ORBs, for instance. If you've got a
mission-critical, long-running app that is simple enough, you can
periodically checkpoint to reliable storage and restart on another
compatible system with a minimum of fuss, should you hit any of a
myriad of problems with your first platform. Deep Pockets that have
things that sustain damage are funding stuff like this right now :-)

I've spent part of the last month looking somewhat superficially into
the issues. For SGIs there's something called Hibernator that sorta
works. Cray does appear to be the current state of the art.

Couple checkpointing/process migration with a queuing system like
Codine that understands distributed environments like ORBs, PVM, MPI,
etc., and you have the potential for a pretty fault-tolerant
distributed computing resource based mainly on off-the-shelf hardware.
For long-running apps, that is; ISPs are a different problem.

-- 
Russell L. Carter
Voice:(520) 636-2600 FAX:(520) 636-2888
rcarter@consys.com
Conceptual Systems & Software, P.O. Box 1129 Chino Valley AZ 86323
"Before sitting down, always look for ferrets."
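
The "periodically checkpoint to reliable storage and restart" idea above
can be illustrated with a rough application-level sketch. This is not the
kernel-level process snapshot that Cray provides; it assumes the app can
serialize its own state. The names save_checkpoint, load_checkpoint, and
run_job are hypothetical, and the job itself (summing integers) is only a
stand-in for real work.

```python
import os
import pickle
import tempfile


def save_checkpoint(path, state):
    """Write state to path atomically so a crash never leaves a torn file."""
    directory = os.path.dirname(os.path.abspath(path))
    fd, tmp_path = tempfile.mkstemp(dir=directory, prefix=".ckpt-")
    try:
        with os.fdopen(fd, "wb") as f:
            pickle.dump(state, f)
            f.flush()
            os.fsync(f.fileno())  # push the data to stable storage
        os.replace(tmp_path, path)  # atomic rename over the old checkpoint
    except BaseException:
        if os.path.exists(tmp_path):
            os.unlink(tmp_path)
        raise


def load_checkpoint(path, default):
    """Return the last saved state, or default if no checkpoint exists."""
    try:
        with open(path, "rb") as f:
            return pickle.load(f)
    except FileNotFoundError:
        return default


def run_job(path, total_steps, checkpoint_every, stop_after=None):
    """Hypothetical long-running job: sums 0..total_steps-1, resumable.

    stop_after simulates a crash after that many steps in this run.
    """
    state = load_checkpoint(path, {"step": 0, "acc": 0})
    done_this_run = 0
    while state["step"] < total_steps:
        if stop_after is not None and done_this_run >= stop_after:
            return None  # simulated failure; work since last checkpoint is lost
        state["acc"] += state["step"]
        state["step"] += 1
        done_this_run += 1
        if state["step"] % checkpoint_every == 0:
            save_checkpoint(path, state)
    save_checkpoint(path, state)
    return state["acc"]
```

If the checkpoint file lives on storage that a second compatible machine
can also reach, calling run_job there with the same path picks up from
the last saved step instead of starting over. Only the work done since
the most recent checkpoint is repeated.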