Message-Id: <199510250500.WAA04762@geli.clusternet>
To: Julian Elischer
cc: terry@lambert.org (Terry Lambert), bugs@ns1.win.net, hackers@FreeBSD.ORG
Subject: Re: process migration
In-reply-to: Your message of "Tue, 24 Oct 1995 21:34:47 PDT." <199510250434.VAA16988@ref.tfs.com>
Date: Tue, 24 Oct 1995 22:00:38 -0700
From: "Russell L. Carter"

> >
> > There's also other problems:
> >
> > 1) File as swap store. The executable file is acting as its own
> >    swap store; this means you must reopen the file (which means
> >    you need its name) and reestablish the flags on the vnode to
> >    prevent writes to it.
> write the entire process space including non resident pages..
> (implies that shared programs become static)
> >
> > 2) Memory overcommit. There very well may not be enough swap
> >    to checkpoint the program.
> put it out to a file....... If overcommitted ignore it. Too bad.
> >
> > 3) Shared libraries. The shared library mappings must be
> >    restored, probably separately.
> static..
> quite possibly this might be used in a specialist environment
> (such as what Russell is working on), where shared libs might not be
> required in any case

Righto. Cray Research machines have been checkpointing fine for 10 years.
Of course, they only swap and don't page (or didn't use to; I haven't
played with the SPARC stuff). Everything is statically linked. Primitive
model, works fine with the bulk of *their* workload. Would work fine with
my model too, as long as it just applied to user apps.

A great deal of effort is expended to protect users from themselves, but
if they need checkpointing, the users often are very savvy about getting
themselves on the boat. That includes app developers too. Note: this is
for jobs that run a minimum of several days, sometimes weeks.

Regards,
Russell