Date: Thu, 13 Jan 2005 21:05:16 -0500
From: Allan Fields
To: Brooks Davis
cc: freebsd-hackers@freebsd.org
cc: Siddharth Aggarwal
Subject: Re: process checkpoint restore facility now in DragonFly BSD
Message-ID: <20050114020516.GD26802@afields.ca>
In-Reply-To: <20050112214002.GA21038@odin.ac.hmc.edu>
References: <20050112214002.GA21038@odin.ac.hmc.edu>

On Wed, Jan 12, 2005 at 01:40:02PM -0800, Brooks Davis wrote:
> On Wed, Jan 12, 2005 at 02:17:38PM -0700, Siddharth Aggarwal wrote:
> >
> > I am responding to a post back in Oct 2003 when the checkpointing feature
> > was announced for DragonFly. I have been doing some research on this, and
> > have seen some projects that use Xen VMM to achieve checkpoints of guest
> > OSes.
> >
> > So I was looking for inputs from people as to what everyone feels about
> > checkpointing, whether it should be done at the physical machine level or
> > VM level. Pros and Cons of each approach, if any further development was
> > done on DragonFly for checkpoint since then and if it was stopped, why?
> > Are there serious limitations to checkpointing a physical machine?
> >
> > Sorry for such a vague posting, but I thought this would be a good
> > platform to get some feedback.
>
> The DragonFly lists would be the logical place to discuss DragonFly
> features.
>
> From my perspective as a scientific computing user, VM level
> checkpointing is of little use since I get the overhead of the VM and
> I can't easily do the application level checkpointing required to
> checkpoint distributed programs. There are probably a number of places
> where it is useful in scientific computing, but I don't find it to be
> all that interesting.

IMHO, it all depends on whether process checkpointing is made practical
and reliable enough to be employed for non-trivial programs. I'm not
entirely convinced that a single system checkpoint is the ultimate
answer, though that is certainly highly desirable.

One potential drawback with full system images is the lack of support
for runtime checkpoints (multiple process checkpoints) and the lack
of a framework for process migration and/or persistence of a subset
of the processes on a system.

Persistence is almost non-existent at all levels and sessioning is weak.
A whole solution is needed (integrating the two). The work thus far
shouldn't be brushed off so easily, as a multi-tiered approach could be
of benefit.

Each level of persistence offers its own pros and cons:
 - Scope & Granularity of operation (degrees of flexibility in
   specification, checkpoint set);
 - Storage options;
 - Interface;
 - Means of Coordination;
 - etc.

For process checkpoint:

The means to coordinate checkpoints and satisfy the order of dependency
between processes under checkpoint is a next step in the implementation
path.

Building on previous email:

* Process Checkpointing Support:
[..]
  An often overlooked application of process-level persistence is
  fault-tolerance. It might be possible to have a process survive an
  otherwise fatal system panic and/or hardware failure. [Without
  having to resume from a whole system checkpoint.]
[..]

> -- Brooks
>
> --
> Any statement of the form "X is the one, true Y" is FALSE.
> PGP fingerprint 655D 519C 26A7 82E7 2529 9BF0 5D8E 8BE9 F238 1AD4

--
Allan Fields, AFRSL - http://afields.ca
2D4F 6806 D307 0889 6125 C31D F745 0D72 39B4 5541