From: Terry Lambert
Date: Wed, 19 Jun 2002 12:42:03 -0700
To: Andrey Simonenko
Cc: freebsd-hackers@freebsd.org
Subject: Re: Is it possible to store process state and then restore process
Message-ID: <3D10DE8B.F259D0BA@mindspring.com>
References: <006401c21793$30721750$6d36120a@pm5149>

Andrey Simonenko wrote:
> Suppose there is a process, let this process doesn't have any
> childs, open sockets, it has one thread, etc. But this process
> can malloc() memory, open local files. Let's take very simple case.
>
> Is it possible to store process state to the file (i.e. say
> somehow the kernel to do this), and then after rebooting restore
> from the file this process back to system and continue executing it?
>
> I understand that it is not very simple, but I want to know if it is
> possible. Are there any problem with memory addressetion?

Do a web search on the two terms "checkpoint restart". You can also do a web search on the term "undump".
In a general sense, this won't work for any process that uses sockets, because the endpoint information will not be recoverable (in case you decide not to take the "simple case" in the future). It's possible to make it (mostly) recoverable, but that requires modifications, such as pausing the TCP stack after reboot until the checkpointed jobs that will be restarted have been recovered, since you don't want to be sending RST packets to the peers on network connections.

As a rule, most checkpoint and restart systems that you will find out there on the net when you run the search will also not support things like re-sharing descriptors among a set of processes that have used UNIX domain sockets to pass them, maintaining proper parent/child process relationships for things like SIGCHLD, etc. You should assume that anything you checkpoint will be restarted on another machine halfway around the planet, without any of the other local processes running. Anything having to do with pending outstanding operations (e.g. alarms, I/O, etc.) will require OS support to recover.

Since it's a lot simpler to restart a long running application *almost* where it left off, most of the useful and non-invasive packages you will find from your web search will try to do a periodic snapshot of process state, and restore it from the point of the last snapshot, not the point of failure. This will also lose any implied IPC state, so it's best if the application in question is written to open any resources, access them, close them, do the long term computation, and then open and write an output file only after it's done, rather than, say, holding the output file open. Otherwise, if the output file is written incrementally, you will likely end up with duplicate results.

For these reasons, and others I haven't mentioned, you will probably be most happy with "undump", unless you plan on doing a large project.
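
To make the periodic-snapshot approach concrete, here is a minimal sketch of the same idea done at the application level rather than by the kernel or a checkpoint package. It is not how "undump" or any particular checkpoint/restart system works. The names `save_snapshot`, `load_snapshot`, and `run`, the JSON snapshot format, and the `stop_after` parameter (which only simulates a failure) are all hypothetical, chosen for illustration. The program holds no resources open during the computation, snapshots its state every so often, resumes from the last snapshot, and opens the output file only once it is done.

```python
import json
import os
import tempfile


def save_snapshot(path, state):
    """Write state to path atomically: temp file, fsync, then rename."""
    directory = os.path.dirname(os.path.abspath(path))
    fd, tmp_path = tempfile.mkstemp(dir=directory, prefix=".snapshot-")
    try:
        with os.fdopen(fd, "w") as tmp:
            json.dump(state, tmp)
            tmp.flush()
            os.fsync(tmp.fileno())
        # Rename within one filesystem replaces the old snapshot in one step,
        # so a crash never leaves a half-written snapshot under `path`.
        os.replace(tmp_path, path)
    except BaseException:
        if os.path.exists(tmp_path):
            os.unlink(tmp_path)
        raise


def load_snapshot(path):
    """Return the last saved state, or a fresh one if no snapshot exists."""
    try:
        with open(path) as f:
            return json.load(f)
    except FileNotFoundError:
        return {"next_i": 0, "total": 0}


def run(limit, snapshot_path, output_path, snapshot_every=1000, stop_after=None):
    """Sum the squares of 0..limit-1, snapshotting periodically.

    stop_after (hypothetical) simulates a crash after that many iterations
    of this run. Returns the total, or None if the run was interrupted.
    """
    state = load_snapshot(snapshot_path)
    done_this_run = 0
    while state["next_i"] < limit:
        if stop_after is not None and done_this_run >= stop_after:
            return None  # simulated failure: work since last snapshot is lost
        i = state["next_i"]
        state["total"] += i * i
        state["next_i"] = i + 1
        done_this_run += 1
        if state["next_i"] % snapshot_every == 0:
            save_snapshot(snapshot_path, state)
    # The output file is opened and written only after the computation is
    # complete, so a restart cannot leave duplicate partial results in it.
    with open(output_path, "w") as out:
        out.write(f"{state['total']}\n")
    return state["total"]
```

If a run is interrupted, calling `run` again with the same snapshot path picks up at the last snapshot, so the work done between that snapshot and the failure is repeated. That is harmless here only because nothing was written to the output file in the meantime.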
-- Terry