From owner-freebsd-hackers@FreeBSD.ORG  Tue Feb 14 16:55:46 2012
Message-ID: <4F3A9266.9050905@freebsd.org>
Date: Tue, 14 Feb 2012 08:57:10 -0800
From: Julian Elischer <julian@freebsd.org>
User-Agent: Mozilla/5.0 (Macintosh; U; PPC Mac OS X 10.4; en-US;
	rv:1.9.2.26) Gecko/20120129 Thunderbird/3.1.18
MIME-Version: 1.0
To: Maninya M <maninya@gmail.com>
References: <CAC46K3mc=V=oBOQnvEp9iMTyNXKD1Ki_+D0Akm8PM7rdJwDF8g@mail.gmail.com>
In-Reply-To: <CAC46K3mc=V=oBOQnvEp9iMTyNXKD1Ki_+D0Akm8PM7rdJwDF8g@mail.gmail.com>
Content-Type: text/plain; charset=ISO-8859-1; format=flowed
Content-Transfer-Encoding: 7bit
Cc: freebsd-hackers@freebsd.org
Subject: Re: OS support for fault tolerance

On 2/14/12 6:23 AM, Maninya M wrote:
> For multicore desktop computers, suppose one of the cores fails, the
> FreeBSD OS crashes. My question is about how I can make the OS tolerate
> this hardware fault.
> The strategy is to checkpoint the state of each core at specific intervals
> of time in main memory. Once a core fails, its previous state is retrieved
> from the main memory, and the processes that were running on it are
> rescheduled on the remaining cores.
>
> I read that the OS tolerates faults in large servers. I need to make it do
> this for a Desktop OS. I assume I would have to change the scheduler
> program. I am using FreeBSD 9.0 on an Intel core i5 quad core machine.
> How do I go about doing this? What exactly do I need to save for the
> "state" of the core? What else do I need to know?
> I have absolutely no experience with kernel programming or with FreeBSD.
> Any pointers to good sources about modifying the source-code of FreeBSD
> would be greatly appreciated.
This question has always intrigued me, because I'm always amazed
that people actually try.
From my viewpoint, there's really not much you can do if the core
that is currently holding the scheduler lock fails.
And what do you mean by "fails"? Do you run constant diagnostics?
How do you tell when it has failed? It'd be hard to detect that 'multiply'
has suddenly started giving bad results now and then.
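
To make the diagnostics question concrete, here is a rough user-space
sketch of one kind of self-test: compute the product a second way
(shift-and-add) and compare it with the hardware multiply. The names
mul_shift_add and multiply_self_test are made up for illustration and are
not FreeBSD interfaces. Note that the check runs on the very core it is
testing, and an intermittent fault can pass it, which is exactly the
problem.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical helper: multiply using only shifts and adds (mod 2^32). */
static uint32_t mul_shift_add(uint32_t a, uint32_t b)
{
    uint32_t result = 0;

    while (b != 0) {
        if (b & 1u)
            result += a;    /* add the current shifted copy of a */
        a <<= 1;
        b >>= 1;
    }
    return result;
}

/*
 * Hypothetical self-test: true if the hardware multiply agrees with the
 * shift-and-add result for this one pair of inputs. Agreement on a few
 * inputs does not prove the multiplier is healthy.
 */
static bool multiply_self_test(uint32_t a, uint32_t b)
{
    uint32_t hw = a * b;    /* unsigned, so it wraps mod 2^32 */

    assert(sizeof(hw) == 4);
    return hw == mul_shift_add(a, b);
}
```

Something would have to call multiply_self_test() periodically with varying
inputs, and a passing result still says nothing about the adder, the
shifter or the comparison itself.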

If it just "stops" then you might be able to have a watchdog that
notices, but what do you do when it was halfway through rearranging
a list of items? First, you have to find out that it held
the lock for the module, and then you have to find out what it had
done and clean up the mess.

This requires rewriting many, many parts of the kernel to remove
"transient inconsistent states". And even then, what do you do if it
was halfway through manipulating some hardware?
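
As an illustration of a transient inconsistent state, here is a small
sketch of inserting into a circular doubly-linked list. The insert takes
four pointer writes; the made-up 'steps' parameter simulates a core that
stops partway through. The struct and function names are assumptions for
the example and have nothing to do with the kernel's own queue macros.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

struct node {
    struct node *prev;
    struct node *next;
};

/* An empty circular list is a head node that points at itself. */
static void list_init(struct node *head)
{
    head->prev = head;
    head->next = head;
}

/*
 * Insert n after pos, but perform only the first 'steps' of the four
 * pointer writes. steps == 4 is a complete insert; anything less models
 * a core that died in the middle.
 */
static void insert_after_partial(struct node *pos, struct node *n, int steps)
{
    assert(pos != NULL && n != NULL);
    if (steps >= 1) n->prev = pos;
    if (steps >= 2) n->next = pos->next;
    if (steps >= 3) pos->next->prev = n;
    if (steps >= 4) pos->next = n;
}

/* Walk forward from head and check that every next/prev pair agrees. */
static bool list_consistent(const struct node *head)
{
    const struct node *p = head;

    do {
        if (p->next->prev != p)
            return false;
        p = p->next;
    } while (p != head);
    return true;
}
```

After three of the four writes the forward and backward links disagree, so
anything that walks the list afterwards sees garbage. A checker like
list_consistent() can flag this one case, but every data structure in the
kernel would need its own, plus a way to repair what it finds.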

And when you've figured that all out, how do you cope with the
mess it made because it was dying?
Say, for example, it had started calculating bad memory offsets
before writing out some stuff, and had written data out over random memory?

But I'm interested in any answers people may have.
