From owner-freebsd-hackers@FreeBSD.ORG  Wed Feb 15 01:59:31 2012
Message-ID: <4F3B0BC7.4010804@gmail.com>
Date: Tue, 14 Feb 2012 19:35:03 -0600
From: Jim Bryant <kc5vdj.freebsd@gmail.com>
User-Agent: Thunderbird 2.0.0.24 (X11/20100911)
MIME-Version: 1.0
To: Julian Elischer <julian@freebsd.org>
References: <CAC46K3mc=V=oBOQnvEp9iMTyNXKD1Ki_+D0Akm8PM7rdJwDF8g@mail.gmail.com>
	<4F3A9266.9050905@freebsd.org>
In-Reply-To: <4F3A9266.9050905@freebsd.org>
Content-Type: text/plain; charset=ISO-8859-1; format=flowed
Content-Transfer-Encoding: 7bit
Cc: Maninya M <maninya@gmail.com>, freebsd-hackers@freebsd.org
Subject: Re: OS support for fault tolerance

Mirrored SMP?  Even NonStop systems require a supervisory CPU subsystem
to track what is working and what is not.

SMP itself would have to be totally rethought.

My suggestion is to study the examples of NonStop and Guardian-90.
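
To make the supervisory idea concrete, here is a minimal userspace sketch
in C. It is not FreeBSD kernel code and not how NonStop does it; the names
(struct supervisor, heartbeat, scan, NCPU) are all made up for illustration.
Each core bumps a counter while it is alive, and a supervisor flags any core
whose counter has not moved since the previous scan.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

#define NCPU 4  /* hypothetical core count, chosen for the example */

/* Hypothetical supervisor state: one heartbeat counter per core. */
struct supervisor {
    uint64_t beat[NCPU];      /* incremented by each core while it runs */
    uint64_t last_seen[NCPU]; /* counter value at the previous scan */
    bool     suspect[NCPU];   /* true if the core made no progress */
};

/* Called by core `cpu` from its periodic tick. */
void heartbeat(struct supervisor *s, size_t cpu)
{
    assert(cpu < NCPU);
    s->beat[cpu]++;
}

/*
 * Called by the supervisor. Marks every core whose counter did not
 * advance since the previous scan and returns how many were marked.
 */
size_t scan(struct supervisor *s)
{
    size_t stalled = 0;

    for (size_t i = 0; i < NCPU; i++) {
        s->suspect[i] = (s->beat[i] == s->last_seen[i]);
        if (s->suspect[i])
            stalled++;
        s->last_seen[i] = s->beat[i];
    }
    return stalled;
}
```

Note that this only notices a core that stops outright. It says nothing
about a core that keeps running but computes garbage, and it does nothing
about whatever locks or half-updated data the stopped core left behind.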

Julian Elischer wrote:
> On 2/14/12 6:23 AM, Maninya M wrote:
>> For multicore desktop computers, if one of the cores fails, the
>> FreeBSD OS crashes. My question is about how I can make the OS tolerate
>> this hardware fault.
>> The strategy is to checkpoint the state of each core at specific intervals
>> of time in main memory. Once a core fails, its previous state is retrieved
>> from the main memory, and the processes that were running on it are
>> rescheduled on the remaining cores.
>>
>> I read that the OS tolerates faults in large servers. I need to make it do
>> this for a Desktop OS. I assume I would have to change the scheduler
>> program. I am using FreeBSD 9.0 on an Intel Core i5 quad core machine.
>> How do I go about doing this? What exactly do I need to save for the
>> "state" of the core? What else do I need to know?
>> I have absolutely no experience with kernel programming or with FreeBSD.
>> Any pointers to good sources about modifying the source-code of FreeBSD
>> would be greatly appreciated.
> This question has always intrigued me, because I'm always amazed
> that people actually try.
> From my viewpoint, there's really not much you can do if the core
> that is currently holding the scheduler lock fails.
> And what do you mean by "fails"? Do you run constant diagnostics?
> How do you tell when it has failed? It'd be hard to detect that 'multiply'
> has suddenly started giving bad results now and then.
>
> If it just "stops" then you might be able to have a watchdog that
> notices, but what do you do when it was halfway through rearranging
> a list of items? First, you have to find out that it held
> the lock for the module, and then you have to find out what it had
> done and clean up the mess.
>
> This requires rewriting many, many parts of the kernel to remove
> "transient inconsistent states". And even then, what do you do if it
> was halfway through manipulating some hardware?
>
> And when you've figured that all out, how do you cope with the
> mess it made because it was dying?
> Say, for example, it had started calculating bad memory offsets
> before writing out some stuff, and had written data over random memory?
>
> But I'm interested in any answers people may have.
>
>> _______________________________________________
>> freebsd-hackers@freebsd.org mailing list
>> http://lists.freebsd.org/mailman/listinfo/freebsd-hackers
>> To unsubscribe, send any mail to 
>> "freebsd-hackers-unsubscribe@freebsd.org"
>>