From owner-freebsd-hackers@FreeBSD.ORG  Tue Feb 14 17:05:51 2012
Return-Path: <owner-freebsd-hackers@FreeBSD.ORG>
Delivered-To: freebsd-hackers@freebsd.org
Received: from mx1.freebsd.org (mx1.freebsd.org [IPv6:2001:4f8:fff6::34])
	by hub.freebsd.org (Postfix) with ESMTP id 43433106564A
	for <freebsd-hackers@freebsd.org>; Tue, 14 Feb 2012 17:05:51 +0000 (UTC)
	(envelope-from jhellenthal@gmail.com)
Received: from mail-gy0-f182.google.com (mail-gy0-f182.google.com
	[209.85.160.182])
	by mx1.freebsd.org (Postfix) with ESMTP id E93C68FC14
	for <freebsd-hackers@freebsd.org>; Tue, 14 Feb 2012 17:05:50 +0000 (UTC)
Received: by ghbg15 with SMTP id g15so172436ghb.13
	for <multiple recipients>; Tue, 14 Feb 2012 09:05:50 -0800 (PST)
DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=gmail.com; s=gamma;
	h=sender:date:from:to:cc:subject:message-id:references:mime-version
	:content-type:content-disposition:in-reply-to;
	bh=yG3Nr+tYYTLwVzk3Le5RdGhehzINRNMs3L7avxx6UaY=;
	b=yIbn3DeCFGc2OGwZcrr9kLYOGIX65YOgWoD/2np/RoXVc4GI80+b+dDLAPqmP6GBUU
	V0L7F1evitf4OBjZ9aUlk9pYGJeMW0Z1XlOyMx4IPULawIJ8TW5robkk8Lq56mc8xhGA
	U/TmHVDXTFn8Bl6era/Rneq1yiYTfo54xGpzw=
Received: by 10.50.188.234 with SMTP id gd10mr5474781igc.29.1329239150069;
	Tue, 14 Feb 2012 09:05:50 -0800 (PST)
Received: from DataIX.net (adsl-99-109-126-65.dsl.klmzmi.sbcglobal.net.
	[99.109.126.65])
	by mx.google.com with ESMTPS id k3sm20385580igq.1.2012.02.14.09.05.47
	(version=TLSv1/SSLv3 cipher=OTHER);
	Tue, 14 Feb 2012 09:05:48 -0800 (PST)
Sender: Jason Hellenthal <jhellenthal@gmail.com>
Received: from DataIX.net (localhost [127.0.0.1])
	by DataIX.net (8.14.5/8.14.5) with ESMTP id q1EH5iKd062171
	(version=TLSv1/SSLv3 cipher=DHE-RSA-AES256-SHA bits=256 verify=NO);
	Tue, 14 Feb 2012 12:05:44 -0500 (EST)
	(envelope-from jhell@DataIX.net)
Received: (from jhell@localhost)
	by DataIX.net (8.14.5/8.14.5/Submit) id q1EH5Yrh057087;
	Tue, 14 Feb 2012 12:05:34 -0500 (EST)
	(envelope-from jhell@DataIX.net)
Date: Tue, 14 Feb 2012 12:05:34 -0500
From: Jason Hellenthal <jhell@DataIX.net>
To: Julian Elischer <julian@freebsd.org>
Message-ID: <20120214170533.GA35819@DataIX.net>
References: <CAC46K3mc=V=oBOQnvEp9iMTyNXKD1Ki_+D0Akm8PM7rdJwDF8g@mail.gmail.com>
	<4F3A9266.9050905@freebsd.org>
MIME-Version: 1.0
Content-Type: multipart/signed; micalg=pgp-sha1;
	protocol="application/pgp-signature"; boundary="wac7ysb48OaltWcw"
Content-Disposition: inline
In-Reply-To: <4F3A9266.9050905@freebsd.org>
Cc: Maninya M <maninya@gmail.com>, freebsd-hackers@freebsd.org
Subject: Re: OS support for fault tolerance
X-BeenThere: freebsd-hackers@freebsd.org
X-Mailman-Version: 2.1.5
Precedence: list
List-Id: Technical Discussions relating to FreeBSD
	<freebsd-hackers.freebsd.org>
List-Unsubscribe: <http://lists.freebsd.org/mailman/listinfo/freebsd-hackers>, 
	<mailto:freebsd-hackers-request@freebsd.org?subject=unsubscribe>
List-Archive: <http://lists.freebsd.org/pipermail/freebsd-hackers>
List-Post: <mailto:freebsd-hackers@freebsd.org>
List-Help: <mailto:freebsd-hackers-request@freebsd.org?subject=help>
List-Subscribe: <http://lists.freebsd.org/mailman/listinfo/freebsd-hackers>,
	<mailto:freebsd-hackers-request@freebsd.org?subject=subscribe>
X-List-Received-Date: Tue, 14 Feb 2012 17:05:51 -0000


--wac7ysb48OaltWcw
Content-Type: text/plain; charset=us-ascii
Content-Disposition: inline
Content-Transfer-Encoding: quoted-printable



On Tue, Feb 14, 2012 at 08:57:10AM -0800, Julian Elischer wrote:
> On 2/14/12 6:23 AM, Maninya M wrote:
> > For multicore desktop computers, if one of the cores fails, the
> > FreeBSD OS crashes. My question is about how I can make the OS tolerate
> > this hardware fault.
> > The strategy is to checkpoint the state of each core at specific
> > intervals of time in main memory. Once a core fails, its previous state
> > is retrieved from the main memory, and the processes that were running
> > on it are rescheduled on the remaining cores.
> >
> > I read that the OS tolerates faults in large servers. I need to make
> > it do this for a desktop OS. I assume I would have to change the
> > scheduler program. I am using FreeBSD 9.0 on an Intel Core i5 quad-core
> > machine.
> > How do I go about doing this? What exactly do I need to save for the
> > "state" of the core? What else do I need to know?
> > I have absolutely no experience with kernel programming or with FreeBSD.
> > Any pointers to good sources about modifying the source-code of FreeBSD
> > would be greatly appreciated.
> This question has always intrigued me, because I'm always amazed
> that people actually try.
> From my viewpoint, there's really not much you can do if the core
> that is currently holding the scheduler lock fails.
> And what do you mean by 'fails'? Do you run constant diagnostics?
> How do you tell when it has failed? It'd be hard to detect that
> 'multiply' has suddenly started giving bad results now and then.
>
> If it just 'stops', then you might be able to have a watchdog that
> notices, but what do you do when it was halfway through rearranging
> a list of items? First, you have to find out that it held
> the lock for the module, and then you have to find out what it had
> done and clean up the mess.
>
> This requires rewriting many, many parts of the kernel to remove
> 'transient inconsistent states'. And even then, what do you do if it
> was halfway through manipulating some hardware?
>
> And when you've figured that all out, how do you cope with the
> mess it made because it was dying?
> Say, for example, it had started calculating bad memory offsets
> before writing out some stuff and had written data over random memory?
>
> But I'm interested in any answers people may have.
>

How about core redundancy? Effectively, this would cut the number of
available cores in half if you spread a process to run on two cores at
the same time, but with an option to adjust this per process, etc. I
don't see it as infeasible.

--=20
;s =3D;

--wac7ysb48OaltWcw
Content-Type: application/pgp-signature

-----BEGIN PGP SIGNATURE-----

iQEcBAEBAgAGBQJPOpRdAAoJEJBXh4mJ2FR+2qQH+QHC6q978koqM5Cilt7/9a1Q
ms4mTFLqzWpy/5FXbZxlhh1xbt0HeUpfIJt1r0FZ10dkLnVYaZUTPLQCTtNTopn3
+0YmolcYkxI8OaLSQhwN7It34BNAOPmjAOvgXNuwXmRhYR+L+bezGYZ15SVbuD3D
3odgtcGp/lbVeqvD8Hm6V0Zo5Qw6z2CkbZc3Rs8bzU1WI1rUWb73x0HwrgKm0kJJ
c9lT8GltiUY8ubXHlo1CqkUX+LL+WZWEtmARk+47aD1x9M/9r52T7ZlemIYvJH7K
H8rhbJX6Lz3CzeGjfSgOojiV5DTza8IPJbaoFsxmtEyQAf973ohESk5fabWeFzM=
=xF05
-----END PGP SIGNATURE-----

--wac7ysb48OaltWcw--