From owner-freebsd-hackers@FreeBSD.ORG  Tue Feb 14 17:21:19 2012
In-Reply-To: <4F3A9266.9050905@freebsd.org>
References: <CAC46K3mc=V=oBOQnvEp9iMTyNXKD1Ki_+D0Akm8PM7rdJwDF8g@mail.gmail.com>
	<4F3A9266.9050905@freebsd.org>
Date: Tue, 14 Feb 2012 09:21:19 -0800
X-Google-Sender-Auth: RN5LVLeEPUTuPydYVTUfU6YI99g
Message-ID: <CAMBSHm_smeLhh4enPyGOGnNmd_DYYSe7ZUvZrdcFsx57p7Simw@mail.gmail.com>
From: mdf@FreeBSD.org
To: Julian Elischer <julian@freebsd.org>
Content-Type: text/plain; charset=ISO-8859-1
Content-Transfer-Encoding: quoted-printable
Cc: Maninya M <maninya@gmail.com>, freebsd-hackers@freebsd.org
Subject: Re: OS support for fault tolerance

On Tue, Feb 14, 2012 at 8:57 AM, Julian Elischer <julian@freebsd.org> wrote:
> On 2/14/12 6:23 AM, Maninya M wrote:
>>
>> For multicore desktop computers, suppose one of the cores fails, the
>> FreeBSD OS crashes. My question is about how I can make the OS tolerate
>> this hardware fault.
>> The strategy is to checkpoint the state of each core at specific intervals
>> of time in main memory. Once a core fails, its previous state is retrieved
>> from the main memory, and the processes that were running on it are
>> rescheduled on the remaining cores.
>>
>> I read that the OS tolerates faults in large servers. I need to make it do
>> this for a Desktop OS. I assume I would have to change the scheduler
>> program. I am using FreeBSD 9.0 on an Intel core i5 quad core machine.
>> How do I go about doing this? What exactly do I need to save for the
>> "state" of the core? What else do I need to know?
>> I have absolutely no experience with kernel programming or with FreeBSD.
>> Any pointers to good sources about modifying the source-code of FreeBSD
>> would be greatly appreciated.
>
> This question has always intrigued me, because I'm always amazed
> that people actually try.
> From my viewpoint, there's really not much you can do if the core
> that is currently holding the scheduler lock fails.

We did this at IBM after we'd done the dynamic logical partitioning.
Basically, there was a way to probe the CPU for the number of
correctable errors it was encountering.  Above a certain threshold, the CPU
was considered "faulty" and we offlined it before it encountered
an uncorrectable error.
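
The threshold check described above can be sketched roughly as follows.
This is only an illustration, not the IBM code: the struct, the function
name, the one-hour window and the limit of 10 errors are all made-up
assumptions, and the caller is assumed to obtain the correctable error
count from whatever interface the hardware offers.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical values, chosen only for illustration. */
#define WINDOW_SECS     3600u   /* length of one counting window */
#define ERROR_THRESHOLD 10u     /* correctable errors tolerated per window */

/* Hypothetical per-CPU health record. */
struct cpu_health {
    uint64_t window_start;      /* start of current window, in seconds */
    uint32_t errors_in_window;  /* correctable errors seen in this window */
    bool     offline_requested; /* latched once the threshold is exceeded */
};

/*
 * Account for new_errors correctable errors observed at time now (seconds).
 * Returns true once the CPU should be offlined; the decision is sticky.
 */
bool
cpu_health_record(struct cpu_health *h, uint64_t now, uint32_t new_errors)
{
    assert(h != NULL);

    /* Start a fresh window when the old one has expired. */
    if (now - h->window_start >= WINDOW_SECS) {
        h->window_start = now;
        h->errors_in_window = 0;
    }

    h->errors_in_window += new_errors;
    if (h->errors_in_window > ERROR_THRESHOLD)
        h->offline_requested = true;

    return h->offline_requested;
}
```

A periodic poller would call cpu_health_record() for each CPU and start
the offline procedure the first time it returns true.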

We did the same thing for memory, too (that one I was directly involved in).

The basic trouble, though, is that at least for memory, there didn't
seem to be a correlation between the rate of correctable ECC errors and
the occurrence of an uncorrectable error.

> And what do you mean by 'fails'? Do you run constant diagnostics?
> How do you tell when it has failed? It'd be hard to detect that 'multiply'
> has suddenly started giving bad results now and then.

I'd assume this is predicated on the hardware having
some redundancy and some way to query the error rate.  I've done a
little work with memory ECC on the device driver end, and at least
there the hardware definitely reports correctable and uncorrectable ECC
via some registers.  But I don't know if there's any way to query this
for a CPU (and of course each CPU would be different).

However, all that said, it's a moderately large project to get an OS
ready to handle things like holes appearing in its logical CPU ID
space (how do you serialize this when you want the common case to not
take a lock?), and to do all the wizardry of unscheduling (what do you
do with a bound thread?) and then actually shutting the CPU down via
firmware so it doesn't continue running.  I started working on this
for Linux when I worked at IBM, somewhere around 2004, and then IBM
got sued by SCO so they pulled me off the project.  It was finished up
by a colleague and friend.
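
To make the "holes in the CPU ID space" problem concrete, here is a
minimal sketch using C11 atomics.  It is not how Linux or FreeBSD do it;
the mask, the 64-CPU limit and all function names are assumptions for
illustration.  Readers take one atomic snapshot of the online mask and
walk it without a lock, skipping any IDs that have gone offline.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdint.h>

#define MAX_CPUS 64u    /* assumed limit so one 64-bit word holds the mask */

/* Bit i is set while logical CPU i is online (hypothetical global). */
static _Atomic uint64_t online_mask;

void
cpu_set_online(unsigned id)
{
    assert(id < MAX_CPUS);
    atomic_fetch_or(&online_mask, UINT64_C(1) << id);
}

void
cpu_set_offline(unsigned id)
{
    assert(id < MAX_CPUS);
    atomic_fetch_and(&online_mask, ~(UINT64_C(1) << id));
}

/* Lock-free read of the current mask; may be stale as soon as it returns. */
uint64_t
cpu_online_snapshot(void)
{
    return atomic_load(&online_mask);
}

/* First online CPU id >= start in snapshot, or -1 if there is none. */
int
cpu_next_online(uint64_t snapshot, unsigned start)
{
    for (unsigned id = start; id < MAX_CPUS; id++)
        if (snapshot & (UINT64_C(1) << id))
            return (int)id;
    return -1;
}

/* Example reader: count online CPUs, tolerating holes in the ID space. */
unsigned
cpu_count_online(void)
{
    uint64_t snap = cpu_online_snapshot();
    unsigned n = 0;

    for (int id = cpu_next_online(snap, 0); id >= 0;
        id = cpu_next_online(snap, (unsigned)id + 1))
        n++;
    return n;
}
```

The snapshot keeps the common case lock-free, but it does not solve the
hard part: a reader may still act on a CPU that went offline just after
the snapshot was taken, so the offline path has to wait for such readers
or otherwise make stale use harmless.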

You can probably get a first approximation by forcing, e.g., the
idle thread to never be switched out when the CPU appears unstable.
Then at least it's running fewer instructions, and less likely to
generate a machine check.
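
A toy version of that idea, assuming a simple per-CPU run queue (the
types and names below are invented and do not match the FreeBSD
scheduler):

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical, heavily simplified thread and run-queue types. */
struct thread {
    const char    *name;
    struct thread *next;
};

struct cpu_rq {
    struct thread *head;     /* runnable threads, singly linked */
    struct thread  idle;     /* this CPU's idle thread */
    bool           unstable; /* set when the CPU looks unhealthy */
};

/*
 * Choose what this CPU runs next.  An unstable CPU always gets its idle
 * thread; queued threads stay on the queue untouched.
 */
struct thread *
pick_next(struct cpu_rq *rq)
{
    assert(rq != NULL);

    if (rq->unstable || rq->head == NULL)
        return &rq->idle;

    /* Normal case: dequeue the first runnable thread. */
    struct thread *t = rq->head;
    rq->head = t->next;
    t->next = NULL;
    return t;
}
```

In this sketch the threads left on an unstable CPU's queue would starve,
so something else would have to migrate them to a healthy CPU.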

Cheers,
matthew