Date:      Tue, 14 Feb 2012 09:21:19 -0800
From:      mdf@FreeBSD.org
To:        Julian Elischer <julian@freebsd.org>
Cc:        Maninya M <maninya@gmail.com>, freebsd-hackers@freebsd.org
Subject:   Re: OS support for fault tolerance
Message-ID:  <CAMBSHm_smeLhh4enPyGOGnNmd_DYYSe7ZUvZrdcFsx57p7Simw@mail.gmail.com>
In-Reply-To: <4F3A9266.9050905@freebsd.org>
References:  <CAC46K3mc=V=oBOQnvEp9iMTyNXKD1Ki_%2BD0Akm8PM7rdJwDF8g@mail.gmail.com> <4F3A9266.9050905@freebsd.org>

On Tue, Feb 14, 2012 at 8:57 AM, Julian Elischer <julian@freebsd.org> wrote:
> On 2/14/12 6:23 AM, Maninya M wrote:
>>
>> For multicore desktop computers, suppose one of the cores fails, the
>> FreeBSD OS crashes. My question is about how I can make the OS tolerate
>> this hardware fault.
>> The strategy is to checkpoint the state of each core at specific intervals
>> of time in main memory. Once a core fails, its previous state is retrieved
>> from the main memory, and the processes that were running on it are
>> rescheduled on the remaining cores.
>>
>> I read that the OS tolerates faults in large servers. I need to make it do
>> this for a Desktop OS. I assume I would have to change the scheduler
>> program. I am using FreeBSD 9.0 on an Intel core i5 quad core machine.
>> How do I go about doing this? What exactly do I need to save for the
>> "state" of the core? What else do I need to know?
>> I have absolutely no experience with kernel programming or with FreeBSD.
>> Any pointers to good sources about modifying the source-code of FreeBSD
>> would be greatly appreciated.
>
> This question has always intrigued me, because I'm always amazed
> that people actually try.
> From my viewpoint, there's really not much you can do if the core
> that is currently holding the scheduler lock fails.

We did this at IBM after we'd done the dynamic logical partitioning.
Basically, there was a way to probe the CPU for the number of
correctable errors it was encountering.  At too high a threshold, it
was considered "faulty" and we offlined the CPU before it encountered
an uncorrectable error.
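
A minimal sketch of that kind of threshold policy, assuming a cumulative
per-CPU correctable-error counter can be read from the hardware at regular
intervals.  All names here (cpu_health_poll, CE_THRESHOLD, MAX_CPUS) and the
threshold value are hypothetical; this is not the IBM code and not a FreeBSD
interface.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

#define MAX_CPUS     8
/* Assumed limit: correctable errors tolerated per polling interval. */
#define CE_THRESHOLD 16

/* Hypothetical per-CPU health record. */
struct cpu_health {
	uint64_t last_count;	/* cumulative count seen at the last poll */
	bool     online;	/* false once the CPU is judged "faulty" */
};

struct cpu_health cpus[MAX_CPUS];

/* Mark every CPU online with a zero error baseline. */
void
cpu_health_init(void)
{
	for (int i = 0; i < MAX_CPUS; i++) {
		cpus[i].last_count = 0;
		cpus[i].online = true;
	}
}

/*
 * Record the cumulative correctable-error count read for one CPU.
 * Returns true if this poll pushed the CPU over the threshold, in which
 * case it is marked offline before an uncorrectable error can occur.
 */
bool
cpu_health_poll(int cpu, uint64_t cumulative_count)
{
	struct cpu_health *h;
	uint64_t delta;

	assert(cpu >= 0 && cpu < MAX_CPUS);
	h = &cpus[cpu];
	delta = cumulative_count - h->last_count;
	h->last_count = cumulative_count;
	if (h->online && delta > CE_THRESHOLD) {
		h->online = false;
		return (true);
	}
	return (false);
}
```

The sketch only makes the decision; actually migrating work off the CPU and
stopping it is the hard part discussed further down.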

We did the same thing for memory, too (that one I was directly involved in).

The basic trouble, though, is that at least for memory, there didn't
seem to be a correlation between the rate of correctable ECC and an
uncorrectable error occurring.

> And what do you mean by 'fails'? Do you run constant diagnostics?
> How do you tell when it has failed? It'd be hard to detect that 'multiply'
> has suddenly started giving bad results now and then.

I'd assume this is predicated on the ability of the hardware to have
some redundancy and some way to query the error rate.  I've done a
little work with memory ECC on the device driver end, and at least
there the hardware definitely reports correctable and uncorrectable ECC
via some registers.  But I don't know if there's any way to query this
for a CPU (and of course each CPU would be different).

However, all that said, it's a moderately large project to get an OS
ready to handle things like holes appearing in its logical CPU ID
space (how do you serialize this when you want the common case to not
take a lock?), and to do all the wizardry of unscheduling (what do you
do with a bound thread?) and then actually shutting the CPU down via
firmware so it doesn't continue running.  I started working on this
for Linux when I worked at IBM, somewhere around 2004, and then IBM
got sued by SCO so they pulled me off the project.  It was finished up
by a colleague and friend.
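
One way to picture the "holes in the CPU ID space" problem is a sketch in
which the set of online CPUs is a single atomic bitmask, so the common-case
readers take no lock and iteration simply skips offlined IDs.  This assumes
at most 64 logical CPUs and C11 atomics; the function names are hypothetical
and it is not how Linux or FreeBSD actually implement it.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>
#include <stdint.h>

#define CPU_ID_LIMIT 64

/* Bit i is set while logical CPU i is online. */
_Atomic uint64_t online_mask;

void
cpu_set_online(int cpu)
{
	assert(cpu >= 0 && cpu < CPU_ID_LIMIT);
	atomic_fetch_or(&online_mask, UINT64_C(1) << cpu);
}

/* Offlining a CPU leaves a hole in the ID space. */
void
cpu_set_offline(int cpu)
{
	assert(cpu >= 0 && cpu < CPU_ID_LIMIT);
	atomic_fetch_and(&online_mask, ~(UINT64_C(1) << cpu));
}

/* Lock-free read for the common case. */
bool
cpu_is_online(int cpu)
{
	assert(cpu >= 0 && cpu < CPU_ID_LIMIT);
	return (((atomic_load(&online_mask) >> cpu) & 1) != 0);
}

/*
 * Return the next online CPU ID after 'cpu', or -1 if there is none.
 * Pass -1 to get the first online CPU.  Works on a snapshot of the mask.
 */
int
cpu_next_online(int cpu)
{
	uint64_t mask = atomic_load(&online_mask);

	for (int i = cpu + 1; i < CPU_ID_LIMIT; i++) {
		if ((mask >> i) & 1)
			return (i);
	}
	return (-1);
}
```

This covers only the bookkeeping.  A reader can still act on a stale
snapshot just after a CPU goes away, and closing that window without a lock
in the fast path is where most of the real work lies.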

You can probably come to a first approximation by forcing e.g. the
idle thread to not get switched out, when the CPU appears unstable.
Then at least it's running fewer instructions, and less likely to
generate a machine check.
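
As a toy illustration of that first approximation, the sketch below shows a
per-CPU "pick next thread" routine that always returns the idle thread once
the CPU has been flagged unstable.  The structures and the name
pick_next_thread are made up for the example and do not correspond to the
FreeBSD scheduler's real data structures.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical, heavily simplified thread and per-CPU scheduler state. */
struct thread {
	int            tid;
	struct thread *next;
};

struct cpu_sched {
	bool           unstable;	/* set when the CPU looks unreliable */
	struct thread *runq;		/* singly linked run queue */
	struct thread  idle;		/* this CPU's idle thread */
};

/*
 * Choose the next thread for this CPU.  On an unstable CPU the idle
 * thread is never switched out, so queued threads are left untouched.
 */
struct thread *
pick_next_thread(struct cpu_sched *cs)
{
	struct thread *td;

	assert(cs != NULL);
	if (cs->unstable || cs->runq == NULL)
		return (&cs->idle);
	td = cs->runq;
	cs->runq = td->next;
	td->next = NULL;
	return (td);
}
```

Threads left on the run queue of an unstable CPU would still have to be
moved to another CPU, which this sketch does not attempt.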

Cheers,
matthew


