Date:      Tue, 14 Feb 2012 19:43:44 +0100
From:      Uffe Jakobsen <uffe@uffe.org>
To:        freebsd-hackers@freebsd.org
Subject:   Re: OS support for fault tolerance
Message-ID:  <4F3AAB0C.1030404@uffe.org>
In-Reply-To: <4F3A9622.9010708@gmail.com>
References:  <CAC46K3mc=V=oBOQnvEp9iMTyNXKD1Ki_%2BD0Akm8PM7rdJwDF8g@mail.gmail.com> <4F3A9266.9050905@freebsd.org> <4F3A9622.9010708@gmail.com>

On 2012-02-14 18:13, Joshua Isom wrote:
> On 2/14/2012 10:57 AM, Julian Elischer wrote:
>> On 2/14/12 6:23 AM, Maninya M wrote:
>>> For multicore desktop computers, suppose one of the cores fails, the
>>> FreeBSD OS crashes. My question is about how I can make the OS tolerate
>>> this hardware fault.
>>> The strategy is to checkpoint the state of each core at specific intervals
>>> of time in main memory. Once a core fails, its previous state is retrieved
>>> from the main memory, and the processes that were running on it are
>>> rescheduled on the remaining cores.
>>>
>>> I read that the OS tolerates faults in large servers. I need to make it do
>>> this for a Desktop OS. I assume I would have to change the scheduler
>>> program. I am using FreeBSD 9.0 on an Intel core i5 quad core machine.
>>> How do I go about doing this? What exactly do I need to save for the
>>> "state" of the core? What else do I need to know?
>>> I have absolutely no experience with kernel programming or with FreeBSD.
>>> Any pointers to good sources about modifying the source-code of FreeBSD
>>> would be greatly appreciated.
>> This question has always intrigued me, because I'm always amazed
>> that people actually try.
>> From my viewpoint, there's really not much you can do if the core
>> that is currently holding the scheduler lock fails.
>> And what do you mean by "fails"? Do you run constant diagnostics?
>> How do you tell when it has failed? It'd be hard to detect that 'multiply'
>> has suddenly started giving bad results now and then.
>>
>> if it just "stops" then you might be able to have a watchdog that
>> notices, but what do you do when it was half way through rearranging
>> a list of items? First, you have to find out that it held
>> the lock for the module and then you have to find out what it had
>> done and clean up the mess.
>>
>> This requires rewriting many, many parts of the kernel to remove
>> "transient inconsistent states". And even then, what do you do if it
>> was halfway through manipulating some hardware?
>>
>> and when you've figured that all out, how do you cope with the
>> mess it made because it was dying?
>> Say for example it had started calculating bad memory offsets
>> before writing out some stuff and written data out over random memory?
>>
>> but I'm interested in any answers people may have
>>
>
> The only way I could see that it could be done, without direct hardware
> support, would be to virtualize it similar to how valgrind works. You'll
> take a speed hit bad enough to want to turn it off, but it could be
> possible. Testing that it works well could just mean overclocking your
> cpu until it starts crashing, and then seeing if it doesn't crash.
>


Sun/Fujitsu SPARC64 CPUs have had "mainframe class" memory mirroring,
end-to-end ECC protection, register ECC and hardware instruction retry
for many years now, for exactly the reasons we are discussing here: fault
tolerance, (high) availability, etc. These features are typically called
RAS (reliability, availability and serviceability).


You can read more here:

http://www.fujitsu.com/global/services/computing/server/sparcenterprise/technology/availability/processor.html

/Uffe
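
As a rough illustration of the watchdog idea raised earlier in the thread
(noticing a core that simply "stops"), here is a minimal sketch in C. It is
not FreeBSD code: the struct, the function name and the max_missed threshold
are all hypothetical, and it assumes each core bumps its own counter on every
timer tick. It only covers detection; it does nothing about locks the dead
core held or state it left half-updated, which is the hard part described
above.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical per-core record; not an actual FreeBSD structure. */
struct core_heartbeat {
    uint64_t counter;    /* incremented by the core itself on every tick */
    uint64_t last_seen;  /* counter value the watchdog saw on its last pass */
    unsigned missed;     /* consecutive passes without progress */
};

/*
 * One watchdog pass over all cores. A core is reported as stalled once
 * its counter has not moved for max_missed consecutive passes.
 * Indices of stalled cores are written to stalled[] (which must have room
 * for ncores entries); the number of stalled cores is returned.
 */
size_t watchdog_scan(struct core_heartbeat *hb, size_t ncores,
                     unsigned max_missed, size_t *stalled)
{
    size_t nstalled = 0;

    for (size_t i = 0; i < ncores; i++) {
        if (hb[i].counter != hb[i].last_seen) {
            /* The core made progress since the last pass. */
            hb[i].last_seen = hb[i].counter;
            hb[i].missed = 0;
            continue;
        }
        /* No progress; cap the count so it cannot overflow. */
        if (hb[i].missed < max_missed)
            hb[i].missed++;
        if (hb[i].missed >= max_missed)
            stalled[nstalled++] = i;
    }
    return nstalled;
}
```

A surviving core would call watchdog_scan() periodically. Note that this
cannot detect the other failure mode mentioned in the thread, a core that
keeps running but computes wrong results.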






