Date:      Tue, 14 Feb 2012 19:39:13 -0600
From:      Jim Bryant <kc5vdj.freebsd@gmail.com>
To:        Brandon Falk <falkman@gamozo.org>
Cc:        freebsd-hackers@freebsd.org
Subject:   Re: OS support for fault tolerance
Message-ID:  <4F3B0CC1.4050808@gmail.com>
In-Reply-To: <4F3A9907.8000903@gamozo.org>
References:  <CAC46K3mc=V=oBOQnvEp9iMTyNXKD1Ki_%2BD0Akm8PM7rdJwDF8g@mail.gmail.com>	<4F3A9266.9050905@freebsd.org> <20120214170533.GA35819@DataIX.net> <4F3A9907.8000903@gamozo.org>

Brandon Falk wrote:
> On 2/14/2012 12:05 PM, Jason Hellenthal wrote:
>   
>> On Tue, Feb 14, 2012 at 08:57:10AM -0800, Julian Elischer wrote:
>>     
>>> On 2/14/12 6:23 AM, Maninya M wrote:
>>>       
>>>> For multicore desktop computers, suppose one of the cores fails, the
>>>> FreeBSD OS crashes. My question is about how I can make the OS tolerate
>>>> this hardware fault.
>>>> The strategy is to checkpoint the state of each core at specific intervals
>>>> of time in main memory. Once a core fails, its previous state is retrieved
>>>> from the main memory, and the processes that were running on it are
>>>> rescheduled on the remaining cores.
>>>>
>>>> I read that the OS tolerates faults in large servers. I need to make it do
>>>> this for a Desktop OS. I assume I would have to change the scheduler
>>>> program. I am using FreeBSD 9.0 on an Intel core i5 quad core machine.
>>>> How do I go about doing this? What exactly do I need to save for the
>>>> "state" of the core? What else do I need to know?
>>>> I have absolutely no experience with kernel programming or with FreeBSD.
>>>> Any pointers to good sources about modifying the source-code of FreeBSD
>>>> would be greatly appreciated.
>>>>         
>>> This question has always intrigued me, because I'm always amazed
>>> that people actually try.
>>> From my viewpoint, there's really not much you can do if the core
>>> that is currently holding the scheduler lock fails.
>>> And what do you mean by "fails"? Do you run constant diagnostics?
>>> How do you tell when it has failed? It'd be hard to detect that 'multiply'
>>> has suddenly started giving bad results now and then.
>>>
>>> if it just "stops" then you might be able to have a watchdog that
>>> notices,  but what do you do when it was half way through rearranging
>>> a list of items? First, you have to find out that it held
>>> the lock for the module and then you have to find out what it had
>>> done and clean up the mess.
>>>
>>> This requires rewriting many, many parts of the kernel to remove
>>> "transient inconsistent states". And even then, what do you do if it
>>> was halfway through manipulating some hardware?
>>>
>>> and when you've figured that all out, how do you cope with the
>>> mess it made because it was dying?
>>> Say for example it had started calculating bad memory offsets
>>> before writing out some stuff and written data out over random memory?
>>>
>>> but I'm interested in any answers people may have
>>>
>>>       
>> How about core redundancy? Effectively this would cut the number of
>> available cores in half if you spread a process to run on two cores at
>> the same time, but with an option to adjust this per process etc... I
>> don't see it as infeasible.
>>
>>     
>
> The overhead for all of the error checking and redundancy makes this idea pretty
> impractical. You'd have to have 2 cores to do the exact same thing, then some
> 'master' core that makes sure they're doing the right stuff, and if you really
> want to think about it... what if the core monitoring the cores fails... there's
> a threshold of when redundancy gets pointless.
>
> Perhaps I'm missing out on something, but you can't check the checker (without
> infinite redundancy).
>
> Honestly, if you're worried about a core failing, please take your server
> cluster out of the 1000 deg C forge.
>
> -Brandon
>   
Don't forget that cache would have to be redundant too.  The redundant 
cores must not share an on-die cache.

Oh, and the real biggie.....  What about the chipset and busses???  
Those would NOT be redundant.

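To make the redundancy argument above concrete, here is a rough sketch in
Python (user-space only, nothing to do with the real FreeBSD scheduler).
Every name in it (run_redundant, healthy_core, faulty_core, square) is made
up for illustration, and the "cores" are just callables. It shows that two
replicas can only tell you that they disagree, while three let a majority
outvote the bad one. The comparison itself still runs somewhere, which is
the "you can't check the checker" problem.

```python
from collections import Counter


def run_redundant(task, arg, replicas):
    """Run task(arg) once per replica and compare the results.

    `replicas` is a list of callables standing in for cores (hypothetical);
    each takes (task, arg) and returns a hashable result.
    """
    results = [core(task, arg) for core in replicas]
    value, votes = Counter(results).most_common(1)[0]
    if votes == len(results):
        return "agree", value       # every replica produced the same answer
    if votes > len(results) // 2:
        return "outvoted", value    # a strict majority agrees, minority is suspect
    return "mismatch", None         # disagreement with no majority to trust


def healthy_core(task, arg):
    return task(arg)


def faulty_core(task, arg):
    # Simulates a core whose arithmetic has silently gone wrong.
    return task(arg) + 1


def square(x):
    return x * x


if __name__ == "__main__":
    print(run_redundant(square, 7, [healthy_core, faulty_core]))
    print(run_redundant(square, 7, [healthy_core, faulty_core, healthy_core]))
```

With two replicas the mismatch is detected but nobody knows which core to
believe; with three the odd one out is outvoted. None of this helps if the
replicas share a cache, chipset or bus that is itself the thing failing.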

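For the watchdog idea Julian mentions in the quoted text, a minimal sketch
of the detection half might look like the following. It assumes each core
records a timestamp every so often and that something else compares those
timestamps against a timeout; the function name and the dictionary layout
are assumptions for illustration, not an existing FreeBSD interface.

```python
def stalled_cores(last_heartbeat, now, timeout):
    """Return the ids of cores whose last heartbeat is older than `timeout`.

    `last_heartbeat` maps a core id to the time of its most recent
    heartbeat (hypothetical layout); `now` and `timeout` use the same unit.
    """
    return sorted(
        core for core, seen in last_heartbeat.items() if now - seen > timeout
    )


if __name__ == "__main__":
    beats = {0: 100, 1: 99, 2: 40, 3: 100}
    print(stalled_cores(beats, now=105, timeout=10))
```

This only notices a core that has gone quiet. It says nothing about which
locks that core held or what it left half done, and it will not catch a
core that keeps running but computes wrong results.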