Date:      Tue, 21 Feb 2012 00:22:48 -0800
From:      Julian Elischer <julian@freebsd.org>
To:        freebsd-hackers@freebsd.org
Cc:        Da Rock <9Phackers@herveybayaustralia.com.au>
Subject:   Re: OS support for fault tolerance
Message-ID:  <4F435458.9020204@freebsd.org>
In-Reply-To: <4F425987.6010506@herveybayaustralia.com.au>
References:  <CAC46K3mc=V=oBOQnvEp9iMTyNXKD1Ki_%2BD0Akm8PM7rdJwDF8g@mail.gmail.com>	<4F3A9266.9050905@freebsd.org> <20120214170533.GA35819@DataIX.net>	<4F3A9907.8000903@gamozo.org> <4F425987.6010506@herveybayaustralia.com.au>

On 2/20/12 6:32 AM, Da Rock wrote:
> On 02/15/12 03:25, Brandon Falk wrote:
>> On 2/14/2012 12:05 PM, Jason Hellenthal wrote:
>>> On Tue, Feb 14, 2012 at 08:57:10AM -0800, Julian Elischer wrote:
>>>> On 2/14/12 6:23 AM, Maninya M wrote:
>>>>> For multicore desktop computers, suppose one of the cores fails, the
>>>>> FreeBSD OS crashes. My question is about how I can make the OS
>>>>> tolerate this hardware fault.
>>>>> The strategy is to checkpoint the state of each core at specific
>>>>> intervals of time in main memory. Once a core fails, its previous
>>>>> state is retrieved from the main memory, and the processes that were
>>>>> running on it are rescheduled on the remaining cores.
>>>>>
>>>>> I read that the OS tolerates faults in large servers. I need to make
>>>>> it do this for a Desktop OS. I assume I would have to change the
>>>>> scheduler program. I am using FreeBSD 9.0 on an Intel core i5 quad
>>>>> core machine.
>>>>> How do I go about doing this? What exactly do I need to save for the
>>>>> "state" of the core? What else do I need to know?
>>>>> I have absolutely no experience with kernel programming or with
>>>>> FreeBSD.
>>>>> Any pointers to good sources about modifying the source-code of
>>>>> FreeBSD would be greatly appreciated.
>>>> This question has always intrigued me, because I'm always amazed
>>>> that people actually try.
>>>> From my viewpoint, there's really not much you can do if the core
>>>> that is currently holding the scheduler lock fails.
>>>> And what do you mean by "fails"? Do you run constant diagnostics?
>>>> How do you tell when it has failed? It'd be hard to detect that
>>>> 'multiply' has suddenly started giving bad results now and then.
>>>>
>>>> If it just "stops" then you might be able to have a watchdog that
>>>> notices, but what do you do when it was halfway through rearranging
>>>> a list of items? First, you have to find out that it held
>>>> the lock for the module, and then you have to find out what it had
>>>> done and clean up the mess.
>>>>
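
To make the watchdog point concrete, here is a minimal user-space sketch
in Python. It is a toy model, not FreeBSD kernel code: the class
HeartbeatWatchdog, its methods, and the lock label "sched_lock" are all
hypothetical names invented for illustration. Each simulated core checks
in periodically, and a core that stays silent past a timeout is flagged
along with whatever locks it was recorded as holding.

```python
class HeartbeatWatchdog:
    """Toy model: flag cores that stop checking in (hypothetical API)."""

    def __init__(self, core_ids, timeout):
        self.timeout = timeout
        # Time of the last heartbeat seen from each core.
        self.last_seen = {core: 0.0 for core in core_ids}
        # Locks each core is currently recorded as holding.
        self.held_locks = {core: set() for core in core_ids}

    def beat(self, core, now):
        """Record that `core` was alive at time `now`."""
        self.last_seen[core] = now

    def acquire(self, core, lock_name):
        self.held_locks[core].add(lock_name)

    def release(self, core, lock_name):
        self.held_locks[core].discard(lock_name)

    def stalled(self, now):
        """Cores whose last heartbeat is older than the timeout."""
        return sorted(core for core, seen in self.last_seen.items()
                      if now - seen > self.timeout)

    def orphaned_locks(self, now):
        """Locks still recorded as held by cores that look stalled."""
        return {core: sorted(self.held_locks[core])
                for core in self.stalled(now)
                if self.held_locks[core]}


wd = HeartbeatWatchdog(core_ids=[0, 1, 2, 3], timeout=2.0)
for core in range(4):
    wd.beat(core, now=1.0)
wd.acquire(2, "sched_lock")      # core 2 takes a lock, then goes silent
for core in (0, 1, 3):
    wd.beat(core, now=4.0)       # everyone except core 2 checks in again

stalled_cores = wd.stalled(now=4.0)
orphans = wd.orphaned_locks(now=4.0)
```

The sketch only answers "which core stopped, and which locks did it
hold". It says nothing about how far the core got through the update
the lock was protecting, which is the hard part described above.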
>>>> This requires rewriting many, many parts of the kernel to remove
>>>> "transient inconsistent states". And even then, what do you do if it
>>>> was halfway through manipulating some hardware?
>>>>
>>>> and when you've figured that all out, how do you cope with the
>>>> mess it made because it was dying?
>>>> Say, for example, it had started calculating bad memory offsets
>>>> before writing out some stuff, and had written data out over random
>>>> memory?
>>>>
>>>> but I'm interested in any answers people may have
>>>>
>>> How about core redundancy? Effectively this would cut the number of
>>> available cores in half if you spread a process to run on two cores
>>> at the same time, but with an option to adjust this per process etc.
>>> I don't see it as infeasible.
>>>
>> The overhead for all of the error checking and redundancy makes this
>> idea pretty impractical. You'd have to have 2 cores to do the exact
>> same thing, then some 'master' core that makes sure they're doing the
>> right stuff, and if you really want to think about it... what if the
>> core monitoring the cores fails... there's a threshold of when
>> redundancy gets pointless.
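
The redundancy scheme discussed here (several cores doing the same work,
with something comparing the answers) can be sketched as a toy
majority voter in Python. This is an illustration under stated
assumptions, not anything FreeBSD provides: vote, run_redundantly,
healthy_core and faulty_core are hypothetical names, and each "core" is
just a Python function.

```python
from collections import Counter


def vote(results):
    """Return (value, dissenters) if a strict majority agrees.

    If no strict majority exists, return (None, all indices): the
    disagreement is detected but cannot be attributed to one replica.
    Results must be hashable so they can be counted.
    """
    value, count = Counter(results).most_common(1)[0]
    if count * 2 <= len(results):
        return None, list(range(len(results)))
    dissenters = [i for i, r in enumerate(results) if r != value]
    return value, dissenters


def run_redundantly(task, arg, replicas):
    """Run the same task on every replica and vote on the outcomes."""
    return vote([replica(task, arg) for replica in replicas])


def healthy_core(task, arg):
    return task(arg)


def faulty_core(task, arg):
    # Models a core that silently produces a wrong answer.
    return task(arg) + 1


def square(x):
    return x * x


three_way = run_redundantly(square, 6,
                            [healthy_core, healthy_core, faulty_core])
two_way = run_redundantly(square, 6, [healthy_core, faulty_core])
```

With two replicas a mismatch shows that something is wrong but not which
copy to trust; it takes three to outvote a single bad result. The voter
itself still runs somewhere, so it remains a single point of failure,
which is the "what if the core monitoring the cores fails" objection.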
> Make no mistake here, I'm not really up with the guts of what this
> would require (the dog may not hunt at all). Consider me as the
> little boy throwing rocks at a hornets' nest :)
>
> That out of the way, how about this scenario: why can't the master
> be dynamic amongst the cores? One core would be the master of any 2
> cores (not itself).
>
> Another thought (probably more sci-fi than anything else) is about
> using the cores as individuals which work as a team and fire a weak
> team member that is failing.
>
> I have absolutely no idea how to accomplish this, but I thought it 
> might fire a few neurons in someone who does... :)

There are so many reasons this would be ineffective on standard hardware
that I have no idea where to begin, but see my email above.

>>
>> Perhaps I'm missing out on something, but you can't check the checker
>> (without infinite redundancy).
>>
>> Honestly, if you're worried about a core failing, please take your
>> server cluster out of the 1000 deg C forge.
>>
>> -Brandon