Date: Tue, 21 Feb 2012 00:22:48 -0800
From: Julian Elischer <julian@freebsd.org>
To: freebsd-hackers@freebsd.org
Cc: Da Rock <9Phackers@herveybayaustralia.com.au>
Subject: Re: OS support for fault tolerance
Message-ID: <4F435458.9020204@freebsd.org>
In-Reply-To: <4F425987.6010506@herveybayaustralia.com.au>
References: <CAC46K3mc=V=oBOQnvEp9iMTyNXKD1Ki_%2BD0Akm8PM7rdJwDF8g@mail.gmail.com> <4F3A9266.9050905@freebsd.org> <20120214170533.GA35819@DataIX.net> <4F3A9907.8000903@gamozo.org> <4F425987.6010506@herveybayaustralia.com.au>
On 2/20/12 6:32 AM, Da Rock wrote:
> On 02/15/12 03:25, Brandon Falk wrote:
>> On 2/14/2012 12:05 PM, Jason Hellenthal wrote:
>>> On Tue, Feb 14, 2012 at 08:57:10AM -0800, Julian Elischer wrote:
>>>> On 2/14/12 6:23 AM, Maninya M wrote:
>>>>> For multicore desktop computers, suppose one of the cores fails, the
>>>>> FreeBSD OS crashes. My question is about how I can make the OS tolerate
>>>>> this hardware fault.
>>>>> The strategy is to checkpoint the state of each core at specific intervals
>>>>> of time in main memory. Once a core fails, its previous state is retrieved
>>>>> from the main memory, and the processes that were running on it are
>>>>> rescheduled on the remaining cores.
>>>>>
>>>>> I read that the OS tolerates faults in large servers. I need to make it do
>>>>> this for a Desktop OS. I assume I would have to change the scheduler
>>>>> program. I am using FreeBSD 9.0 on an Intel core i5 quad core machine.
>>>>> How do I go about doing this? What exactly do I need to save for the
>>>>> "state" of the core? What else do I need to know?
>>>>> I have absolutely no experience with kernel programming or with FreeBSD.
>>>>> Any pointers to good sources about modifying the source-code of FreeBSD
>>>>> would be greatly appreciated.
>>>> This question has always intrigued me, because I'm always amazed
>>>> that people actually try.
>>>> From my viewpoint, There's really not much you can do if the core
>>>> that is currently holding the scheduler lock fails.
>>>> And what do you mean by 'fails"? do you run constant diagnostics?
>>>> how do you tell when it is failed? It'd be hard to detect that 'multiply'
>>>> has suddenly started giving bad results now and then.
>>>>
>>>> if it just "stops" then you might be able to have a watchdog that
>>>> notices, but what do you do when it was half way through rearranging
>>>> a list of items? First, you have to find out that it held
>>>> the lock for the module and then you have to find out what it had
>>>> done and clean up the mess.
>>>>
>>>> This requires rewriting many many parts of the kernel to remove
>>>> 'transient inconsistent states". and even then, what do you do if it
>>>> was half way through manipulating some hardware..
>>>>
>>>> and when you've figured that all out, how do you cope with the
>>>> mess it made because it was dying?
>>>> Say for example it had started calculating bad memory offsets
>>>> before writing out some stuff and written data out over random memory?
>>>>
>>>> but I'm interested in any answers people may have
>>>>
>>> How about core redundancy ? effectively this would reduce the amount of
>>> available cores in half in you spread a process to run on two cores at
>>> the same time but with an option to adjust this per process etc... I
>>> don't see it as unfeasable.
>>>
>> The overhead for all of the error checking and redundancy makes this idea
>> pretty impractical. You'd have to have 2 cores to do the exact same thing,
>> then some 'master' core that makes sure they're doing the right stuff, and
>> if you really want to think about it... what if the core monitoring the
>> cores fails... there's a threshold of when redundancy gets pointless.
> Make no mistake here, I'm not really up with the guts of what this
> would require (the dog may not hunt at all). Consider me as the
> little boy throwing rocks at a hornets nest :)
>
> That out of the way, how about this scenario: why can't the master
> be dynamic amongst the cores? 1 core be the master of any 2 cores
> (not itself).
>
> Another thought (probably more scifi then anything else) is about
> using the cores as individuals which work as a team and fire a weak
> team member that is failing.
>
> I have absolutely no idea how to accomplish this, but I thought it
> might fire a few neurons in someone who does... :)

There are so many reasons this would be ineffective on standard
hardware I have no idea where to begin, but see my email above..

>>
>> Perhaps I'm missing out on something, but you can't check the checker
>> (without infinite redundancy).
>>
>> Honestly, if you're worried about a core failing, please take your server
>> cluster out of the 1000 deg C forge.
>>
>> -Brandon
>
> _______________________________________________
> freebsd-hackers@freebsd.org mailing list
> http://lists.freebsd.org/mailman/listinfo/freebsd-hackers
> To unsubscribe, send any mail to
> "freebsd-hackers-unsubscribe@freebsd.org"
>