From: Julian Elischer
Date: Tue, 21 Feb 2012 00:22:48 -0800
To: freebsd-hackers@freebsd.org
Cc: Da Rock <9Phackers@herveybayaustralia.com.au>
Subject: Re: OS support for fault tolerance
Message-ID: <4F435458.9020204@freebsd.org>
In-Reply-To: <4F425987.6010506@herveybayaustralia.com.au>
References: <4F3A9266.9050905@freebsd.org>
 <20120214170533.GA35819@DataIX.net> <4F3A9907.8000903@gamozo.org>
 <4F425987.6010506@herveybayaustralia.com.au>
List-Id: Technical Discussions relating to FreeBSD

On 2/20/12 6:32 AM, Da Rock wrote:
> On 02/15/12 03:25, Brandon Falk wrote:
>> On 2/14/2012 12:05 PM, Jason Hellenthal wrote:
>>> On Tue, Feb 14, 2012 at 08:57:10AM -0800, Julian Elischer wrote:
>>>> On 2/14/12 6:23 AM, Maninya M wrote:
>>>>> For multicore desktop computers, suppose one of the cores fails,
>>>>> the FreeBSD OS crashes. My question is about how I can make the OS
>>>>> tolerate this hardware fault.
>>>>> The strategy is to checkpoint the state of each core at specific
>>>>> intervals of time in main memory. Once a core fails, its previous
>>>>> state is retrieved from the main memory, and the processes that
>>>>> were running on it are rescheduled on the remaining cores.
>>>>>
>>>>> I read that the OS tolerates faults in large servers. I need to
>>>>> make it do this for a Desktop OS. I assume I would have to change
>>>>> the scheduler program. I am using FreeBSD 9.0 on an Intel core i5
>>>>> quad core machine.
>>>>> How do I go about doing this? What exactly do I need to save for
>>>>> the "state" of the core? What else do I need to know?
>>>>> I have absolutely no experience with kernel programming or with
>>>>> FreeBSD. Any pointers to good sources about modifying the
>>>>> source-code of FreeBSD would be greatly appreciated.
>>>> This question has always intrigued me, because I'm always amazed
>>>> that people actually try.
>>>> From my viewpoint, There's really not much you can do if the core
>>>> that is currently holding the scheduler lock fails.
>>>> And what do you mean by 'fails"? do you run constant diagnostics?
>>>> how do you tell when it is failed? It'd be hard to detect that
>>>> 'multiply' has suddenly started giving bad results now and then.
>>>>
>>>> if it just "stops" then you might be able to have a watchdog that
>>>> notices, but what do you do when it was half way through
>>>> rearranging a list of items? First, you have to find out that it
>>>> held the lock for the module and then you have to find out what it
>>>> had done and clean up the mess.
>>>>
>>>> This requires rewriting many many parts of the kernel to remove
>>>> 'transient inconsistent states".
>>>> and even then, what do you do if it was half way through
>>>> manipulating some hardware..
>>>>
>>>> and when you've figured that all out, how do you cope with the
>>>> mess it made because it was dying?
>>>> Say for example it had started calculating bad memory offsets
>>>> before writing out some stuff and written data out over random
>>>> memory?
>>>>
>>>> but I'm interested in any answers people may have
>>>>
>>> How about core redundancy ? effectively this would reduce the
>>> amount of available cores in half in you spread a process to run on
>>> two cores at the same time but with an option to adjust this per
>>> process etc... I don't see it as unfeasable.
>>>
>> The overhead for all of the error checking and redundancy makes
>> this idea pretty impractical. You'd have to have 2 cores to do the
>> exact same thing, then some 'master' core that makes sure they're
>> doing the right stuff, and if you really want to think about it...
>> what if the core monitoring the cores fails... there's a threshold
>> of when redundancy gets pointless.
> Make no mistake here, I'm not really up with the guts of what this
> would require (the dog may not hunt at all). Consider me as the
> little boy throwing rocks at a hornets nest :)
>
> That out of the way, how about this scenario: why can't the master
> be dynamic amongst the cores? 1 core be the master of any 2 cores
> (not itself).
>
> Another thought (probably more scifi then anything else) is about
> using the cores as individuals which work as a team and fire a weak
> team member that is failing.
>
> I have absolutely no idea how to accomplish this, but I thought it
> might fire a few neurons in someone who does... :)

There are so many reasons this would be ineffective on standard
hardware that I have no idea where to begin, but see my earlier email,
quoted above.

>>
>> Perhaps I'm missing out on something, but you can't check the
>> checker (without infinite redundancy).
>>
>> Honestly, if you're worried about a core failing, please take your
>> server cluster out of the 1000 deg C forge.
>>
>> -Brandon
>
> _______________________________________________
> freebsd-hackers@freebsd.org mailing list
> http://lists.freebsd.org/mailman/listinfo/freebsd-hackers
> To unsubscribe, send any mail to
> "freebsd-hackers-unsubscribe@freebsd.org"
>