From: Jim Bryant
Date: Tue, 14 Feb 2012 19:39:13 -0600
To: Brandon Falk
Cc: freebsd-hackers@freebsd.org
Subject: Re: OS support for fault tolerance
Message-ID: <4F3B0CC1.4050808@gmail.com>
In-Reply-To: <4F3A9907.8000903@gamozo.org>

Brandon Falk wrote:
> On 2/14/2012 12:05 PM, Jason Hellenthal wrote:
>
>> On Tue, Feb 14, 2012 at 08:57:10AM -0800, Julian Elischer wrote:
>>
>>> On 2/14/12 6:23 AM, Maninya M wrote:
>>>
>>>> For multicore desktop computers, suppose one of the cores fails, the
>>>> FreeBSD OS crashes. My question is about how I can make the OS tolerate
>>>> this hardware fault.
>>>> The strategy is to checkpoint the state of each core at specific intervals
>>>> of time in main memory. Once a core fails, its previous state is retrieved
>>>> from the main memory, and the processes that were running on it are
>>>> rescheduled on the remaining cores.
>>>>
>>>> I read that the OS tolerates faults in large servers. I need to make it do
>>>> this for a Desktop OS. I assume I would have to change the scheduler
>>>> program. I am using FreeBSD 9.0 on an Intel core i5 quad core machine.
>>>> How do I go about doing this? What exactly do I need to save for the
>>>> "state" of the core? What else do I need to know?
>>>> I have absolutely no experience with kernel programming or with FreeBSD.
>>>> Any pointers to good sources about modifying the source-code of FreeBSD
>>>> would be greatly appreciated.
>>>>
>>> This question has always intrigued me, because I'm always amazed
>>> that people actually try.
>>> From my viewpoint, there's really not much you can do if the core
>>> that is currently holding the scheduler lock fails.
>>> And what do you mean by "fails"? do you run constant diagnostics?
>>> how do you tell when it is failed? It'd be hard to detect that 'multiply'
>>> has suddenly started giving bad results now and then.
>>>
>>> if it just "stops" then you might be able to have a watchdog that
>>> notices, but what do you do when it was half way through rearranging
>>> a list of items?
>>> First, you have to find out that it held
>>> the lock for the module and then you have to find out what it had
>>> done and clean up the mess.
>>>
>>> This requires rewriting many many parts of the kernel to remove
>>> "transient inconsistent states". and even then, what do you do if it
>>> was half way through manipulating some hardware..
>>>
>>> and when you've figured that all out, how do you cope with the
>>> mess it made because it was dying?
>>> Say for example it had started calculating bad memory offsets
>>> before writing out some stuff and written data out over random memory?
>>>
>>> but I'm interested in any answers people may have
>>>
>>>
>> How about core redundancy ? effectively this would reduce the amount of
>> available cores in half if you spread a process to run on two cores at
>> the same time but with an option to adjust this per process etc... I
>> don't see it as infeasible.
>>
>>
>
> The overhead for all of the error checking and redundancy makes this idea pretty
> impractical. You'd have to have 2 cores to do the exact same thing, then some
> 'master' core that makes sure they're doing the right stuff, and if you really
> want to think about it... what if the core monitoring the cores fails... there's
> a threshold of when redundancy gets pointless.
>
> Perhaps I'm missing out on something, but you can't check the checker (without
> infinite redundancy).
>
> Honestly, if you're worried about a core failing, please take your server
> cluster out of the 1000 deg C forge.
>
> -Brandon
>

Don't forget that cache would have to be redundant too. The redundant
cores must not share an on-die cache.

Oh, and the real biggie..... What about the chipset and busses??? Those
would NOT be redundant.
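
For anyone who wants to poke at the redundancy idea quoted above, here is a
rough user-space sketch in Python. It is a toy model, not anything FreeBSD
does. The names healthy, faulty, vote and run_redundant are made up for this
example. Each "replica" stands in for a core running the same computation,
and a voter compares the outputs. With two replicas a mismatch can only be
detected; with three, a single bad result can be outvoted. The voter itself
runs once, so it is still the unchecked checker Brandon describes, and the
model ignores shared caches, the chipset and the buses entirely.

```python
from collections import Counter


def vote(results):
    """Return (value, ok); ok is False when no strict majority agrees."""
    value, count = Counter(results).most_common(1)[0]
    return value, count * 2 > len(results)


def run_redundant(fn, arg, replicas):
    """Run fn(arg) once per replica and vote on the outputs.

    Each replica is a hypothetical stand-in for one core: a callable
    taking (fn, arg) and returning what that core computed.
    """
    return vote([replica(fn, arg) for replica in replicas])


def healthy(fn, arg):
    # A core that computes the result correctly.
    return fn(arg)


def faulty(fn, arg):
    # A core whose result is silently wrong: flip the low bit.
    return fn(arg) ^ 1


def square(x):
    return x * x


if __name__ == "__main__":
    # Three replicas, one bad: the majority masks the fault.
    print(run_redundant(square, 7, [healthy, healthy, faulty]))
    # Two replicas, one bad: the mismatch is detected, but there is
    # no way to tell which result is the right one.
    print(run_redundant(square, 7, [healthy, faulty]))
```

In the two-replica case only the ok flag is meaningful; the returned value is
just whichever result was seen first.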