From owner-freebsd-hackers@FreeBSD.ORG Tue Feb 14 18:56:20 2012
Message-ID: <4F3AAB0C.1030404@uffe.org>
Date: Tue, 14 Feb 2012 19:43:44 +0100
From: Uffe Jakobsen
To: freebsd-hackers@freebsd.org
References: <4F3A9266.9050905@freebsd.org> <4F3A9622.9010708@gmail.com>
In-Reply-To: <4F3A9622.9010708@gmail.com>
Subject: Re: OS support for fault tolerance

On 2012-02-14 18:13, Joshua Isom wrote:
> On 2/14/2012 10:57 AM, Julian Elischer wrote:
>> On 2/14/12 6:23 AM, Maninya M wrote:
>>> For multicore desktop computers, suppose one of the cores fails, the
>>> FreeBSD OS crashes. My question is about how I can make the OS tolerate
>>> this hardware fault.
>>> The strategy is to checkpoint the state of each core at specific
>>> intervals of time in main memory. Once a core fails, its previous state
>>> is retrieved from the main memory, and the processes that were running
>>> on it are rescheduled on the remaining cores.
>>>
>>> I read that the OS tolerates faults in large servers. I need to make it
>>> do this for a Desktop OS. I assume I would have to change the scheduler
>>> program. I am using FreeBSD 9.0 on an Intel core i5 quad core machine.
>>> How do I go about doing this? What exactly do I need to save for the
>>> "state" of the core? What else do I need to know?
>>> I have absolutely no experience with kernel programming or with FreeBSD.
>>> Any pointers to good sources about modifying the source-code of FreeBSD
>>> would be greatly appreciated.
>> This question has always intrigued me, because I'm always amazed
>> that people actually try.
>> From my viewpoint, There's really not much you can do if the core
>> that is currently holding the scheduler lock fails.
>> And what do you mean by 'fails"? do you run constant diagnostics?
>> how do you tell when it is failed? It'd be hard to detect that 'multiply'
>> has suddenly started giving bad results now and then.
>>
>> if it just "stops" then you might be able to have a watchdog that
>> notices, but what do you do when it was half way through rearranging
>> a list of items? First, you have to find out that it held
>> the lock for the module and then you have to find out what it had
>> done and clean up the mess.
>>
>> This requires rewriting many many parts of the kernel to remove
>> 'transient inconsistent states". and even then, what do you do if it
>> was half way through manipulating some hardware..
>>
>> and when you've figured that all out, how do you cope with the
>> mess it made because it was dying?
>> Say for example it had started calculating bad memory offsets
>> before writing out some stuff and written data out over random memory?
>>
>> but I'm interested in any answers people may have
>>
>
> The only way I could see that it could be done, without direct hardware
> support, would be to virtualize it similar to how valgrind works.
> You'll take a speed hit bad enough to want to turn it off, but it could be
> possible. Testing that it works well could just mean overclocking your
> cpu until it starts crashing, and then seeing if it doesn't crash.
>

Sun/Fujitsu SPARC64 CPUs have had "mainframe class" memory mirroring,
end-to-end ECC protection, register ECC and hardware instruction retry
for many years now, for exactly the reasons we are discussing here:
fault tolerance, (high) availability, etc. These features are typically
grouped under the name RAS (reliability, availability and serviceability).

You can read more here:

http://www.fujitsu.com/global/services/computing/server/sparcenterprise/technology/availability/processor.html

/Uffe
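
PS: The one case in the quoted discussion that is easy to sketch in
software is Julian's 'if it just "stops"' watchdog. Below is a minimal
user-space illustration in C, assuming each core bumps its own heartbeat
counter and a monitor compares two snapshots taken some interval apart.
The name find_stalled_core and the counter layout are made up for this
example; this is not a FreeBSD kernel interface.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/*
 * Hypothetical helper (not a FreeBSD API): compare two snapshots of
 * per-core heartbeat counters and return the index of the first core
 * whose counter did not advance between them, or -1 if every core
 * made progress.
 */
static int
find_stalled_core(const uint64_t *prev, const uint64_t *cur, size_t ncores)
{
	assert(prev != NULL && cur != NULL);

	for (size_t i = 0; i < ncores; i++) {
		/* An unchanged counter means no heartbeat since the last snapshot. */
		if (cur[i] == prev[i])
			return (int)i;
	}
	return -1;
}
```

This only notices a core that has stopped making progress. It says
nothing about a core that keeps running but computes wrong results, and
it does not tell you which locks the stalled core still holds or what
state it left half-updated, which are the hard parts raised above.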