From owner-freebsd-hackers@FreeBSD.ORG Wed Feb 15 01:59:31 2012
Message-ID: <4F3B0BC7.4010804@gmail.com>
Date: Tue, 14 Feb 2012 19:35:03 -0600
From: Jim Bryant
To: Julian Elischer
Cc: Maninya M, freebsd-hackers@freebsd.org
Subject: Re: OS support for fault tolerance
References: <4F3A9266.9050905@freebsd.org>
In-Reply-To: <4F3A9266.9050905@freebsd.org>
List-Id: Technical Discussions relating to FreeBSD

Mirrored SMP? Even NonStops require a supervisory CPU subsystem to
manage what is working or not. SMP itself would have to be totally
rethought.

My suggestion is to study the examples of NonStop and Guardian-90.

Julian Elischer wrote:
> On 2/14/12 6:23 AM, Maninya M wrote:
>> For multicore desktop computers, suppose one of the cores fails, the
>> FreeBSD OS crashes. My question is about how I can make the OS tolerate
>> this hardware fault.
>> The strategy is to checkpoint the state of each core at specific
>> intervals of time in main memory. Once a core fails, its previous
>> state is retrieved from the main memory, and the processes that were
>> running on it are rescheduled on the remaining cores.
>>
>> I read that the OS tolerates faults in large servers. I need to make
>> it do this for a Desktop OS. I assume I would have to change the
>> scheduler program. I am using FreeBSD 9.0 on an Intel Core i5 quad
>> core machine.
>> How do I go about doing this? What exactly do I need to save for the
>> "state" of the core? What else do I need to know?
>> I have absolutely no experience with kernel programming or with FreeBSD.
>> Any pointers to good sources about modifying the source-code of FreeBSD
>> would be greatly appreciated.
> This question has always intrigued me, because I'm always amazed
> that people actually try.
> From my viewpoint, there's really not much you can do if the core
> that is currently holding the scheduler lock fails.
> And what do you mean by "fails"? Do you run constant diagnostics?
> How do you tell when it has failed? It'd be hard to detect that 'multiply'
> has suddenly started giving bad results now and then.
>
> If it just "stops" then you might be able to have a watchdog that
> notices, but what do you do when it was halfway through rearranging
> a list of items? First, you have to find out that it held
> the lock for the module and then you have to find out what it had
> done and clean up the mess.
>
> This requires rewriting many, many parts of the kernel to remove
> "transient inconsistent states". And even then, what do you do if it
> was halfway through manipulating some hardware...
>
> And when you've figured that all out, how do you cope with the
> mess it made because it was dying?
> Say for example it had started calculating bad memory offsets
> before writing out some stuff and written data out over random memory?
>
> But I'm interested in any answers people may have.
>
>> _______________________________________________
>> freebsd-hackers@freebsd.org mailing list
>> http://lists.freebsd.org/mailman/listinfo/freebsd-hackers
>> To unsubscribe, send any mail to
>> "freebsd-hackers-unsubscribe@freebsd.org"
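
To make Julian's watchdog point concrete, here is a rough user-space toy in Python. It is not FreeBSD kernel code, and every name in it (Core, Lock, stalled_cores, simulate, the run_queue list) is made up for illustration. It assumes a core that simply "stops" partway through a two-step list update while holding a lock. A heartbeat watchdog can notice the stall, but the lock is still owned by the dead core and the list is left half-updated, which is the cleanup problem described above.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Core:
    """Hypothetical stand-in for a CPU core that reports heartbeats."""
    core_id: int
    last_heartbeat: int = 0
    halted: bool = False

    def heartbeat(self, now: int) -> None:
        # A halted core never updates its timestamp again.
        if not self.halted:
            self.last_heartbeat = now


@dataclass
class Lock:
    """Hypothetical lock that only records which core holds it."""
    owner: Optional[int] = None  # core_id of the holder, or None


def stalled_cores(cores: List[Core], now: int, timeout: int) -> List[int]:
    """Return ids of cores whose last heartbeat is older than `timeout` ticks."""
    return [c.core_id for c in cores if now - c.last_heartbeat > timeout]


def simulate() -> dict:
    cores = [Core(0), Core(1)]
    lock = Lock()
    run_queue = ["A", "B", "C"]

    # Tick 1: both cores beat, then core 1 takes the lock and starts
    # moving "B" to the front of the queue.
    for c in cores:
        c.heartbeat(1)
    lock.owner = 1
    item = run_queue.pop(1)   # step 1 of 2: unlink the item
    cores[1].halted = True    # the core stops before step 2 (re-insert)

    # Ticks 2..5: only core 0 keeps beating.
    for now in range(2, 6):
        for c in cores:
            c.heartbeat(now)

    return {
        "stalled": stalled_cores(cores, now=5, timeout=3),
        "lock_owner": lock.owner,   # still held by the dead core
        "run_queue": run_queue,     # "B" is missing: inconsistent state
        "lost_item": item,          # only the dead core knew about this
    }


if __name__ == "__main__":
    print(simulate())
```

The watchdog part is the easy bit. Nothing in this toy tells the surviving core what the dead one was in the middle of doing, and it says nothing at all about a core that keeps running but computes wrong results.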