From: Julian Elischer
Date: Tue, 14 Feb 2012 15:01:45 -0800
To: Rayson Ho
Cc: Maninya M, freebsd-hackers@freebsd.org
Subject: Re: OS support for fault tolerance
Message-ID: <4F3AE7D9.8020204@freebsd.org>
References: <4F3A9266.9050905@freebsd.org>

On 2/14/12 9:27 AM, Rayson Ho wrote:
> On Tue, Feb 14, 2012 at 11:57 AM, Julian Elischer wrote:
>> but I'm interested in any answers people may have
>
> The way other OSes handle this is by detecting any abnormal amounts of
> faults (sometimes it's not the fault of the hardware - e.g. when a
> particle from outer space hits a core and flips the bit), then they
> disable the core(s).
>
> Solaris & mainframe (z/OS) handle it this way, but you should google
> and find more info since I don't remember all the details.
>
> Also, see this presentation: "Getting to know the Solaris Fault
> Management Architecture (FMA)":
> http://www.prefetch.net/presentations/SolarisFaultManagement_Presentation.pdf

True, but you can't guarantee that a CPU is going to fail in a way that
you can detect like that. What if the clock just stops?

I believe that even those systems that support CPU deactivation on error
only catch some percentage of the problems, and that sometimes it was
more a case of "bring up the system without CPU X after it all crashed
in flames".

Tandem and other systems in the old days used to be able to cope with
dying CPUs pretty well, but they had support from top to bottom and the
software was written with 'clustering' in mind.

> Rayson
>
> =================================
> Open Grid Scheduler / Grid Engine
> http://gridscheduler.sourceforge.net/
>
> Scalable Grid Engine Support Program
> http://www.scalablelogic.com/
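
The two failure modes discussed in this message can be illustrated with a small simulation. This is only a sketch under stated assumptions: `CpuState`, `faulty_by_error_count`, `faulty_by_heartbeat`, the threshold and the timeout are all hypothetical names and values invented for illustration, not part of FreeBSD, Solaris FMA, or z/OS. It shows that counting reported errors can flag a noisy CPU, while a CPU whose clock has stopped reports nothing and is only noticed by an external check such as a heartbeat watched by a peer.

```python
from dataclasses import dataclass


@dataclass
class CpuState:
    # Hypothetical per-CPU bookkeeping, for illustration only.
    error_reports: int = 0        # errors the CPU (or platform) has reported
    last_heartbeat: float = 0.0   # time a peer last saw this CPU make progress


def faulty_by_error_count(cpus, threshold):
    """Return CPUs whose reported error count exceeds the threshold."""
    return sorted(cpu for cpu, st in cpus.items() if st.error_reports > threshold)


def faulty_by_heartbeat(cpus, now, timeout):
    """Return CPUs that have not been seen making progress within the timeout."""
    return sorted(cpu for cpu, st in cpus.items() if now - st.last_heartbeat > timeout)


def demo():
    now = 10.0
    cpus = {
        0: CpuState(error_reports=0, last_heartbeat=9.9),   # healthy
        1: CpuState(error_reports=25, last_heartbeat=9.8),  # noisy but still running
        2: CpuState(error_reports=0, last_heartbeat=3.0),   # stopped: reports nothing
    }
    by_errors = faulty_by_error_count(cpus, threshold=10)
    by_heartbeat = faulty_by_heartbeat(cpus, now=now, timeout=1.0)
    return by_errors, by_heartbeat


if __name__ == "__main__":
    print(demo())
```

In this toy model the error-count check never sees CPU 2, because a stopped CPU generates no reports. Catching it needs something outside the failed CPU, which is roughly the kind of top-to-bottom support the clustered systems mentioned above were built around.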