From: Devin Teske
To: "'Julian Elischer'", "'Rayson Ho'"
Cc: 'Maninya M', freebsd-hackers@freebsd.org
Subject: RE: OS support for fault tolerance
Date: Tue, 14 Feb 2012 16:20:35 -0800
Message-ID: <09d201cceb77$a3f46440$ebdd2cc0$@fisglobal.com>
In-Reply-To: <4F3AE7D9.8020204@freebsd.org>
References: <4F3A9266.9050905@freebsd.org> <4F3AE7D9.8020204@freebsd.org>
List-Id: Technical Discussions relating to FreeBSD

> -----Original Message-----
> From: owner-freebsd-hackers@freebsd.org [mailto:owner-freebsd-hackers@freebsd.org] On Behalf Of Julian Elischer
> Sent: Tuesday, February 14, 2012 3:02 PM
> To: Rayson Ho
> Cc: Maninya M; freebsd-hackers@freebsd.org
> Subject: Re: OS support for fault tolerance
>
> On 2/14/12 9:27 AM, Rayson Ho wrote:
> > On Tue, Feb 14, 2012 at 11:57 AM, Julian Elischer wrote:
> >> but I'm interested in any answers people may have
> > The way other OSes handle this is by detecting any abnormal amounts of
> > faults (sometimes it's not the fault of the hardware - e.g. when a
> > particle from outer space hits a core and flips the bit), then they
> > disable the core(s).
> >
> > Solaris & mainframe (z/OS) handle it this way, but you should google
> > and find more info since I don't remember all the details.
> >
> > Also, see this presentation: "Getting to know the Solaris Fault
> > Management Architecture (FMA)":
> >
> > http://www.prefetch.net/presentations/SolarisFaultManagement_Presentation.pdf
> True, but you can't guarantee that a cpu is going to fail in a way
> that you can detect like that.
> what if the clock just stops.. I believe that even those systems that
> support cpu deactivation on error only catch some percentage of the
> problems, and that sometimes it was more of
> "bring up the system without cpu X after it all crashed in flames".
>
> tandem and other systems in the old days used to be able to cope with
> dying cpus pretty well
> but they had support from top to bottom and the software was written
> with 'clustering' in mind.
>

Nowadays NEC has their sixth-generation "Fault Tolerant (FT) Series" servers, which are pretty much like the Tandem servers. We got a live demo of [simulated] CPU failure and the system kept chugging along.
But as Julian says, it's not guaranteed that the CPU will always fail in a predictable way (however, NEC has produced a VERY nice redundant package with a 256-bit backplane to keep everything nice and lock-step).

--
Devin