Date:      Tue, 14 Feb 2012 16:20:35 -0800
From:      Devin Teske <devin.teske@fisglobal.com>
To:        "'Julian Elischer'" <julian@freebsd.org>, "'Rayson Ho'" <raysonlogin@gmail.com>
Cc:        'Maninya M' <maninya@gmail.com>, freebsd-hackers@freebsd.org
Subject:   RE: OS support for fault tolerance
Message-ID:  <09d201cceb77$a3f46440$ebdd2cc0$@fisglobal.com>
In-Reply-To: <4F3AE7D9.8020204@freebsd.org>
References:  <CAC46K3mc=V=oBOQnvEp9iMTyNXKD1Ki_%2BD0Akm8PM7rdJwDF8g@mail.gmail.com>	<4F3A9266.9050905@freebsd.org>	<CAHwLALOe1Zq86_AdO=D9pEEmOi_kT%2BrORMTXR-xEvhLX0Pt5gw@mail.gmail.com> <4F3AE7D9.8020204@freebsd.org>



> -----Original Message-----
> From: owner-freebsd-hackers@freebsd.org
> [mailto:owner-freebsd-hackers@freebsd.org] On Behalf Of Julian Elischer
> Sent: Tuesday, February 14, 2012 3:02 PM
> To: Rayson Ho
> Cc: Maninya M; freebsd-hackers@freebsd.org
> Subject: Re: OS support for fault tolerance
> 
> On 2/14/12 9:27 AM, Rayson Ho wrote:
> > On Tue, Feb 14, 2012 at 11:57 AM, Julian Elischer <julian@freebsd.org> wrote:
> >> but I'm interested in any answers people may have
> > The way other OSes handle this is by detecting any abnormal number of
> > faults (sometimes it's not the fault of the hardware, e.g. when a
> > particle from outer space hits a core and flips a bit), then they
> > disable the core(s).
> >
> > Solaris & mainframe (z/OS) handle it this way, but you should google
> > and find more info since I don't remember all the details.
> >
> > Also, see this presentation: "Getting to know the Solaris Fault
> > Management Architecture (FMA)":
> >
> > http://www.prefetch.net/presentations/SolarisFaultManagement_Presentation.pdf
> True, but you can't guarantee that a CPU is going to fail in a way
> that you can detect like that. What if the clock just stops? I believe
> that even those systems that support CPU deactivation on error only
> catch some percentage of the problems, and that sometimes it was more
> of "bring up the system without CPU X after it all crashed in flames".
> 
> Tandem and other systems in the old days used to be able to cope with
> dying CPUs pretty well, but they had support from top to bottom and
> the software was written with 'clustering' in mind.
> 

Nowadays NEC has their sixth-generation "Fault Tolerant (FT) Series" servers,
which are pretty much like the Tandem servers.

We got a live demo of [simulated] CPU failure and the system kept chugging
along.

But as Julian says, it's not guaranteed that the CPU will always fail in a
predictable way (however, NEC has produced a VERY nice redundant package with
a 256-bit backplane to keep everything nice and lock-step).
-- 
Devin



