From owner-freebsd-hackers@FreeBSD.ORG  Wed Feb 15 00:53:39 2012
Return-Path: <owner-freebsd-hackers@FreeBSD.ORG>
Delivered-To: freebsd-hackers@freebsd.org
Received: from mx1.freebsd.org (mx1.freebsd.org [IPv6:2001:4f8:fff6::34])
	by hub.freebsd.org (Postfix) with ESMTP id 9CD65106566B;
	Wed, 15 Feb 2012 00:53:39 +0000 (UTC)
	(envelope-from raysonlogin@gmail.com)
Received: from mail-pw0-f54.google.com (mail-pw0-f54.google.com
	[209.85.160.54])
	by mx1.freebsd.org (Postfix) with ESMTP id 6C62D8FC20;
	Wed, 15 Feb 2012 00:53:39 +0000 (UTC)
Received: by pbcxa7 with SMTP id xa7so1078911pbc.13
	for <multiple recipients>; Tue, 14 Feb 2012 16:53:38 -0800 (PST)
DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=gmail.com; s=gamma;
	h=mime-version:in-reply-to:references:date:message-id:subject:from:to
	:cc:content-type;
	bh=Xb1vhurKoAhYtcGkqYlw/N7F0USx2s5X7IkpppDs4iE=;
	b=uVcTfh2xGzzntnjtnNUpEW3aRkza0gHv/fSq1XwKDzag5BtLyCxdDaiII4iOikN00M
	H6ulSrcaXDpD0MBbsV2s1dRc7UCGvn9tAUxFyPR82TH6mMXPmuvfjv3KfxoYrcC4Iykq
	iMA5vU8pSPulIcEA5f8Vrl4BlNkvJ5MrhwjP8=
MIME-Version: 1.0
Received: by 10.68.239.229 with SMTP id vv5mr63805842pbc.88.1329267218699;
	Tue, 14 Feb 2012 16:53:38 -0800 (PST)
Received: by 10.142.245.14 with HTTP; Tue, 14 Feb 2012 16:53:38 -0800 (PST)
In-Reply-To: <4F3AE7D9.8020204@freebsd.org>
References: <CAC46K3mc=V=oBOQnvEp9iMTyNXKD1Ki_+D0Akm8PM7rdJwDF8g@mail.gmail.com>
	<4F3A9266.9050905@freebsd.org>
	<CAHwLALOe1Zq86_AdO=D9pEEmOi_kT+rORMTXR-xEvhLX0Pt5gw@mail.gmail.com>
	<4F3AE7D9.8020204@freebsd.org>
Date: Tue, 14 Feb 2012 19:53:38 -0500
Message-ID: <CAHwLALMYBLdTzJxxBjdAhA9eG-oGxoCCMp1sXHRViZ6om-Au_g@mail.gmail.com>
From: Rayson Ho <raysonlogin@gmail.com>
To: freebsd-hackers@freebsd.org
Content-Type: text/plain; charset=ISO-8859-1
Cc: Maninya M <maninya@gmail.com>
Subject: Re: OS support for fault tolerance
X-BeenThere: freebsd-hackers@freebsd.org
X-Mailman-Version: 2.1.5
Precedence: list
List-Id: Technical Discussions relating to FreeBSD
	<freebsd-hackers.freebsd.org>
List-Unsubscribe: <http://lists.freebsd.org/mailman/listinfo/freebsd-hackers>, 
	<mailto:freebsd-hackers-request@freebsd.org?subject=unsubscribe>
List-Archive: <http://lists.freebsd.org/pipermail/freebsd-hackers>
List-Post: <mailto:freebsd-hackers@freebsd.org>
List-Help: <mailto:freebsd-hackers-request@freebsd.org?subject=help>
List-Subscribe: <http://lists.freebsd.org/mailman/listinfo/freebsd-hackers>,
	<mailto:freebsd-hackers-request@freebsd.org?subject=subscribe>
X-List-Received-Date: Wed, 15 Feb 2012 00:53:39 -0000

On Tue, Feb 14, 2012 at 6:01 PM, Julian Elischer <julian@freebsd.org> wrote:
> True, but you can't guarantee that a cpu is going to fail in a way that you
> can detect like that. what if the clock just stops..

The question is, are we planning to handle >95% of the errors for >99%
of the hardware we run on, or are we really planning to spend years
trying to design something that would require special hardware
support?

On the zSeries mainframe, instructions are executed in lockstep on
redundant instruction pipelines, and if the results don't match, the
instruction is re-executed. This happens on every load and store.
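
As a rough illustration of the compare-and-retry idea, here is a toy
Python sketch. It is not how the zSeries hardware is implemented; the
names lockstep_execute, good_pipeline and FlakyPipeline are made up for
this example, and the "pipelines" are just callables.

```python
def lockstep_execute(op, pipelines, max_retries=3):
    """Run op on every pipeline and accept the result only if all agree.

    Returns (result, number_of_retries_needed). Raises RuntimeError if
    the pipelines still disagree after max_retries re-executions.
    """
    for attempt in range(max_retries + 1):
        results = [pipeline(op) for pipeline in pipelines]
        if all(r == results[0] for r in results):
            return results[0], attempt
        # Mismatch: fall through and re-execute the same operation.
    raise RuntimeError("pipelines still disagree after retries")


def good_pipeline(op):
    """A pipeline that always computes the operation correctly."""
    return op()


class FlakyPipeline:
    """Hypothetical pipeline that corrupts its first `faults` results."""

    def __init__(self, faults):
        self.faults = faults

    def __call__(self, op):
        value = op()
        if self.faults > 0:
            self.faults -= 1
            return value ^ 1  # flip the low bit to simulate a transient fault
        return value


if __name__ == "__main__":
    result, retries = lockstep_execute(
        lambda: 2 + 3, [good_pipeline, FlakyPipeline(faults=1)]
    )
    print(result, retries)
```

A transient fault is absorbed by one re-execution, while a persistent
one ends in an error that something else has to handle.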

Now, if you want software to do the same thing, you will need to
somehow checkpoint the state of not only the processor but the memory
as well, or else if the bad processor stores something to memory you
will still get corrupted data. Not only would the kernel become very
complicated, the system would also become very slow. And what if the
checkpointing code itself is executed by a faulty processor?
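
To make the memory point concrete, here is a toy Python sketch of
checkpoint and rollback. It assumes a made-up Machine class with a
register dict and a memory dict, and a caller-supplied verify function;
none of these correspond to a real kernel interface.

```python
import copy


class Machine:
    """Toy machine state: a register file plus memory (both hypothetical)."""

    def __init__(self):
        self.registers = {"acc": 0}
        self.memory = {}

    def checkpoint(self):
        # Snapshot BOTH registers and memory; saving the registers alone
        # would leave a bad store in memory after rollback.
        return copy.deepcopy((self.registers, self.memory))

    def rollback(self, snapshot):
        self.registers, self.memory = copy.deepcopy(snapshot)


def run_step(machine, step, verify):
    """Run one step; undo all of its effects if verify() rejects the result."""
    snapshot = machine.checkpoint()
    step(machine)
    if not verify(machine):
        machine.rollback(snapshot)
        return False
    return True


def good_step(machine):
    machine.registers["acc"] = 7
    machine.memory[0x10] = 7


def bad_step(machine):
    # Simulates a faulty processor storing garbage to memory.
    machine.registers["acc"] = 8
    machine.memory[0x20] = 999


def verify(machine):
    # Stand-in consistency check, chosen only for this example.
    return all(v == machine.registers["acc"] for v in machine.memory.values())


if __name__ == "__main__":
    m = Machine()
    print(run_step(m, good_step, verify), m.memory)
    print(run_step(m, bad_step, verify), m.memory)
```

Copying all of memory on every step is exactly the cost complained
about above, and the sketch still trusts whatever runs checkpoint()
and verify().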

IIRC, processors & disks don't usually just fail without warning.
That's the whole idea behind SMART, and behind Fault Management in
Solaris & other kernels.

http://hub.opensolaris.org/bin/view/Community+Group+fm/

Rayson

=================================
Open Grid Scheduler / Grid Engine
http://gridscheduler.sourceforge.net/

Scalable Grid Engine Support Program
http://www.scalablelogic.com/



> I believe that even those systems that support cpu deactivation on
> error only catch some percentage of the problems, and that sometimes
> it was more of "bring up the system without cpu X after it all
> crashed in flames".
>
> Tandem and other systems in the old days used to be able to cope with
> dying cpus pretty well, but they had support from top to bottom and
> the software was written with 'clustering' in mind.
>>>> _______________________________________________
>>>> freebsd-hackers@freebsd.org mailing list
>>>> http://lists.freebsd.org/mailman/listinfo/freebsd-hackers
>>>> To unsubscribe, send any mail to
>>>> "freebsd-hackers-unsubscribe@freebsd.org"
>>>>