From owner-freebsd-hackers@FreeBSD.ORG  Mon Feb 20 18:58:26 2012
Content-Type: text/plain; charset="utf-8"
Date: Mon, 20 Feb 2012 13:58:21 -0500
From: "Dieter BSD" <dieterbsd@engineer.com>
Message-ID: <20120220185822.300970@gmx.com>
MIME-Version: 1.0
To: freebsd-hackers@freebsd.org
Subject: Re: OS support for fault tolerance

Rayson writes:
> The question is, are we planning to handle >95% of the errors for >99%
> of the hardware we run on, or are we really planning to spend years
> trying to design something that would require special hardware
> support?

I assume this started as: "Oh look, most CPUs have multiple cores
these days, maybe we could play with fault tolerance".  That
could be useful if CPU cores failed a lot, but in reality what
fails is disks, disks, controllers, disks, random other things,
and disks.  That assumes you have avoided the garbage-quality stuff
and have the system on a UPS.  If you have enough ports you can
add more disks and mirror them, or use some other RAID level.

The next step is to duplicate everything.  Not by looking for
a mainboard with redundant everything, but by simply adding
another computer.  And rather than getting two of the same machine,
you're better off if they are different, so that they don't have
the same bugs.

The problem then is how to feed both machines the same inputs,
and compare the outputs.  Do we need a third machine to supervise?
Which then leads to the issue of how to avoid problems when *it* breaks.
Can we have each machine keep an eye on the other, avoiding the
need for a third machine?
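
To make the two-versus-three question concrete, here is a rough
sketch in Python.  Nothing in it comes from an existing FreeBSD
facility: compare_outputs() and PeerWatch are hypothetical names
made up for illustration, and the sketch assumes each machine's
output for a given input can be reduced to a hashable value.

```python
import time
from collections import Counter


def compare_outputs(outputs):
    """Compare the outputs of several machines for the same input.

    Returns (value, status), where status is "agree", "majority"
    or "mismatch".  Outputs must be hashable (assumption).
    """
    counts = Counter(outputs)
    value, votes = counts.most_common(1)[0]
    if votes == len(outputs):
        return value, "agree"
    if votes > len(outputs) / 2:
        # Only possible with three or more machines.
        return value, "majority"
    # Two machines that disagree: we know something is wrong,
    # but not which one is wrong.
    return None, "mismatch"


class PeerWatch:
    """Hypothetical watchdog: each machine runs one for its peer."""

    def __init__(self, timeout, now=time.monotonic):
        self.timeout = timeout      # seconds of silence tolerated
        self.now = now              # injectable clock, eases testing
        self.last_seen = now()

    def heartbeat(self):
        """Call whenever a heartbeat arrives from the peer."""
        self.last_seen = self.now()

    def peer_alive(self):
        return self.now() - self.last_seen <= self.timeout
```

With two machines, compare_outputs([42, 41]) can only report a
mismatch; with three, compare_outputs([42, 41, 42]) can pick the
majority value.  Likewise, two machines watching each other can
notice that the peer has gone silent, but a silent peer and a
broken link between them look the same from either side.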