From: "Dieter BSD"
To: freebsd-hackers@freebsd.org
Date: Mon, 20 Feb 2012 13:58:21 -0500
Subject: Re: OS support for fault tolerance
Message-ID: <20120220185822.300970@gmx.com>
List-Id: Technical Discussions relating to FreeBSD

Rayson writes:

> The question is, are we planning to handle >95% of the errors for >99%
> of the hardware we run on, or are we really planning to spend years
> trying to design something that would require special hardware
> support?

I assume this started as: "Oh look, most CPUs have multiple cores these
days, maybe we could play with fault tolerance". Which could be useful
if CPU cores failed a lot, but in reality what fails is disks, disks,
controllers, disks, random other things, and disks. That is assuming you
have avoided the garbage-quality stuff and have the system on a UPS. If
you have enough ports you can add more disks and mirror them, or use
some other version of RAID.

The next step is to duplicate everything.
Not by looking for a mainboard with redundant everything, but by simply
adding another computer. And rather than getting two of the same
machine, you're better off if they are different, so that they don't
have the same bugs.

The problem then is how to feed both machines the same inputs, and
compare the outputs. Do we need a third machine to supervise? Which
then leads to the issue of how to avoid problems when *it* breaks.
Can we have each machine keep an eye on the other, avoiding the need
for a third machine?
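
To make the two-machine idea concrete, here is a rough sketch in Python. It is not a design, just the shape of the problem. Everything in it is made up for illustration: `run_replicated`, `compare_outputs` and `PeerMonitor` are hypothetical names, the "machines" are plain functions, and a real setup would need a network transport and deterministic replicas. Note what it cannot do: with only two machines a mismatch tells you that something is wrong, not which machine is wrong.

```python
import time


def compare_outputs(out_a: bytes, out_b: bytes) -> bool:
    """Return True when both replicas produced byte-identical output."""
    return out_a == out_b


def run_replicated(request: bytes, replica_a, replica_b) -> bytes:
    """Feed the same input to both replicas and compare the results.

    replica_a and replica_b are stand-ins (assumptions) for the two
    machines; here they are just callables taking and returning bytes.
    """
    out_a = replica_a(request)
    out_b = replica_b(request)
    if not compare_outputs(out_a, out_b):
        # Two replicas can detect a disagreement but cannot vote.
        raise RuntimeError("replicas disagree; cannot tell which one is wrong")
    return out_a


class PeerMonitor:
    """Each machine runs one of these to keep an eye on the other.

    The peer is considered dead if no heartbeat arrived within timeout_s.
    The clock is injectable so the logic can be tested without sleeping.
    """

    def __init__(self, timeout_s: float, now=time.monotonic):
        self.timeout_s = timeout_s
        self._now = now
        self._last_seen = now()

    def heartbeat(self) -> None:
        """Record that the peer was just heard from."""
        self._last_seen = self._now()

    def peer_alive(self) -> bool:
        """True if the last heartbeat is recent enough."""
        return (self._now() - self._last_seen) <= self.timeout_s
```

A missed heartbeat is also ambiguous: the peer may be dead, or only the link between the two machines may be down, so the survivor still has to decide carefully before taking over.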