Date: Fri, 24 Feb 2012 16:10:09 -0500
From: "Dieter BSD" <dieterbsd@engineer.com>
To: freebsd-hackers@freebsd.org
Subject: Re: OS support for fault tolerance
Message-ID: <20120224211011.300960@gmx.com>
>> The problem then is how to feed both machines the same inputs, and
>> compare the outputs. Do we need a third machine to supervise?
>> Can we have each machine keep an eye on the other, avoiding the
>> need for a third machine?
>
> A pair would work as long as the only failures are "obvious" (e.g.
> crashes). If they simply disagree as to the result, how would we
> determine which one was right?

Depends on what sort of work the machine is doing. If the job is
something that can be done again, you could simply try again; if you
still get different answers, try a third machine, or wade in and start
manually inspecting things until you find the problem. If the job is
time critical or you can't get the same inputs again, then the machine
needs to get it right the first time.

How many 9s of reliability do you need, and how many resources can you
throw at it? 2x hardware can be good for better than 5 9s (given high
quality hardware and software, and technicians standing by with cold
spares). I've heard that mil gear uses 3x hardware.

Building a 5 9s system is... non-trivial. So I'm wondering what sort
of reliability we can get with 2x off the shelf commodity hardware and
a bit of software. Similar to mirroring/RAID, but with whole computers
rather than just disks. Classic Unix technique of doing 10-20% of the
work and getting 80-90% of the result.

>> Which then leads to the issue of how to avoid problems when *it*
>> breaks.
>
> For some reason, this reminds me of a Dr. Seuss story:
> http://www.goodreads.com/review/show/49519038

*grin* Gotta love Dr. Seuss.
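
P.S. A rough sketch of the "try again, then try a third machine" idea for repeatable jobs, in Python. Everything here is made up for illustration: run_redundant, Disagreement, and the primary/secondary/tiebreaker callables stand in for however the job would really be shipped to each machine. It only covers the case where both machines answer but disagree; crashes, timeouts, and feeding both machines identical inputs are not handled.

```python
from typing import Callable, Optional, TypeVar

J = TypeVar("J")  # job description (hypothetical)
R = TypeVar("R")  # job result; must support ==


class Disagreement(Exception):
    """No two machines agreed; time to wade in and inspect things by hand."""


def run_redundant(
    job: J,
    primary: Callable[[J], R],
    secondary: Callable[[J], R],
    tiebreaker: Optional[Callable[[J], R]] = None,
    retries: int = 1,
) -> R:
    """Run a repeatable job on two machines and compare the answers.

    On a mismatch the job is rerun up to `retries` more times. If the
    pair still disagrees and a third machine is available, its answer
    is used to pick whichever of the last two results it matches.
    """
    for _ in range(retries + 1):
        a = primary(job)
        b = secondary(job)
        if a == b:
            return a

    if tiebreaker is not None:
        c = tiebreaker(job)
        if c == a:
            return a
        if c == b:
            return b

    raise Disagreement(f"machines disagree on job {job!r}: {a!r} vs {b!r}")
```

For a time critical job, or one whose inputs can't be replayed, the retry loop is useless and you are back to needing the answer right the first time, which is presumably where the 3x hardware comes in.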