Date: Mon, 20 Feb 2012 13:58:21 -0500
From: "Dieter BSD" <dieterbsd@engineer.com>
To: freebsd-hackers@freebsd.org
Subject: Re: OS support for fault tolerance
Message-ID: <20120220185822.300970@gmx.com>
Rayson writes:

> The question is, are we planning to handle >95% of the errors for >99%
> of the hardware we run on, or are we really planning to spend years
> trying to design something that would require special hardware
> support?

I assume this started as: "Oh look, most CPUs have multiple cores these days, maybe we could play with fault tolerance". That could be useful if CPU cores failed a lot, but in reality what fails is disks, disks, controllers, disks, random other things, and disks. That is assuming you have avoided the garbage-quality stuff and have the system on a UPS.

If you have enough ports you can add more disks and mirror them, or use some other version of RAID.

The next step is to duplicate everything. Not by looking for a mainboard with redundant everything, but by simply adding another computer. And rather than getting two of the same machine, you're better off if they are different, so that they don't have the same bugs.

The problem then is how to feed both machines the same inputs and compare the outputs. Do we need a third machine to supervise? That leads to the issue of how to avoid problems when *it* breaks. Can we have each machine keep an eye on the other, avoiding the need for a third machine?
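To make the two-machine idea concrete, here is a rough sketch in Python. It is only an illustration, not a design: the names `Replica`, `run_lockstep` and `peer_alive` are made up for this example, the two "machines" are plain in-process functions, and there is no network or real hardware involved. It shows the same inputs being fed to both replicas with the outputs compared, plus each side checking the other's heartbeat timestamp.

```python
import time
from dataclasses import dataclass, field
from typing import Callable, List, Optional


@dataclass
class Replica:
    """Hypothetical stand-in for one of the two machines."""
    name: str
    compute: Callable[[int], int]
    last_heartbeat: float = field(default_factory=time.monotonic)

    def beat(self, now: Optional[float] = None) -> None:
        # Record that this replica is still alive.
        self.last_heartbeat = time.monotonic() if now is None else now

    def peer_alive(self, peer: "Replica", timeout: float,
                   now: Optional[float] = None) -> bool:
        # Each replica watches the other; no third supervisor is involved.
        now = time.monotonic() if now is None else now
        return (now - peer.last_heartbeat) <= timeout


def run_lockstep(a: Replica, b: Replica, inputs: List[int]) -> List[int]:
    """Feed both replicas the same inputs; return indices where outputs differ."""
    mismatches = []
    for i, x in enumerate(inputs):
        if a.compute(x) != b.compute(x):
            mismatches.append(i)
    return mismatches


# Two different implementations, so they are less likely to share a bug.
good = Replica("a", lambda x: x * 2)
buggy = Replica("b", lambda x: x * 2 if x != 3 else 7)
print(run_lockstep(good, buggy, [1, 2, 3, 4]))  # [2]
```

Note what the sketch does not solve. With only two replicas, a mismatch tells you that something is wrong but not which machine is wrong. And a missed heartbeat cannot distinguish a dead peer from a broken link between the two, which is exactly where the question of a third machine comes back.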