Date: Wed, 15 Feb 2012 10:51:28 +1100
From: Jan Mikkelsen <janm-freebsd-hackers@transactionware.com>
To: Julian Elischer <julian@freebsd.org>
Cc: Maninya M <maninya@gmail.com>, freebsd-hackers@freebsd.org
Subject: Re: OS support for fault tolerance
Message-ID: <D2890B34-AA3E-4495-8B9F-066153BFD0CF@transactionware.com>
In-Reply-To: <4F3A9266.9050905@freebsd.org>
References: <CAC46K3mc=V=oBOQnvEp9iMTyNXKD1Ki_+D0Akm8PM7rdJwDF8g@mail.gmail.com> <4F3A9266.9050905@freebsd.org>
On 15/02/2012, at 3:57 AM, Julian Elischer wrote:

> On 2/14/12 6:23 AM, Maninya M wrote:
>> For multicore desktop computers, suppose one of the cores fails; the
>> FreeBSD OS crashes. My question is about how I can make the OS tolerate
>> this hardware fault.
>> The strategy is to checkpoint the state of each core at specific intervals
>> of time in main memory. Once a core fails, its previous state is retrieved
>> from main memory, and the processes that were running on it are
>> rescheduled on the remaining cores.
>>
>> I read that the OS tolerates faults in large servers. I need to make it do
>> this for a desktop OS. I assume I would have to change the scheduler
>> program. I am using FreeBSD 9.0 on an Intel Core i5 quad-core machine.
>> How do I go about doing this? What exactly do I need to save for the
>> "state" of the core? What else do I need to know?
>> I have absolutely no experience with kernel programming or with FreeBSD.
>> Any pointers to good sources about modifying the source code of FreeBSD
>> would be greatly appreciated.
>
> This question has always intrigued me, because I'm always amazed
> that people actually try.
> From my viewpoint, there's really not much you can do if the core
> that is currently holding the scheduler lock fails.
> And what do you mean by "fails"? Do you run constant diagnostics?
> How do you tell when it has failed? It'd be hard to detect that 'multiply'
> has suddenly started giving bad results now and then.
>
> If it just "stops", then you might be able to have a watchdog that
> notices, but what do you do when it was halfway through rearranging
> a list of items? First, you have to find out that it held
> the lock for the module, and then you have to find out what it had
> done and clean up the mess.
>
> This requires rewriting many, many parts of the kernel to remove
> "transient inconsistent states". And even then, what do you do if it
> was halfway through manipulating some hardware?
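[As a rough illustration of what the question calls the "state" of a core: at minimum it is the register context the kernel already saves on a context switch. The sketch below is hypothetical C, not FreeBSD's actual pcb/trapframe layout; the names are invented for illustration. As Julian points out, saving and restoring this is the easy part -- the locks held and in-flight device operations are what the checkpoint misses.]

```c
#include <stdint.h>
#include <string.h>

/* Hypothetical per-core checkpoint: the register context a scheduler
 * would need to resume the interrupted thread elsewhere.  Field names
 * are illustrative only. */
struct core_checkpoint {
    uint64_t gpr[16];   /* general-purpose registers */
    uint64_t pc;        /* program counter */
    uint64_t sp;        /* stack pointer */
    uint64_t flags;     /* condition codes / interrupt-enable state */
};

/* Saving is just a copy into main memory; none of the shared kernel
 * state the core was mutating is captured here. */
void checkpoint_save(struct core_checkpoint *dst,
                     const struct core_checkpoint *live)
{
    memcpy(dst, live, sizeof *dst);
}
```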
>
> And when you've figured all that out, how do you cope with the
> mess it made because it was dying?
> Say, for example, it had started calculating bad memory offsets
> before writing out some stuff, and written data out over random memory?
>
> But I'm interested in any answers people may have.

Back in the '90s I spent a bunch of time looking at and using systems that dealt with this kind of failure.

There are two basic approaches: with software support and without. The basic distinction is what the hardware can do when something breaks: is it able to continue, or must it stop immediately?

Tandem had systems with both approaches:

The NonStop proprietary operating system had nodes with lock-step processors and lots of error checking that would stop immediately when something broke. A CPU failure turned into a node halt. There was a bunch of work to have nodes move their state around so that terminal sessions would not be interrupted, transactions would be rolled back, and everything would be left in a consistent state.

The Integrity Unix range was based on MIPS RISC/os, with a lot of work done at Tandem. We had the R2000- and later the R3000-based systems. They had three CPUs all in lock step with voting ("triple modular redundancy"), and entirely duplicated memory, all with ECC. Redundant busses, separate cabinets for controllers, and separate cabinets for each side of the disk mirror. You could pull out a CPU board and a memory board, show a manager, and then plug them back in.

Tandem claimed to have removed 80% of panics from the kernel, and changed the device driver architecture so that they could recover from some driver faults by reinitialising driver state on a running system.

We still had some outages on this system, all caused by software. It was also expensive: AUD$1,000,000 for a system with the same underlying CPU/memory as a $30k MIPS workstation at the time. It was also slower because of the error-checking overhead.
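[The "voting" in triple modular redundancy can be sketched as a two-out-of-three majority function. This is only a toy software illustration of the idea -- on the Tandem Integrity machines the voting was done in hardware, not code like this.]

```c
#include <stdint.h>

/* Bitwise two-of-three majority vote: each output bit takes the value
 * that at least two of the three redundant inputs agree on, so a single
 * faulty CPU producing a bad result is outvoted. */
uint32_t tmr_vote(uint32_t a, uint32_t b, uint32_t c)
{
    return (a & b) | (a & c) | (b & c);
}
```

A single disagreeing input never changes the result; detecting *which* unit disagreed (for the "pull out a CPU board" repair) is a separate comparison step.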
However, it did crash much less than the MIPS boxes.

Coming back to the multicore issue:

The problem when a core fails is that it has affected more than its own state. It will be holding locks on shared resources, and may have corrupted shared memory or asked a device to do the wrong thing. By the time you detect a fault in a core, it is too late. Checkpointing to main memory means that you need to be able to roll back to a checkpoint and replay the operations you know about. That involves more than CPU core state; it includes process, file, and device state.

The Tandem lesson is that it is much easier when you involve the higher-level software in dealing with these issues. Building a system where the application programmer can be ignorant of the need to deal with failure is much harder than exposing units of work to the application programmer, so that you can just fail a node and replay the work somewhere else. Transactions are your friend.

There is lots of literature on this stuff. My favourite is "Transaction Processing: Concepts and Techniques" (Gray & Reuter), which has a bunch of interesting material, including the underlying techniques. I can't recall other references at the moment; they're on the bookshelf at home.

Regards,

Jan.

janm@transactionware.com
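[The units-of-work idea above -- fail the node, replay the work elsewhere -- can be sketched in miniature. All names below are hypothetical; real transaction systems add logging, isolation, and durable commit, but the core property is the same: each unit is idempotent and recomputable from its input, so nothing the failed node was doing needs to be repaired in place.]

```c
#include <stddef.h>

/* A discrete, replayable unit of work.  PENDING units may have been
 * in flight on a failed node; COMMITTED units are done for good. */
enum unit_state { PENDING, COMMITTED };

struct work_unit {
    enum unit_state state;
    int input;
    int result;
};

/* Idempotent: recomputes entirely from the input, so running it a
 * second time after a crash is harmless. */
void run_unit(struct work_unit *u)
{
    u->result = u->input * 2;   /* stand-in for the real work */
    u->state = COMMITTED;
}

/* A surviving node re-runs everything the failed node never committed;
 * returns how many units were replayed. */
int replay_uncommitted(struct work_unit *units, size_t n)
{
    int replayed = 0;
    for (size_t i = 0; i < n; i++) {
        if (units[i].state != COMMITTED) {
            run_unit(&units[i]);
            replayed++;
        }
    }
    return replayed;
}
```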