From: Jan Mikkelsen
Date: Wed, 15 Feb 2012 10:51:28 +1100
To: Julian Elischer
Cc: Maninya M, freebsd-hackers@freebsd.org
Subject: Re: OS support for fault tolerance

On 15/02/2012, at 3:57 AM, Julian Elischer wrote:

> On 2/14/12 6:23 AM, Maninya M wrote:
>> On a multicore desktop computer, if one of the cores fails, the
>> FreeBSD OS crashes. My question is about how I can make the OS tolerate
>> this hardware fault.
>> The strategy is to checkpoint the state of each core at specific intervals
>> of time in main memory. Once a core fails, its previous state is retrieved
>> from main memory, and the processes that were running on it are
>> rescheduled on the remaining cores.
>>
>> I read that the OS tolerates faults in large servers. I need to make it do
>> this for a desktop OS. I assume I would have to change the scheduler
>> program. I am using FreeBSD 9.0 on an Intel Core i5 quad-core machine.
>> How do I go about doing this? What exactly do I need to save for the
>> "state" of the core? What else do I need to know?
>> I have absolutely no experience with kernel programming or with FreeBSD.
>> Any pointers to good sources about modifying the source code of FreeBSD
>> would be greatly appreciated.
>
> This question has always intrigued me, because I'm always amazed
> that people actually try.
> From my viewpoint, there's really not much you can do if the core
> that is currently holding the scheduler lock fails.
> And what do you mean by "fails"? Do you run constant diagnostics?
> How do you tell when it has failed? It'd be hard to detect that 'multiply'
> has suddenly started giving bad results now and then.
>
> If it just "stops" then you might be able to have a watchdog that
> notices, but what do you do when it was halfway through rearranging
> a list of items? First, you have to find out that it held
> the lock for the module, and then you have to find out what it had
> done and clean up the mess.
>
> This requires rewriting many, many parts of the kernel to remove
> "transient inconsistent states", and even then, what do you do if it
> was halfway through manipulating some hardware?
>
> And when you've figured that all out, how do you cope with the
> mess it made because it was dying?
> Say, for example, it had started calculating bad memory offsets
> before writing out some stuff and written data out over random memory?
>
> But I'm interested in any answers people may have.

Back in the '90s I spent a bunch of time looking at and using systems
that dealt with this kind of failure.

There are two basic approaches: with software support and without. The
basic distinction is what the hardware can do when something breaks. Is
it able to continue, or must it stop immediately?

Tandem had systems with both approaches:

The NonStop proprietary operating system had nodes with lock-step
processors and lots of error checking that would stop immediately when
something broke. A CPU failure turned into a node halt. There was a
bunch of work to have nodes move their state around so that terminal
sessions would not be interrupted, transactions would be rolled back,
and everything would be left in a consistent state.

The Integrity Unix range was based on MIPS RISC/os, with a lot of work
done at Tandem. We had the R2000- and later the R3000-based systems. They
had three CPUs all in lock step with voting ("triple modular redundancy"),
and entirely duplicated memory, all with ECC. Redundant busses, separate
cabinets for controllers and separate cabinets for each side of the disk
mirror. You could pull out a CPU board and a memory board, show a manager,
and then plug them back in.

Tandem claimed to have removed 80% of panics from the kernel, and
changed the device driver architecture so that they could recover from
some driver faults by reinitialising driver state on a running system.

We still had some outages on this system, all caused by software. It was
also expensive: AUD$1,000,000 for a system with the same underlying
CPU/memory as a $30k MIPS workstation at the time. It was also slower
because of the error-checking overhead. However, it did crash much less
than the MIPS boxes.

Coming back to the multicore issue:

The problem when a core fails is that it has affected more than its own
state. It will be holding locks on shared resources and may have
corrupted shared memory or asked a device to do the wrong thing. By the
time you detect a fault in a core, it is too late. Checkpointing to main
memory means that you need to be able to roll back to a checkpoint and
replay the operations you know about. That involves more than CPU core
state; it includes process, file and device state.

The Tandem lesson is that it is much easier when you involve the higher
level software in dealing with these issues. Building a system where you
can keep the application programmer ignorant of the need to deal with
failure is much harder than exposing units of work to the application
programmer, so you can just fail a node and replay the work somewhere
else (see the sketch after my signature). Transactions are your friend.

There is lots of literature on this stuff. My favourite is "Transaction
Processing: Concepts and Techniques" (Gray & Reuter), which has a bunch
of interesting material, including the underlying techniques. I can't
recall other references at the moment; they're on the bookshelf at home.

Regards,

Jan.

janm@transactionware.com
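
PS: Here is a minimal userland sketch in C of the "units of work plus
replay" idea above. It is purely illustrative, not FreeBSD kernel code,
and all the names and the simulated failure are made up: each item either
commits or stays uncommitted, so a surviving worker can replay it without
needing to know what the failed worker was in the middle of doing.

/*
 * Hypothetical sketch: idempotent units of work with explicit state.
 * If a worker "dies" mid-item, the item is rolled back to PENDING and
 * replayed by a survivor, rather than trying to repair partial work.
 */
#include <stdio.h>
#include <stdbool.h>

enum item_state { PENDING, IN_PROGRESS, DONE };

struct work_item {
    int             id;
    enum item_state state;
};

/* Run one unit of work; 'fail' simulates the worker dying before commit. */
static bool
run_item(struct work_item *it, bool fail)
{
    it->state = IN_PROGRESS;
    if (fail)
        return (false);     /* died before the commit point */
    it->state = DONE;       /* commit point: all-or-nothing */
    return (true);
}

int
main(void)
{
    struct work_item items[3] = { {1, PENDING}, {2, PENDING}, {3, PENDING} };
    int i;

    /* First pass: the worker handling item 2 "fails". */
    for (i = 0; i < 3; i++)
        if (!run_item(&items[i], i == 1))
            items[i].state = PENDING;   /* roll back, don't repair */

    /* Replay pass on a surviving worker: only uncommitted work is redone. */
    for (i = 0; i < 3; i++)
        if (items[i].state != DONE)
            run_item(&items[i], false);

    for (i = 0; i < 3; i++)
        printf("item %d: %s\n", items[i].id,
            items[i].state == DONE ? "done" : "not done");
    return (0);
}

The point of the sketch is only the shape of the recovery path: nothing
inspects the failed worker's state; the work item itself carries enough
state to be safely re-run, which is what the transactional approach buys
you.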