From: Jan Mikkelsen
Date: Wed, 15 Feb 2012 10:51:28 +1100
To: Julian Elischer
Cc: Maninya M, freebsd-hackers@freebsd.org
Subject: Re: OS support for fault tolerance

On 15/02/2012, at 3:57 AM, Julian Elischer wrote:

> On 2/14/12 6:23 AM, Maninya M wrote:
>> On a multicore desktop computer, if one of the cores fails, the
>> FreeBSD OS crashes. My question is about how I can make the OS tolerate
>> this hardware fault.
>> The strategy is to checkpoint the state of each core at specific intervals
>> of time in main memory. Once a core fails, its previous state is retrieved
>> from main memory, and the processes that were running on it are
>> rescheduled on the remaining cores.
>>
>> I read that the OS tolerates faults in large servers. I need to make it do
>> this for a desktop OS. I assume I would have to change the scheduler
>> program. I am using FreeBSD 9.0 on an Intel Core i5 quad-core machine.
>> How do I go about doing this? What exactly do I need to save for the
>> "state" of the core? What else do I need to know?
>> I have absolutely no experience with kernel programming or with FreeBSD.
>> Any pointers to good sources about modifying the source code of FreeBSD
>> would be greatly appreciated.
>
> This question has always intrigued me, because I'm always amazed
> that people actually try.
> From my viewpoint, there's really not much you can do if the core
> that is currently holding the scheduler lock fails.
> And what do you mean by "fails"? Do you run constant diagnostics?
> How do you tell when it has failed? It'd be hard to detect that 'multiply'
> has suddenly started giving bad results now and then.
>
> If it just "stops" then you might be able to have a watchdog that
> notices, but what do you do when it was halfway through rearranging
> a list of items? First, you have to find out that it held
> the lock for the module, and then you have to find out what it had
> done and clean up the mess.
>
> This requires rewriting many, many parts of the kernel to remove
> "transient inconsistent states", and even then, what do you do if it
> was halfway through manipulating some hardware?
>
> And when you've figured that all out, how do you cope with the
> mess it made because it was dying?
> Say, for example, it had started calculating bad memory offsets
> before writing out some stuff and written data out over random memory?
>
> But I'm interested in any answers people may have.

Back in the '90s I spent a bunch of time looking at and using systems
that dealt with this kind of failure.

There are two basic approaches: with software support and without. The
basic distinction is what the hardware can do when something breaks. Is
it able to continue, or must it stop immediately?

Tandem had systems with both approaches:

The NonStop proprietary operating system had nodes with lock-step
processors and lots of error checking that would stop immediately when
something broke. A CPU failure turned into a node halt. There was a
bunch of work to have nodes move their state around so that terminal
sessions would not be interrupted, transactions would be rolled back,
and everything would be left in a consistent state.

The Integrity Unix range was based on MIPS RISC/os, with a lot of work
done at Tandem. We had the R2000- and later the R3000-based systems. They
had three CPUs all in lock step with voting ("triple modular redundancy"),
and entirely duplicated memory, all with ECC. Redundant busses, separate
cabinets for controllers and separate cabinets for each side of the disk
mirror. You could pull out a CPU board and a memory board, show a manager,
and then plug them back in.

Tandem claimed to have removed 80% of panics from the kernel, and
changed the device driver architecture so that they could recover from
some driver faults by reinitialising driver state on a running system.

We still had some outages on this system, all caused by software. It was
also expensive: AUD$1,000,000 for a system with the same underlying
CPU/memory as a $30k MIPS workstation at the time. It was also slower
because of the error-checking overhead. However, it did crash much less
than the MIPS boxes.

Coming back to the multicore issue:

The problem when a core fails is that it has affected more than its own
state. It will be holding locks on shared resources and may have
corrupted shared memory or asked a device to do the wrong thing. By the
time you detect a fault in a core, it is too late. Checkpointing to main
memory means that you need to be able to roll back to a checkpoint and
replay the operations you know about. That involves more than CPU core
state; it includes process, file and device state.

The Tandem lesson is that it is much easier when you involve the higher
level software in dealing with these issues. Building a system where you
can keep the application programmer ignorant of the need to deal with
failure is much harder than exposing units of work to the application
programmer, so you can just fail a node and replay the work somewhere
else (see the sketch after my signature). Transactions are your friend.

There is lots of literature on this stuff. My favourite is "Transaction
Processing: Concepts and Techniques" (Gray & Reuter), which has a bunch
of interesting material, including the underlying techniques. I can't
recall other references at the moment; they're on the bookshelf at home.

Regards,

Jan.

janm@transactionware.com
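
PS: Here is a minimal userland sketch in C of the "units of work plus
replay" idea above. It is purely illustrative, not FreeBSD kernel code,
and all the names and the simulated failure are made up: each item either
commits or stays uncommitted, so a surviving worker can replay it without
needing to know what the failed worker was in the middle of doing.

/*
 * Hypothetical sketch: idempotent units of work with explicit state.
 * If a worker "dies" mid-item, the item is rolled back to PENDING and
 * replayed by a survivor, rather than trying to repair partial work.
 */
#include <stdio.h>
#include <stdbool.h>

enum item_state { PENDING, IN_PROGRESS, DONE };

struct work_item {
    int             id;
    enum item_state state;
};

/* Run one unit of work; 'fail' simulates the worker dying before commit. */
static bool
run_item(struct work_item *it, bool fail)
{
    it->state = IN_PROGRESS;
    if (fail)
        return (false);     /* died before the commit point */
    it->state = DONE;       /* commit point: all-or-nothing */
    return (true);
}

int
main(void)
{
    struct work_item items[3] = { {1, PENDING}, {2, PENDING}, {3, PENDING} };
    int i;

    /* First pass: the worker handling item 2 "fails". */
    for (i = 0; i < 3; i++)
        if (!run_item(&items[i], i == 1))
            items[i].state = PENDING;   /* roll back, don't repair */

    /* Replay pass on a surviving worker: only uncommitted work is redone. */
    for (i = 0; i < 3; i++)
        if (items[i].state != DONE)
            run_item(&items[i], false);

    for (i = 0; i < 3; i++)
        printf("item %d: %s\n", items[i].id,
            items[i].state == DONE ? "done" : "not done");
    return (0);
}

The point of the sketch is only the shape of the recovery path: nothing
inspects the failed worker's state; the work item itself carries enough
state to be safely re-run, which is what the transactional approach buys
you.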