Date: Tue, 14 Feb 2012 12:05:34 -0500
From: Jason Hellenthal <jhell@DataIX.net>
To: Julian Elischer <julian@freebsd.org>
Cc: Maninya M <maninya@gmail.com>, freebsd-hackers@freebsd.org
Subject: Re: OS support for fault tolerance
Message-ID: <20120214170533.GA35819@DataIX.net>
In-Reply-To: <4F3A9266.9050905@freebsd.org>
References: <CAC46K3mc=V=oBOQnvEp9iMTyNXKD1Ki_%2BD0Akm8PM7rdJwDF8g@mail.gmail.com> <4F3A9266.9050905@freebsd.org>

On Tue, Feb 14, 2012 at 08:57:10AM -0800, Julian Elischer wrote:
> On 2/14/12 6:23 AM, Maninya M wrote:
> > For multicore desktop computers, suppose one of the cores fails, the
> > FreeBSD OS crashes. My question is about how I can make the OS tolerate
> > this hardware fault.
> > The strategy is to checkpoint the state of each core at specific intervals
> > of time in main memory. Once a core fails, its previous state is retrieved
> > from the main memory, and the processes that were running on it are
> > rescheduled on the remaining cores.
> >
> > I read that the OS tolerates faults in large servers. I need to make it do
> > this for a Desktop OS. I assume I would have to change the scheduler
> > program. I am using FreeBSD 9.0 on an Intel core i5 quad core machine.
> > How do I go about doing this? What exactly do I need to save for the
> > "state" of the core? What else do I need to know?
> > I have absolutely no experience with kernel programming or with FreeBSD.
> > Any pointers to good sources about modifying the source-code of FreeBSD
> > would be greatly appreciated.
> This question has always intrigued me, because I'm always amazed
> that people actually try.
> From my viewpoint, there's really not much you can do if the core
> that is currently holding the scheduler lock fails.
> And what do you mean by "fails"? Do you run constant diagnostics?
> How do you tell when it has failed? It'd be hard to detect that 'multiply'
> has suddenly started giving bad results now and then.
>
> If it just "stops" then you might be able to have a watchdog that
> notices, but what do you do when it was halfway through rearranging
> a list of items? First, you have to find out that it held
> the lock for the module and then you have to find out what it had
> done and clean up the mess.
>
> This requires rewriting many, many parts of the kernel to remove
> "transient inconsistent states". And even then, what do you do if it
> was halfway through manipulating some hardware?
>
> And when you've figured that all out, how do you cope with the
> mess it made because it was dying?
> Say, for example, it had started calculating bad memory offsets
> before writing out some stuff and written data out over random memory?
>
> But I'm interested in any answers people may have.
>

How about core redundancy? Effectively this would cut the number of
available cores in half if you spread a process to run on two cores at
the same time, but with an option to adjust this per process, etc.

I don't see it as infeasible.

-- 
;s =;
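
The core-redundancy idea above can be illustrated with a small user-space
sketch: run the same deterministic piece of work twice in parallel and
accept the result only if both copies agree. This is only an illustration
under stated assumptions, not anything FreeBSD provides. The names
run_redundant, ReplicaMismatch and checksum are hypothetical, threads stand
in for cores, and nothing here pins a replica to a particular CPU.

```python
from concurrent.futures import ThreadPoolExecutor


class ReplicaMismatch(Exception):
    """Raised when the replicas disagree, i.e. a fault was detected."""


def run_redundant(func, *args, replicas=2):
    """Run func(*args) `replicas` times in parallel and compare the results.

    Hypothetical helper: threads stand in for cores, and no replica is
    pinned to a specific CPU.
    """
    with ThreadPoolExecutor(max_workers=replicas) as pool:
        futures = [pool.submit(func, *args) for _ in range(replicas)]
        results = [f.result() for f in futures]

    # Accept the result only if every replica produced the same value.
    first = results[0]
    if any(r != first for r in results[1:]):
        raise ReplicaMismatch(f"replicas disagree: {results!r}")
    return first


def checksum(data):
    """Deterministic example workload (hypothetical)."""
    return sum(data) % 65521
```

With two replicas a disagreement can be detected, but there is no way to
tell which copy is wrong; picking a winner by majority would need at least
three. The sketch also covers only a self-contained computation. It does
not touch the harder cases raised in the quoted message, such as a core
that fails while holding a kernel lock or while halfway through
manipulating hardware.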
Want to link to this message? Use this URL: <https://mail-archive.FreeBSD.org/cgi/mid.cgi?20120214170533.GA35819>