From: mdf@FreeBSD.org
Date: Tue, 14 Feb 2012 09:21:19 -0800
To: Julian Elischer
Cc: Maninya M, freebsd-hackers@freebsd.org
Subject: Re: OS support for fault tolerance

On Tue, Feb 14, 2012 at 8:57 AM, Julian Elischer wrote:
> On 2/14/12 6:23 AM, Maninya M wrote:
>>
>> For multicore desktop computers, suppose one of the cores fails, the
>> FreeBSD OS crashes. My question is about how I can make the OS tolerate
>> this hardware fault.
>> The strategy is to checkpoint the state of each core at specific intervals
>> of time in main memory. Once a core fails, its previous state is retrieved
>> from the main memory, and the processes that were running on it are
>> rescheduled on the remaining cores.
>>
>> I read that the OS tolerates faults in large servers. I need to make it do
>> this for a Desktop OS. I assume I would have to change the scheduler
>> program. I am using FreeBSD 9.0 on an Intel core i5 quad core machine.
>> How do I go about doing this? What exactly do I need to save for the
>> "state" of the core? What else do I need to know?
>> I have absolutely no experience with kernel programming or with FreeBSD.
>> Any pointers to good sources about modifying the source-code of FreeBSD
>> would be greatly appreciated.
>
> This question has always intrigued me, because I'm always amazed
> that people actually try.
> From my viewpoint, There's really not much you can do if the core
> that is currently holding the scheduler lock fails.

We did this at IBM after we'd done the dynamic logical partitioning.
Basically, there was a way to probe the CPU for the number of
correctable errors it was encountering. Above a certain threshold, it
was considered "faulty" and we offlined the CPU before it encountered an
uncorrectable error.

We did the same thing for memory, too (that one I was directly involved in).
The basic trouble, though, is that at least for memory, there didn't
seem to be a correlation between the rate of correctable ECC errors and
an uncorrectable error occurring.

> And what do you mean by 'fails'? do you run constant diagnostics?
> how do you tell when it is failed? It'd be hard to detect that 'multiply'
> has suddenly started giving bad results now and then.

I'd assume this is predicated on the hardware having some redundancy
and some way to query the error rate. I've done a little work with
memory ECC on the device driver end, and at least there the hardware
definitely reports correctable and uncorrectable ECC errors via some
registers. But I don't know if there's any way to query this for a CPU
(and of course each CPU would be different).

However, all that said, it's a moderately large project to get an OS
ready to handle things like holes appearing in its logical CPU ID space
(how do you serialize this when you want the common case to not take a
lock?), and to do all the wizardry of unscheduling (what do you do with
a bound thread?) and then actually shutting the CPU down via firmware so
it doesn't continue running.

I started working on this for Linux when I worked at IBM, somewhere
around 2004, and then IBM got sued by SCO so they pulled me off the
project. It was finished up by a colleague and friend.

You can probably get a first approximation by forcing e.g. the idle
thread to not get switched out when the CPU appears unstable. Then at
least it's running fewer instructions, and is less likely to generate a
machine check.

Cheers,
matthew
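
For readers who want something concrete, here is a minimal user-space sketch of the threshold policy described in the message: count correctable errors per CPU over a sliding window and mark the CPU offline once the count exceeds a limit. Everything in it is an assumption made for illustration: the class name `CorrectableErrorMonitor`, the `threshold` and `window_seconds` parameters, and the idea that error events arrive as timestamps. It models only the bookkeeping. It does not read any hardware registers, and it is not how IBM's implementation or any FreeBSD or Linux code works.

```python
from collections import deque


class CorrectableErrorMonitor:
    """Hypothetical policy sketch: offline a CPU whose correctable-error
    count within a sliding time window exceeds a threshold."""

    def __init__(self, threshold, window_seconds):
        self.threshold = threshold      # max tolerated errors per window
        self.window = window_seconds    # sliding window length in seconds
        self.events = {}                # cpu id -> deque of error timestamps
        self.offline = set()            # cpu ids already taken out of service

    def record(self, cpu, now):
        """Record one correctable error on `cpu` at time `now`.
        Return True if this event caused the CPU to be marked offline."""
        if cpu in self.offline:
            return False
        timestamps = self.events.setdefault(cpu, deque())
        timestamps.append(now)
        # Discard events that have aged out of the window.
        while timestamps and now - timestamps[0] > self.window:
            timestamps.popleft()
        if len(timestamps) > self.threshold:
            self.offline.add(cpu)
            return True
        return False

    def online_cpus(self, all_cpus):
        """Return the CPU ids still in service. The result may contain
        holes, e.g. [0, 1, 3] after CPU 2 has been offlined."""
        return [cpu for cpu in all_cpus if cpu not in self.offline]


# Usage: CPU 2 reports four errors within 60 seconds and is offlined.
monitor = CorrectableErrorMonitor(threshold=3, window_seconds=60)
for t in (0, 10, 20, 30):
    monitor.record(2, now=t)
remaining = monitor.online_cpus(range(4))
```

Note that `remaining` has a hole in the ID space, which is exactly the case the message says the rest of the OS has to be prepared for. The caveat from the message also applies: at least for memory, the correctable-error rate did not appear to predict uncorrectable errors, so a threshold like this is a heuristic at best.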