From owner-freebsd-current@FreeBSD.ORG  Thu Apr  8 15:22:10 2004
Date: Fri, 9 Apr 2004 08:22:00 +1000
From: Peter Jeremy <peterjeremy@optushome.com.au>
To: ticso@cicely.de
Message-ID: <20040408222200.GD6458@server.vk2pj.dyndns.org>
References: <Pine.NEB.3.96L.1040408001234.39416A-100000@fledge.watson.org>
	<20040408091030.GA6458@server.vk2pj.dyndns.org> <40751A74.50504@freebsd.org>
	<20040408114441.GB6458@server.vk2pj.dyndns.org>
	<20040408142742.GD5279@cicely12.cicely.de>
Mime-Version: 1.0
Content-Type: text/plain; charset=us-ascii
Content-Disposition: inline
In-Reply-To: <20040408142742.GD5279@cicely12.cicely.de>
User-Agent: Mutt/1.4.2.1i
cc: current@freebsd.org
Subject: Re: panic on one cpu leaves others running...

On Thu, Apr 08, 2004 at 04:27:43PM +0200, Bernd Walter wrote:
>On Thu, Apr 08, 2004 at 09:44:41PM +1000, Peter Jeremy wrote:
>> >  A panic usually means that
>> >something unrecoverable happened, and that continuing on is not safe.
>> 
>> I realise that.  Hence actually being able to continue after a panic
>> would be extremely difficult to do safely.  (Probably not possible in
>> general, though it might be in some special cases).
>
>If it's safe to continue then there's no need to panic at all.
>Just stopping the faulting parts would be enough in that case.

Except FreeBSD (and most Unices) don't do this in general.

I was thinking of hardware failures - if a CPU fails and it wasn't
holding any locks then it would seem feasible to just abort the
thread/process that was using the CPU and limp along on the remaining
CPU(s).
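
To make the CPU case concrete, here is a rough sketch of the decision
I have in mind.  Every name in it is hypothetical: none of these types
or functions exist in the FreeBSD kernel, and how the kernel would
learn what a dead CPU was doing is left open.

```c
#include <stdbool.h>

/* Hypothetical outcomes; these are not FreeBSD kernel interfaces. */
enum cpu_fault_action {
    CPU_FAULT_PANIC,        /* shared state may be inconsistent: stop */
    CPU_FAULT_OFFLINE_ONLY, /* CPU was idle: just take it out of service */
    CPU_FAULT_KILL_THREAD   /* abort the running thread, then offline the CPU */
};

/* Hypothetical snapshot of what the failed CPU was doing. */
struct failed_cpu_state {
    int  locks_held;      /* kernel locks owned by the CPU's current thread */
    bool was_idle;        /* CPU was running its idle thread */
    int  cpus_remaining;  /* healthy CPUs left once this one is removed */
};

/* Decide how to react to a failed CPU, following the reasoning above. */
enum cpu_fault_action
cpu_fault_decide(const struct failed_cpu_state *s)
{
    /* Nothing left to limp along on. */
    if (s->cpus_remaining < 1)
        return CPU_FAULT_PANIC;

    /* A held lock protects data that may be half-updated. */
    if (s->locks_held > 0)
        return CPU_FAULT_PANIC;

    if (s->was_idle)
        return CPU_FAULT_OFFLINE_ONLY;

    return CPU_FAULT_KILL_THREAD;
}
```

The only point of the sketch is that the "no locks held" test comes
before any attempt to carry on; anything else still ends in a panic.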

Likewise, an unrecoverable memory error in a clean page should (in most
cases) be survivable by marking that page unusable and loading another
copy of the data into a different page.  (Obviously this is problematic
if the page in question is part of the kernel VM subsystem or the
device driver for the relevant backing store.)  Even a dirty page may
be recoverable by aborting the affected process or by treating the
fault like an I/O error on a filesystem.
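
The memory case can be sketched the same way.  Again, this is only an
illustration of the policy described above; the structure, its fields
and the function are all made up for the example and do not correspond
to anything in the FreeBSD VM code.

```c
#include <stdbool.h>

/* Hypothetical outcomes; these are not FreeBSD kernel interfaces. */
enum mem_fault_action {
    MEM_FAULT_PANIC,        /* cannot recover without the damaged page */
    MEM_FAULT_RELOAD_PAGE,  /* retire the page, re-read a clean copy */
    MEM_FAULT_KILL_PROCESS  /* retire the page, abort the owning process
                               (or report it like an I/O error) */
};

/* Hypothetical description of the page that took the error. */
struct bad_page_info {
    bool dirty;               /* modified in memory; no other copy exists */
    bool needed_for_recovery; /* belongs to the kernel VM subsystem or the
                                 driver for the backing store */
    bool has_backing_copy;    /* contents can be re-read from backing store */
};

/* Decide how to react to an unrecoverable error in one page. */
enum mem_fault_action
mem_fault_decide(const struct bad_page_info *p)
{
    /* We would need the damaged page in order to repair the damage. */
    if (p->needed_for_recovery)
        return MEM_FAULT_PANIC;

    /* Clean page with a copy elsewhere: nothing is actually lost. */
    if (!p->dirty && p->has_backing_copy)
        return MEM_FAULT_RELOAD_PAGE;

    /* The data is gone, but only its owner needs to suffer. */
    return MEM_FAULT_KILL_PROCESS;
}
```

In both sketches the page (or CPU) is assumed to be taken out of
service whatever the outcome; the decision is only about how much
else has to go with it.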

The marketing spin from at least one vendor suggests that their
high-end systems can manage this sort of fault recovery.  I'm not sure
whether this is an area that FreeBSD should aspire to - I suspect that
the effort needed to implement and test this would not be justified by
the small size of the additional potential market.

Peter