From owner-freebsd-hackers  Fri Feb  2 11:09:54 1996
Return-Path: owner-hackers
Received: (from root@localhost)
          by freefall.freebsd.org (8.7.3/8.7.3) id JAA01595
          for hackers-outgoing; Fri, 2 Feb 1996 09:07:40 -0800 (PST)
Received: from who.cdrom.com (who.cdrom.com [192.216.222.3])
          by freefall.freebsd.org (8.7.3/8.7.3) with SMTP id JAA01572
          for <freebsd-hackers@FreeBSD.org>; Fri, 2 Feb 1996 09:07:37 -0800 (PST)
Received: from irz301.inf.tu-dresden.de (irz301.inf.tu-dresden.de [141.76.1.11])
          by who.cdrom.com (8.6.12/8.6.11) with ESMTP id CAA05851
          for <freebsd-hackers@freebsd.org>; Fri, 2 Feb 1996 02:25:26 -0800
Received: from sax.sax.de by irz301.inf.tu-dresden.de (8.6.12/8.6.12-s1) with ESMTP id LAA23568 for <freebsd-hackers@freebsd.org>; Fri, 2 Feb 1996 11:21:37 +0100
Received: by sax.sax.de (8.6.11/8.6.12-s1) with UUCP
	id LAA11958 for freebsd-hackers@freebsd.org; Fri, 2 Feb 1996 11:21:36 +0100
Received: (from j@localhost) by uriah.heep.sax.de (8.7.3/8.6.9) id LAA04930 for freebsd-hackers@freebsd.org; Fri, 2 Feb 1996 11:20:25 +0100 (MET)
From: J Wunsch <j@uriah.heep.sax.de>
Message-Id: <199602021020.LAA04930@uriah.heep.sax.de>
Subject: Re: Watchdog timers
To: freebsd-hackers@freebsd.org (FreeBSD hackers)
Date: Fri, 2 Feb 1996 11:20:24 +0100 (MET)
Reply-To: joerg_wunsch@uriah.heep.sax.de (Joerg Wunsch)
In-Reply-To: <199602011921.LAA23294@bluewhale.emergent.com> from "Curt Mayer" at Feb 1, 96 11:21:04 am
X-Phone: +49-351-2012 669
X-Mailer: ELM [version 2.4 PL23]
MIME-Version: 1.0
Content-Type: text/plain; charset=ISO-8859-1
Content-Transfer-Encoding: 8bit
Sender: owner-hackers@freebsd.org
Precedence: bulk

As Curt Mayer wrote:
> 
> 
> hey, guys. here's a solution that smells much more like unix.
> have a daemon running on each node that is prone to hangup.
> this process wakes up every once in a while and does a system checkup.
> (stats things, pings places, looks at kernel statistics). when it sees
> that things are ok, it sends a datagram to a particular machine, 
> 
> this node, the monitor, has a table in memory of all recent datagrams
> from each node. when a node hasn't been heard from for a while, it
> tells a BSR x10 controller to cycle power on the hung node. DUH.

Idea stolen from Linux: create a /dev/watchdog for this purpose.  Once
a process holds it open, the kernel resets the CPU if it doesn't hear
from that process on the device within a certain time.
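
A minimal sketch of the user-space side, assuming the device treats
any write as a sign of life (which is how the Linux watchdog device
behaves).  The names watchdog_open, watchdog_pet and watchdog_run, and
the healthy() callback, are made up for illustration; they are not an
existing interface.

```c
#include <fcntl.h>
#include <stddef.h>
#include <unistd.h>

/* Open the watchdog device.  Assumption: holding it open arms the timer. */
int watchdog_open(const char *path)
{
    return open(path, O_WRONLY);
}

/* "Pet" the watchdog with a one-byte write.  Returns 0 on success, -1 on error. */
int watchdog_pet(int fd)
{
    const char beat = '\0';

    return write(fd, &beat, 1) == 1 ? 0 : -1;
}

/*
 * Heartbeat loop: up to `rounds` times, run the (hypothetical) health
 * check and pet the device if it passes, sleeping `interval` seconds
 * between rounds.  A NULL check means "always healthy".  Returns the
 * number of pets delivered, or -1 if a write failed.
 */
int watchdog_run(int fd, unsigned interval, int rounds, int (*healthy)(void))
{
    int pets = 0;

    for (int i = 0; i < rounds; i++) {
        if (healthy != NULL && !healthy())
            break;              /* stop petting and let the timer expire */
        if (watchdog_pet(fd) != 0)
            return -1;
        pets++;
        if (interval > 0)
            sleep(interval);
    }
    return pets;
}
```

A daemon would call watchdog_open("/dev/watchdog") once and then
watchdog_run() with an interval comfortably shorter than the kernel's
timeout.  If the daemon stops being scheduled, the writes stop too.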

The idea behind this is that most hung systems still have a running
asynchronous portion of the kernel, i.e. things like interrupt
handling continue to work, but process context switching hangs for
some reason (e.g. a hung SCSI bus).  Chances are good that the
kernel could still kill itself.
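
The kernel side could be sketched roughly like this: a countdown that
is decremented from the clock interrupt and reloaded whenever the
process touches the device.  This is a user-space model only, not
actual FreeBSD or Linux kernel code; struct watchdog, the wd_*
functions and the counting reset hook are all hypothetical stand-ins.

```c
#include <stdbool.h>

struct watchdog {
    bool armed;                 /* true while a process holds the device open */
    unsigned timeout_ticks;     /* reload value, in clock ticks */
    unsigned remaining;         /* ticks left before the reset fires */
    void (*reset)(void);        /* stand-in for the real CPU reset */
};

/* Stand-in reset hook that only counts calls, so the model can be tested. */
int wd_reset_count;
void wd_count_reset(void) { wd_reset_count++; }

/* Called when the device is opened. */
void wd_arm(struct watchdog *wd, unsigned timeout_ticks, void (*reset)(void))
{
    wd->armed = true;
    wd->timeout_ticks = timeout_ticks;
    wd->remaining = timeout_ticks;
    wd->reset = reset;
}

/* Called when the process touches the device: reload the countdown. */
void wd_pet(struct watchdog *wd)
{
    wd->remaining = wd->timeout_ticks;
}

/*
 * Called from the clock interrupt, so it keeps running even when
 * process scheduling is stuck.  Fires the reset once on expiry.
 */
void wd_tick(struct watchdog *wd)
{
    if (!wd->armed)
        return;
    if (wd->remaining > 0)
        wd->remaining--;
    if (wd->remaining == 0) {
        wd->armed = false;
        wd->reset();
    }
}
```

The point of the design is that wd_tick() depends only on interrupts
being serviced, while wd_pet() depends on a process getting CPU time,
which is exactly the part that stops in the kind of hang described.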

Not ideal, but also no cost.

-- 
cheers, J"org

joerg_wunsch@uriah.heep.sax.de -- http://www.sax.de/~joerg/ -- NIC: JW11-RIPE
Never trust an operating system you don't have sources for. ;-)