From owner-freebsd-hackers  Thu Feb  1 11:21:11 1996
Return-Path: owner-hackers
Received: (from root@localhost)
          by freefall.freebsd.org (8.7.3/8.7.3) id LAA21077
          for hackers-outgoing; Thu, 1 Feb 1996 11:21:11 -0800 (PST)
Received: from bluewhale.emergent.com (bluewhale.emergent.com [140.174.2.161])
          by freefall.freebsd.org (8.7.3/8.7.3) with SMTP id LAA21068
          for <freebsd-hackers@freebsd.org>; Thu, 1 Feb 1996 11:21:05 -0800 (PST)
Received: from localhost (localhost [127.0.0.1]) by bluewhale.emergent.com (8.6.11/8.6.12) with SMTP id LAA23294 for <freebsd-hackers@freebsd.org>; Thu, 1 Feb 1996 11:21:04 -0800
Message-Id: <199602011921.LAA23294@bluewhale.emergent.com>
X-Authentication-Warning: bluewhale.emergent.com: Host localhost didn't use HELO protocol
To: freebsd-hackers@freebsd.org
Subject: Re: Watchdog timers
Date: Thu, 01 Feb 1996 11:21:04 -0800
From: Curt Mayer <curt@emergent.com>
Sender: owner-hackers@freebsd.org
Precedence: bulk


hey, guys. here's a solution that smells much more like unix.
have a daemon running on each node that is prone to hanging.
this process wakes up every once in a while and does a system checkup
(stats things, pings places, looks at kernel statistics). when it sees
that things are ok, it sends a datagram to a particular machine.
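
a rough sketch of what the per-node daemon could look like, in python. the
monitor address, the 30 second interval, the load threshold and the names
system_ok / send_heartbeat / run_forever are all made up for illustration;
the real checkup would be whatever stats and pings make sense for the node.

```python
import os
import socket
import time

MONITOR_ADDR = ("127.0.0.1", 9999)  # hypothetical monitor host and port
CHECK_INTERVAL = 30                  # assumed seconds between checkups


def system_ok(max_load=8.0):
    """Stand-in checkup: stat the root filesystem and look at the load average."""
    try:
        os.stat("/")
        load1, _, _ = os.getloadavg()
    except OSError:
        return False
    return load1 < max_load


def send_heartbeat(sock, node_name, addr=MONITOR_ADDR):
    """Send one 'still alive' datagram whose payload is the node name."""
    sock.sendto(node_name.encode("utf-8"), addr)


def run_forever():
    """Daemon loop: check, report if healthy, sleep, repeat."""
    node_name = socket.gethostname()
    sock = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
    while True:
        if system_ok():
            send_heartbeat(sock, node_name)
        time.sleep(CHECK_INTERVAL)
```

note that the daemon only speaks up when the checkup passes. a sick node
and a dead node look the same to the monitor: silence.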

this node, the monitor, has a table in memory of all recent datagrams
from each node. when a node hasn't been heard from for a while, it
tells a BSR x10 controller to cycle power on the hung node. DUH.
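
the monitor side might be sketched like this, again in python. the timeout
value, the port and the Monitor / serve names are assumptions, and
power_cycle is a hypothetical callback standing in for whatever actually
drives the x10 controller (no real x10 interface is shown here).

```python
import socket
import time

TIMEOUT = 120  # assumed seconds of silence before a node counts as hung


class Monitor:
    def __init__(self, timeout=TIMEOUT, power_cycle=None, clock=time.monotonic):
        self.timeout = timeout
        # power_cycle is a placeholder hook; by default it just prints.
        self.power_cycle = power_cycle or (lambda node: print("power cycle", node))
        self.clock = clock
        self.last_seen = {}  # node name -> time of its latest datagram

    def heard_from(self, node):
        """Record that a datagram from this node just arrived."""
        self.last_seen[node] = self.clock()

    def sweep(self):
        """Power cycle every node that has been silent longer than the timeout."""
        now = self.clock()
        hung = [n for n, t in self.last_seen.items() if now - t > self.timeout]
        for node in hung:
            self.power_cycle(node)
            # restart the timer so the node has time to reboot
            self.last_seen[node] = now
        return hung


def serve(monitor, bind_addr=("0.0.0.0", 9999), sweep_every=10.0):
    """Receive datagrams and sweep the table at least every sweep_every seconds."""
    sock = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
    sock.bind(bind_addr)
    sock.settimeout(sweep_every)
    while True:
        try:
            data, _ = sock.recvfrom(256)
            monitor.heard_from(data.decode("utf-8", "replace"))
        except socket.timeout:
            pass
        monitor.sweep()
```

one limitation of this sketch: a node only enters the table once it has
been heard from, so a machine that never comes up at all is never cycled.
preloading the table with the known node names would cover that case.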

our ISP, tlg.net used to do routing and slip with sx-16's running NOS.
whenever a hang happened, tlg used to do a power cycle with X10's.

	curt