From owner-freebsd-hackers  Sat Feb  3 08:43:48 1996
Return-Path: owner-hackers
Received: (from root@localhost)
          by freefall.freebsd.org (8.7.3/8.7.3) id IAA09932
          for hackers-outgoing; Sat, 3 Feb 1996 08:43:48 -0800 (PST)
Received: from etinc.com (et-gw.etinc.com [165.254.13.209])
          by freefall.freebsd.org (8.7.3/8.7.3) with SMTP id IAA09927
          for <hackers@freebsd.org>; Sat, 3 Feb 1996 08:43:46 -0800 (PST)
Received: from dialup-usr11.etinc.com (dialup-usr11.etinc.com [204.141.95.132]) by etinc.com (8.6.12/8.6.9) with SMTP id LAA25204 for <hackers@freebsd.org>; Sat, 3 Feb 1996 11:43:44 -0500
Date: Sat, 3 Feb 1996 11:43:44 -0500
Message-Id: <199602031643.LAA25204@etinc.com>
X-Sender: dennis@etinc.com
X-Mailer: Windows Eudora Version 2.0.3
Mime-Version: 1.0
Content-Type: text/plain; charset="us-ascii"
To: hackers@freebsd.org
From: dennis@etinc.com (dennis)
Subject: Re: Watchdog timers
Sender: owner-hackers@freebsd.org
Precedence: bulk

>As Curt Mayer wrote:
>> 
>> 
>> hey, guys. here's a solution that smells much more like unix.
>> have a daemon running on each node that is prone to hangup.
>> this process wakes up every once in a while and does a system checkup.
>> (stats things, pings places, looks at kernel statistics). when it sees
>> that things are ok, it sends a datagram to a particular machine.
>> 
>> this node, the monitor, has a table in memory of all recent datagrams
>> from each node. when a node hasn't been heard from for a while, it
>> tells a BSR x10 controller to cycle power on the hung node. DUH.
>
>Idea stolen from Linux: create a /dev/watchdog for this purpose.  Once
>it is held open by a process, the kernel resets the CPU if it doesn't
>get a response on a device after a certain time.
>
>The idea behind this is that most hanging systems still have a
>running async portion of the kernel, i.e. things like interrupt
>handling continue to work, but the process context switching hangs for
>some reason (e.g. SCSI bus hangs etc.).  The chances are good that the
>kernel could still kill itself.
>
>Not ideal, but also no cost.
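
A rough sketch of the monitor side of the scheme quoted above: a table of
the most recent datagram time per node, swept periodically, with a power
cycle for any node that has been silent too long. The class name, the
timeout value, and the power_cycle callback are all made up for
illustration; how the callback would actually drive an X10 controller is
not covered, and timestamps are passed in so the logic can be exercised
without a network.

```python
import time


class HeartbeatMonitor:
    """Tracks the last time each node reported in (hypothetical sketch)."""

    def __init__(self, timeout_s, power_cycle):
        self.timeout_s = timeout_s      # silence allowed before acting
        self.power_cycle = power_cycle  # assumed callback, e.g. drives an X10 controller
        self.last_seen = {}             # node name -> time of last datagram

    def record(self, node, now=None):
        """Note a heartbeat datagram from `node`."""
        self.last_seen[node] = time.time() if now is None else now

    def sweep(self, now=None):
        """Power-cycle every node silent for longer than the timeout.

        Returns the names of the nodes that were cycled, sorted.
        """
        now = time.time() if now is None else now
        hung = sorted(
            node for node, seen in self.last_seen.items()
            if now - seen > self.timeout_s
        )
        for node in hung:
            self.power_cycle(node)
            # Restart the clock so the node gets a full timeout to boot
            # before it can be cycled again.
            self.last_seen[node] = now
        return hung
```

In use, the datagram receive loop would call record() for each packet and
a timer would call sweep() every few seconds. Restarting the clock after a
power cycle is a design choice, not something the quoted text specifies.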

Unfortunately, in LINUX most of the hangs seem to be due to
interrupt hangs, in which case a watchdog that lives in the kernel
hangs along with everything else. It's also nice to be able to
customize the criteria for reboot. For example, we had someone with an
HDD controller that failed occasionally (it didn't actually hang the
system), so they ran sanity tests on it and rebooted when it failed
(which is really a demand reset rather than a watchdog function).
We've found that most of the people who want WDTs have machines that
don't reboot reliably for one reason or another, or that require a hard
reset, particularly people with remote systems who don't want to take
a chance on a soft reset.
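
As a minimal sketch of the "customizable criteria" idea above: a set of
named sanity checks is run, and a reset is demanded if any of them fails.
The function names, the shape of the checks table, and the reset callback
are assumptions for illustration only; what the callback does (trip a
hardware WDT, toggle a relay, and so on) is left open.

```python
def run_checks(checks):
    """Run each named sanity check; return the names of those that failed.

    `checks` maps a name to a zero-argument callable returning True when
    the thing it tests is healthy.
    """
    failed = []
    for name, check in checks.items():
        try:
            ok = bool(check())
        except Exception:
            ok = False  # a check that raises counts as a failure
        if not ok:
            failed.append(name)
    return failed


def demand_reset_if_needed(checks, reset):
    """Call `reset` with the failed check names if any check failed.

    `reset` is a hypothetical callback standing in for whatever performs
    the hard reset. Returns the list of failed check names.
    """
    failed = run_checks(checks)
    if failed:
        reset(failed)
    return failed
```

A daemon would call demand_reset_if_needed() on a timer, with one entry in
the table per device or service whose failure should force a reboot.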

dennis
----------------------------------------------------------------------------
Emerging Technologies, Inc.      http://www.etinc.com

Synchronous PC Cards and Routers For Discriminating
Tastes. 56k to T1 and beyond. Frame Relay, PPP, HDLC, 
and X.25 for BSD/OS, FreeBSD and LINUX.