From owner-freebsd-hackers@FreeBSD.ORG  Thu Apr  2 23:44:44 2009
Return-Path: <owner-freebsd-hackers@FreeBSD.ORG>
Delivered-To: freebsd-hackers@freebsd.org
Received: from mx1.freebsd.org (mx1.freebsd.org [IPv6:2001:4f8:fff6::34])
	by hub.freebsd.org (Postfix) with ESMTP id A8A711065672
	for <freebsd-hackers@freebsd.org>; Thu,  2 Apr 2009 23:44:44 +0000 (UTC)
	(envelope-from ambrisko@ambrisko.com)
Received: from mail.ambrisko.com (mail.ambrisko.com [64.174.51.43])
	by mx1.freebsd.org (Postfix) with ESMTP id 7FF468FC0C
	for <freebsd-hackers@freebsd.org>; Thu,  2 Apr 2009 23:44:44 +0000 (UTC)
	(envelope-from ambrisko@ambrisko.com)
X-Ambrisko-Me: Yes
Received: from server2.ambrisko.com (HELO www.ambrisko.com) ([192.168.1.2])
	by ironport.ambrisko.com with ESMTP; 02 Apr 2009 16:17:27 -0700
Received: from ambrisko.com (localhost [127.0.0.1])
	by www.ambrisko.com (8.14.3/8.14.1) with ESMTP id n32NGY3g015341;
	Thu, 2 Apr 2009 16:16:34 -0700 (PDT)
	(envelope-from ambrisko@ambrisko.com)
Received: (from ambrisko@localhost)
	by ambrisko.com (8.14.3/8.14.3/Submit) id n32NGYWK015340;
	Thu, 2 Apr 2009 16:16:34 -0700 (PDT) (envelope-from ambrisko)
From: Doug Ambrisko <ambrisko@ambrisko.com>
Message-Id: <200904022316.n32NGYWK015340@ambrisko.com>
In-Reply-To: <49D4A16F.6020906@icyb.net.ua>
To: Andriy Gapon <avg@icyb.net.ua>
Date: Thu, 2 Apr 2009 16:16:34 -0700 (PDT)
X-Mailer: ELM [version 2.4ME+ PL94b (25)]
MIME-Version: 1.0
Content-Transfer-Encoding: 7bit
Content-Type: text/plain; charset=US-ASCII
Cc: freebsd-hackers@freebsd.org
Subject: Re: watchdog: hw+sw?
X-BeenThere: freebsd-hackers@freebsd.org
X-Mailman-Version: 2.1.5
Precedence: list
List-Id: Technical Discussions relating to FreeBSD
	<freebsd-hackers.freebsd.org>
List-Unsubscribe: <http://lists.freebsd.org/mailman/listinfo/freebsd-hackers>, 
	<mailto:freebsd-hackers-request@freebsd.org?subject=unsubscribe>
List-Archive: <http://lists.freebsd.org/pipermail/freebsd-hackers>
List-Post: <mailto:freebsd-hackers@freebsd.org>
List-Help: <mailto:freebsd-hackers-request@freebsd.org?subject=help>
List-Subscribe: <http://lists.freebsd.org/mailman/listinfo/freebsd-hackers>,
	<mailto:freebsd-hackers-request@freebsd.org?subject=subscribe>
X-List-Received-Date: Thu, 02 Apr 2009 23:44:45 -0000

Andriy Gapon writes:
| I have some vague thoughts on using SW_WATCHDOG and a hardware watchdog 
| together.
| I think this could be useful but I am not sure how to implement this.
| The idea is this: timeout for SW_WATCHDOG is smaller than timeout for hw 
| wd; when some freeze happens sw wd logic kicks in first, stops hw wd and 
| produces either panic or ddb prompt; if the freeze is so severe that sw 
| wd can't run (e.g. hardware is messed up badly) then hw wd performs its 
| duty. I am mostly interested in having this in unattended mode where kernel 
| dump could be useful for later analysis but the system should recover in 
| reasonable time.
| 
| Suggestions, opinions?

At a prior company I implemented a watchdog before watchdog(4) that did
this.  I used the HW watchdog to register with the SW watchdog.  Then
our SW watchdog was ticked via a sysctl count down.  This way we could
implement a fairly arbitrary range of time-outs since most HW is very
limited in the time duration and then we didn't really have to worry
about it.  If the SW watchdog didn't tick in 10 seconds or so then the
machine is probably dead.  So we used the HW watchdog to enforce the 
SW watchdog.  It's really nice getting the panic and dump.
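
To make the layering concrete, here is a rough user-space model of the
scheme, not the original code.  Every name in it (HardwareWatchdog,
SoftwareWatchdog, simulate) and the 60 second / 10 second time-outs are
made up for illustration.  The kernel-side tick re-arms the short HW
timer, userland resets the long SW count down, and disarming the HW
timer before the panic follows Andriy's proposal above.

```python
class HardwareWatchdog:
    """Hypothetical model of a HW timer with a short maximum time-out."""

    def __init__(self, max_timeout):
        self.max_timeout = max_timeout
        self.remaining = None          # None means disarmed

    def pat(self):
        self.remaining = self.max_timeout

    def disarm(self):
        self.remaining = None

    def tick(self):
        """Advance one second; True means the hardware resets the machine."""
        if self.remaining is None:
            return False
        self.remaining -= 1
        return self.remaining <= 0


class SoftwareWatchdog:
    """Hypothetical SW count down with an arbitrary (long) time-out."""

    def __init__(self, hw, timeout):
        self.hw = hw
        self.timeout = timeout
        self.countdown = timeout
        self.hw.pat()                  # arm the HW timer at start-up

    def pat(self):
        """Userland resets the count down (e.g. through a sysctl write)."""
        self.countdown = self.timeout

    def tick(self):
        """Runs once per second while the kernel is alive."""
        self.countdown -= 1
        if self.countdown <= 0:
            self.hw.disarm()           # stop the HW timer so the dump can finish
            return "panic"
        self.hw.pat()                  # kernel still ticking: re-arm the HW timer
        return "ok"


def simulate(seconds, userland_stops_at, kernel_stops_at,
             sw_timeout=60, hw_timeout=10):
    """Return (outcome, second) for a run in which userland and/or the
    kernel tick stop at the given times."""
    hw = HardwareWatchdog(hw_timeout)
    sw = SoftwareWatchdog(hw, sw_timeout)
    for now in range(1, seconds + 1):
        if hw.tick():
            return ("hw_reset", now)
        if now <= kernel_stops_at:
            if sw.tick() == "panic":
                return ("panic", now)
            if now <= userland_stops_at:
                sw.pat()
    return ("ok", seconds)
```

In this model, simulate(200, 30, 200) is the case where userland hangs
while the kernel keeps running, so the SW watchdog panics; and
simulate(200, 200, 30) is the case where the kernel tick itself stops,
so only the HW watchdog can recover the machine.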

This worked well for us so I think it is a good idea.  Also some HW 
watchdogs can be told to generate an NMI which can also produce a kernel 
dump/ddb prompt.  I've also implemented some rough code to put a 
simplified back-trace into the IPMI event log in case a disk or disk 
I/O sub-system died.

Doug A.