From owner-freebsd-hackers@FreeBSD.ORG  Tue Apr  7 12:37:29 2009
Return-Path: <owner-freebsd-hackers@FreeBSD.ORG>
Delivered-To: freebsd-hackers@FreeBSD.org
Received: from mx1.freebsd.org (mx1.freebsd.org [IPv6:2001:4f8:fff6::34])
	by hub.freebsd.org (Postfix) with ESMTP id E8D2C1065674
	for <freebsd-hackers@FreeBSD.org>; Tue,  7 Apr 2009 12:37:29 +0000 (UTC)
	(envelope-from avg@icyb.net.ua)
Received: from citadel.icyb.net.ua (citadel.icyb.net.ua [212.40.38.140])
	by mx1.freebsd.org (Postfix) with ESMTP id 34DF88FC08
	for <freebsd-hackers@FreeBSD.org>; Tue,  7 Apr 2009 12:37:28 +0000 (UTC)
	(envelope-from avg@icyb.net.ua)
Received: from odyssey.starpoint.kiev.ua (alpha-e.starpoint.kiev.ua
	[212.40.38.101])
	by citadel.icyb.net.ua (8.8.8p3/ICyb-2.3exp) with ESMTP id PAA02716
	for <freebsd-hackers@FreeBSD.org>;
	Tue, 07 Apr 2009 15:37:27 +0300 (EEST)
	(envelope-from avg@icyb.net.ua)
Message-ID: <49DB4906.7080407@icyb.net.ua>
Date: Tue, 07 Apr 2009 15:37:26 +0300
From: Andriy Gapon <avg@icyb.net.ua>
User-Agent: Thunderbird 2.0.0.21 (X11/20090323)
MIME-Version: 1.0
To: freebsd-hackers@FreeBSD.org
References: <49D4A16F.6020906@icyb.net.ua>
In-Reply-To: <49D4A16F.6020906@icyb.net.ua>
X-Enigmail-Version: 0.95.7
Content-Type: text/plain; charset=UTF-8
Content-Transfer-Encoding: 7bit
Cc: 
Subject: Re: watchdog: hw+sw?
X-BeenThere: freebsd-hackers@freebsd.org
X-Mailman-Version: 2.1.5
Precedence: list
List-Id: Technical Discussions relating to FreeBSD
	<freebsd-hackers.freebsd.org>
List-Unsubscribe: <http://lists.freebsd.org/mailman/listinfo/freebsd-hackers>, 
	<mailto:freebsd-hackers-request@freebsd.org?subject=unsubscribe>
List-Archive: <http://lists.freebsd.org/pipermail/freebsd-hackers>
List-Post: <mailto:freebsd-hackers@freebsd.org>
List-Help: <mailto:freebsd-hackers-request@freebsd.org?subject=help>
List-Subscribe: <http://lists.freebsd.org/mailman/listinfo/freebsd-hackers>,
	<mailto:freebsd-hackers-request@freebsd.org?subject=subscribe>
X-List-Received-Date: Tue, 07 Apr 2009 12:37:30 -0000


I've been thinking about this some more. So, clearly, sw watchdog is different
from all the hw watchdogs (that I know about) in that it tries to take a debugging
action as opposed to a straightforward recovery action. As such it currently
doesn't make much sense to mix sw and hw watchdogs together, because in the case
of a problem they would fire at nearly the same time and a (typical) hw watchdog
would override the sw watchdog.
This is fine as it is, though a small warning in the case of such a mix would be
nice too.

However, I think that it should be possible to use sw watchdog as a special
"primary" watchdog and hw watchdog(s) as "failsafe" watchdogs for the primary one.
I see two general approaches at the moment:
1. hw watchdog has only a "slightly" longer timeout than the sw watchdog (by a
configurable delta), the watchdogs get patted at the same time; if the sw wd
fires and is able to proceed, it first disables hw watchdog(s) and then performs
its duty (panic, ddb);
2. hw watchdog has a "substantially" longer timeout than the sw watchdog (by a
configurable delta), the watchdogs get patted at the same time; if the sw wd
fires it has a limited amount of time to do its action before the hw wd fires too;
in this case it would also be nice to have a short ddb command for stopping hw
watchdog.
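
To make the difference between the two approaches concrete, here is a rough
userland model of when each watchdog would fire after the last pat. It is only
a sketch: every name in it (wd_policy, wd_config, wd_simulate and so on) is
made up for illustration and is not part of watchdog(9) or any existing
driver, and times are plain seconds rather than the kernel's timeout encoding.

```c
#include <stdbool.h>

/* Hypothetical names throughout; none of this is an existing FreeBSD API. */
#define WD_NEVER	(-1)

enum wd_policy {
	WD_DISARM_HW_ON_FIRE,	/* approach 1: sw wd stops the hw wd first */
	WD_HW_HARD_DEADLINE	/* approach 2: hw wd keeps running */
};

struct wd_config {
	enum wd_policy policy;
	int sw_timeout;		/* seconds */
	int delta;		/* extra seconds given to the hw wd */
};

struct wd_outcome {
	int sw_fire_at;		/* absolute time, or WD_NEVER */
	int hw_fire_at;		/* absolute time, or WD_NEVER */
};

/* Both approaches derive the hw timeout from the sw one plus a delta. */
static int
wd_hw_timeout(const struct wd_config *c)
{
	return (c->sw_timeout + c->delta);
}

/*
 * Model what happens if patting stops at time last_pat.
 * sw_can_run says whether the system is still healthy enough for the
 * sw watchdog to fire and proceed at all.
 */
static struct wd_outcome
wd_simulate(const struct wd_config *c, int last_pat, bool sw_can_run)
{
	struct wd_outcome o;

	o.sw_fire_at = sw_can_run ? last_pat + c->sw_timeout : WD_NEVER;
	o.hw_fire_at = last_pat + wd_hw_timeout(c);

	/* Approach 1: a sw wd that manages to fire disarms the hw wd. */
	if (c->policy == WD_DISARM_HW_ON_FIRE && sw_can_run)
		o.hw_fire_at = WD_NEVER;
	return (o);
}
```

With sw_timeout = 30 and delta = 2 under approach 1, a sw wd that can run
fires 30 seconds after the last pat and the hw wd never fires; if the sw wd
cannot run, the hw wd resets the box 32 seconds after the last pat. Under
approach 2 with delta = 120 the hw wd fires 150 seconds after the last pat in
both cases.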

Each approach has its own advantages and disadvantages.
The first approach guarantees that sw wd would not be interrupted by hw wd. On the
other hand, there is no protection e.g. from a system getting stuck during a dump.
Also, hw watchdogs would have to provide a method for "emergency stop" that should
be safe from locking issues.
The second approach is more robust. Its problems: (a) it can interrupt sw wd
action too early; (b) it wastes more time if sw wd is not able to fire.
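
Problems (a) and (b) of the second approach come down to simple arithmetic on
the delta. The helpers below are hypothetical and only illustrate that
arithmetic; the action duration (e.g. how long a dump takes) is an estimate
the administrator would have to supply.

```c
#include <stdbool.h>

/* Hypothetical helpers, not an existing API. All values are in seconds. */

/* (a) The sw wd action completes only if it fits within the delta. */
static bool
wd_action_fits(int delta, int action_secs)
{
	return (action_secs <= delta);
}

/*
 * (b) If the sw wd cannot fire at all, recovery waits for the hw wd,
 * i.e. the sw timeout plus the whole delta after the last pat.
 */
static int
wd_recovery_delay(int sw_timeout, int delta)
{
	return (sw_timeout + delta);
}
```

For example, with a 120 second delta a dump estimated at 90 seconds fits,
while one estimated at 300 seconds would be cut short; and with a 30 second
sw timeout a completely hung system is reset 150 seconds after the last pat
instead of roughly 30.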

Since using sw and hw watchdogs together makes more sense in unattended scenarios,
I think that approach #2 may be better. IMO, attended scenarios should use sw wd
exclusively.

-- 
Andriy Gapon