From: Andriy Gapon
Date: Tue, 07 Apr 2009 15:37:26 +0300
To: freebsd-hackers@FreeBSD.org
Subject: Re: watchdog: hw+sw?
Message-ID: <49DB4906.7080407@icyb.net.ua>
In-Reply-To: <49D4A16F.6020906@icyb.net.ua>

I've been thinking about this some more.

Clearly, the sw watchdog differs from all the hw watchdogs (that I know
about) in that it tries to take a debugging action as opposed to a
straightforward recovery action. As such, it currently doesn't make much
sense to mix sw and hw watchdogs: in the case of a problem they would fire
at close times, and a (typical) hw watchdog would override the sw watchdog.
This is fine as it is, though a small warning when such a mix is configured
would be nice too.

However, I think that it should be possible to use the sw watchdog as a
special "primary" watchdog and the hw watchdog(s) as "failsafe" watchdogs
for the primary one. I see two general approaches at the moment:

1. The hw watchdog has only a "slightly" longer timeout than the sw
   watchdog (by a configurable delta), and the watchdogs get patted at the
   same time; if the sw wd fires and is able to proceed, it first disables
   the hw watchdog(s) and then performs its duty (panic, ddb).

2. The hw watchdog has a "substantially" longer timeout than the sw
   watchdog (by a configurable delta), and the watchdogs get patted at the
   same time; if the sw wd fires, it has a limited amount of time to do its
   action before the hw wd fires too. In this case it would also be nice to
   have a short ddb command for stopping the hw watchdog.

Each approach has its own advantages and disadvantages.

The first approach guarantees that the sw wd would not be interrupted by
the hw wd. On the other hand, there is no protection from, e.g., a system
getting stuck during a dump. Also, hw watchdogs would have to provide a
method for "emergency stop" that is safe from locking issues.

The second approach is more robust. Its problems:
(a) it can interrupt the sw wd action too early;
(b) it wastes more time if the sw wd is not able to fire.

Since using sw and hw watchdogs together makes more sense in unattended
scenarios, I think that approach #2 may be better. IMO, attended scenarios
should use the sw wd exclusively.

-- 
Andriy Gapon
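
To make the timing in approach #2 concrete, here is a minimal sketch in
Python. It is only a model of the arithmetic, not kernel code: the class
name, field names and state strings are all hypothetical, and it assumes
both watchdogs are patted at the same instant and that the hw timeout is
simply the sw timeout plus the configurable delta.

```python
from dataclasses import dataclass


@dataclass
class WatchdogPair:
    """Hypothetical model of a sw "primary" and hw "failsafe" watchdog."""

    sw_timeout: float      # seconds after a pat until the sw watchdog fires
    delta: float           # extra seconds granted to the hw watchdog
    last_pat: float = 0.0  # time of the most recent pat (both watchdogs)

    def __post_init__(self) -> None:
        # The hw watchdog must fire strictly later than the sw watchdog.
        if self.sw_timeout <= 0 or self.delta <= 0:
            raise ValueError("sw_timeout and delta must be positive")

    @property
    def hw_timeout(self) -> float:
        # Assumption: hw timeout = sw timeout + delta.
        return self.sw_timeout + self.delta

    def pat(self, now: float) -> None:
        # Both watchdogs are patted at the same time.
        self.last_pat = now

    def state(self, now: float) -> str:
        elapsed = now - self.last_pat
        if elapsed >= self.hw_timeout:
            return "hw-reset"   # failsafe fired; any sw action is cut short
        if elapsed >= self.sw_timeout:
            return "sw-fired"   # sw action (panic, ddb) is running
        return "ok"

    def sw_action_budget(self) -> float:
        # Time the sw action has before the hw watchdog fires.
        return self.hw_timeout - self.sw_timeout


pair = WatchdogPair(sw_timeout=16.0, delta=64.0)
pair.pat(now=100.0)
print(pair.state(now=110.0), pair.state(now=120.0), pair.state(now=180.0))
```

The delta is both the budget for the sw action (problem (a) if it is too
small) and the extra time lost when the sw wd cannot fire at all (problem
(b) if it is too large), so in this model the two problems trade off
against each other through that single parameter.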