From: Doug Ambrisko
To: Andriy Gapon
Cc: freebsd-hackers@freebsd.org
Date: Thu, 2 Apr 2009 16:16:34 -0700 (PDT)
Subject: Re: watchdog: hw+sw?
Message-Id: <200904022316.n32NGYWK015340@ambrisko.com>
In-Reply-To: <49D4A16F.6020906@icyb.net.ua>

Andriy Gapon writes:
| I have some vague thoughts on using SW_WATCHDOG and a hardware watchdog
| together.
| I think this could be useful but I am not sure how to implement this.
| The idea is this: timeout for SW_WATCHDOG is smaller than timeout for hw
| wd; when some freeze happens sw wd logic kicks in first, stops hw wd and
| produces either panic or ddb prompt; if the freeze is so severe that sw
| wd can't run (e.g. hardware is messed up badly) then hw wd performs its
| duty. I am mostly interested in having this in unattended mode where kernel
| dump could be useful for later analysis but the system should recover in
| reasonable time.
|
| Suggestions, opinions?

At a prior company I implemented a watchdog, before watchdog(4), that did
this. I used the HW watchdog to register with the SW watchdog. Then our
SW watchdog was ticked via a sysctl countdown. This way we could implement
a fairly arbitrary range of time-outs, since most HW is very limited in its
time duration, and then we didn't really have to worry about it. If the SW
watchdog didn't tick in 10 seconds or so then the machine is probably dead.
So we used the HW watchdog to enforce the SW watchdog. It's really nice
getting the panic and dump.

This worked well for us so I think it is a good idea. Also, some HW
watchdogs can be told to generate an NMI, which can also produce a kernel
dump/ddb prompt. I've also implemented some rough code to put a simplified
back-trace into the IPMI event log in case a disk or disk I/O sub-system
died.

Doug A.
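
The two-tier arrangement described in this message can be illustrated with a small simulation. This is only a sketch of one reading of the scheme, not the code that was actually used: the class name, the method names, the event strings, and the 300-second software time-out are all hypothetical, and only the "10 seconds or so" hardware window comes from the message. It models a long software countdown that userland reloads (the sysctl tick), and a short hardware timer that the kernel-side software watchdog keeps patting for as long as it can run.

```python
class TwoTierWatchdog:
    """Toy model (hypothetical names): long SW countdown enforced by a short HW timer."""

    def __init__(self, sw_timeout, hw_timeout=10):
        self.sw_timeout = sw_timeout   # seconds userland may go without ticking (assumed value)
        self.hw_timeout = hw_timeout   # seconds the HW timer survives without a pat
        self.sw_left = sw_timeout
        self.hw_left = hw_timeout
        self.fired = None              # None, "panic+dump", or "hw-reset"

    def tick(self):
        """Userland reloads the software countdown (stands in for the sysctl write)."""
        self.sw_left = self.sw_timeout

    def second(self, kernel_alive=True):
        """Advance simulated time by one second and return the current outcome."""
        if self.fired is not None:
            return self.fired          # a watchdog already fired; nothing more happens
        if kernel_alive:
            # The SW watchdog still runs: it pats the hardware and counts down.
            self.hw_left = self.hw_timeout
            self.sw_left -= 1
            if self.sw_left <= 0:
                self.fired = "panic+dump"   # SW tier expires first: panic and dump
        else:
            # The kernel is wedged: nobody pats the hardware timer.
            self.hw_left -= 1
            if self.hw_left <= 0:
                self.fired = "hw-reset"     # HW tier is the last resort
        return self.fired


def run(wd, seconds, tick_every=None, kernel_alive=True):
    """Drive the model for a number of seconds, optionally ticking periodically."""
    for t in range(seconds):
        if tick_every is not None and t % tick_every == 0:
            wd.tick()
        wd.second(kernel_alive=kernel_alive)
    return wd.fired
```

In this model a hung userland process with a healthy kernel ends in a panic and dump once the software countdown runs out, while a kernel too wedged to run the software watchdog is reset by the hardware after its short window. That is also why the software time-out can be made much longer than the hardware supports.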