From owner-freebsd-hackers@FreeBSD.ORG  Fri Apr  3 14:19:36 2009
Return-Path: <owner-freebsd-hackers@FreeBSD.ORG>
Delivered-To: freebsd-hackers@freebsd.org
Received: from mx1.freebsd.org (mx1.freebsd.org [IPv6:2001:4f8:fff6::34])
	by hub.freebsd.org (Postfix) with ESMTP id DC7871065672
	for <freebsd-hackers@freebsd.org>; Fri,  3 Apr 2009 14:19:36 +0000 (UTC)
	(envelope-from ambrisko@ambrisko.com)
Received: from mail.ambrisko.com (mail.ambrisko.com [64.174.51.43])
	by mx1.freebsd.org (Postfix) with ESMTP id AE3378FC1D
	for <freebsd-hackers@freebsd.org>; Fri,  3 Apr 2009 14:19:36 +0000 (UTC)
	(envelope-from ambrisko@ambrisko.com)
X-Ambrisko-Me: Yes
Received: from server2.ambrisko.com (HELO www.ambrisko.com) ([192.168.1.2])
	by ironport.ambrisko.com with ESMTP; 03 Apr 2009 07:20:29 -0700
Received: from ambrisko.com (localhost [127.0.0.1])
	by www.ambrisko.com (8.14.3/8.14.1) with ESMTP id n33EJZCA069856;
	Fri, 3 Apr 2009 07:19:35 -0700 (PDT)
	(envelope-from ambrisko@ambrisko.com)
Received: (from ambrisko@localhost)
	by ambrisko.com (8.14.3/8.14.3/Submit) id n33EJYb8069855;
	Fri, 3 Apr 2009 07:19:34 -0700 (PDT) (envelope-from ambrisko)
From: Doug Ambrisko <ambrisko@ambrisko.com>
Message-Id: <200904031419.n33EJYb8069855@ambrisko.com>
In-Reply-To: <20090403084601.108111xg6o3b49ms@webmail.leidinger.net>
To: Alexander Leidinger <Alexander@leidinger.net>
Date: Fri, 3 Apr 2009 07:19:34 -0700 (PDT)
X-Mailer: ELM [version 2.4ME+ PL94b (25)]
MIME-Version: 1.0
Content-Transfer-Encoding: 7bit
Content-Type: text/plain; charset=US-ASCII
Cc: freebsd-hackers@freebsd.org, Andriy Gapon <avg@icyb.net.ua>
Subject: Re: watchdog: hw+sw?
X-BeenThere: freebsd-hackers@freebsd.org
X-Mailman-Version: 2.1.5
Precedence: list
List-Id: Technical Discussions relating to FreeBSD
	<freebsd-hackers.freebsd.org>
List-Unsubscribe: <http://lists.freebsd.org/mailman/listinfo/freebsd-hackers>, 
	<mailto:freebsd-hackers-request@freebsd.org?subject=unsubscribe>
List-Archive: <http://lists.freebsd.org/pipermail/freebsd-hackers>
List-Post: <mailto:freebsd-hackers@freebsd.org>
List-Help: <mailto:freebsd-hackers-request@freebsd.org?subject=help>
List-Subscribe: <http://lists.freebsd.org/mailman/listinfo/freebsd-hackers>,
	<mailto:freebsd-hackers-request@freebsd.org?subject=subscribe>
X-List-Received-Date: Fri, 03 Apr 2009 14:19:37 -0000

Alexander Leidinger writes:
| Quoting Doug Ambrisko <ambrisko@ambrisko.com> (from Thu, 2 Apr 2009  
| 16:16:34 -0700 (PDT)):
| 
| > This worked well for us so I think it is a good idea.  Also some HW
| > watchdogs can be told to generate an NMI which can also produce a kernel
| > dump/ddb prompt.  I've also implemented some rough code to put a
| > simplified back-trace into the IPMI event log in case a disk or disk
| > I/O sub-system died.
| 
| Somewhat related... I have 2 32bit systems with zfs which lock up  
| after a while. The lockup is strictly related to the disks. I can  
| still ping the system just fine, and the HW watchdog seems to still  
| work as intended (or it does not work at all anymore, as there's no  
| automatic reset), but as soon as I want to do something which involves  
| disks (access a webpage located on the zfs disks), I'm lost. The only  
| way to get some useful work done again is to reset manually. Your  
| paragraph above implies that the WD notices that there's a problem  
| with disks.

Yep, isn't that fun :-(
 
| While I know how to teach our watchdogd how to detect this (-e  
| option), we do not have support for this in the base system yet. Do you  
| have a patch for /etc/rc.d/watchdogd which allows specifying commands  
| to run via rc.conf, or some patch which tells watchdogd to check a file?

We start watchdogd manually with our own rc.d script, mainly because
I noticed that Dell PE2650s produce false triggers :-(  I also wanted to
check that our app is functioning, so watchdogd has to start after it.
It would be good to add a flags option to the stock start-up scripts.
Just having watchdogd running without checking on anything real tends
to be useless, since it usually stays swapped in and can run just fine
without depending on much of the system.
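
To illustrate what "checking on anything real" could look like, here is a
minimal sketch of a check command of the kind watchdogd's -e option can
run. It is not the script used in this thread. It assumes Python is
installed (it is not part of the base system), and the function name, the
file prefix and the paths are hypothetical. The check writes a small file,
forces it to disk with fsync, reads it back, and exits 0 only if that
worked, which assumes watchdogd pats the watchdog only on a zero exit
status.

```python
import os
import sys
import tempfile


def disk_is_responsive(directory):
    """Return True if a small file can be written, synced and read back."""
    payload = os.urandom(16)
    try:
        # Hypothetical file prefix; any unique temporary name would do.
        fd, path = tempfile.mkstemp(dir=directory, prefix=".wdcheck-")
    except OSError:
        return False
    try:
        with os.fdopen(fd, "wb") as f:
            f.write(payload)
            f.flush()
            os.fsync(f.fileno())  # force the write down to the disk
        with open(path, "rb") as f:
            return f.read() == payload
    except OSError:
        return False
    finally:
        try:
            os.unlink(path)
        except OSError:
            pass


def main(argv):
    # /var/tmp is only a placeholder default; pass the filesystem to test.
    directory = argv[1] if len(argv) > 1 else "/var/tmp"
    return 0 if disk_is_responsive(directory) else 1


# The check only runs as a command when a directory argument is given.
if __name__ == "__main__" and len(sys.argv) > 1:
    sys.exit(main(sys.argv))
```

It could be started with something like
watchdogd -e "python3 /usr/local/sbin/wdcheck.py /tank", where both paths
are made up. If the disk sub-system hangs, the fsync call blocks instead of
returning an error, so the command never exits and the watchdog is not
patted, which is the desired outcome. The read-back may be served from
cache, so it is the synced write that actually exercises the disk.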

Doug A.