Date:      Wed, 23 Jan 2013 09:57:33 -0700
From:      Ian Lepore <ian@FreeBSD.org>
To:        mjacob@FreeBSD.org
Cc:        freebsd-hackers@FreeBSD.org, Sushanth Rai <sushanth_rai@yahoo.com>
Subject:   Re: NMI watchdog functionality on Freebsd
Message-ID:  <1358960253.32417.467.camel@revolution.hippie.lan>
In-Reply-To: <5100142D.7040904@freebsd.org>
References:  <1358894455.17521.YahooMailClassic@web181706.mail.ne1.yahoo.com> <201301231025.41118.jhb@freebsd.org>  <5100142D.7040904@freebsd.org>

On Wed, 2013-01-23 at 08:47 -0800, Matthew Jacob wrote:
> On 1/23/2013 7:25 AM, John Baldwin wrote:
> > On Tuesday, January 22, 2013 5:40:55 pm Sushanth Rai wrote:
> >> Hi,
> >>
> >> Does FreeBSD have some functionality similar to Linux's NMI watchdog? I'm
> >> aware of the ichwd driver, but that depends on a WDT being available in the
> >> hardware. Even when it is available, the BIOS needs to support a mechanism to
> >> trigger an OS-level recovery to get any useful information when the system is
> >> really wedged (with interrupts disabled).
> The principal purpose of a watchdog is to keep the system from hanging.
> Information is secondary. The ichwd driver can use the LPC part of ICH
> hardware that's been there since ICH version 4. I implemented this more
> fully at Panasas. The first priority is to keep the system from being
> hung. The next is to detect, on reboot, that a watchdog event occurred.
> Finally, trying to isolate why is good.
> 
> This is equivalent to the tco_WDT stuff on Linux. It's not interrupt 
> driven (it drives the reset line on the processor).
> 

I think there's value in the NMI watchdog idea, but unless you back it
up with a real hardware watchdog you don't really have full watchdog
functionality.  If the NMI can get the OS to produce some extra info,
that's great, and using an NMI gives you a good chance of doing that
even if it is normal interrupt processing that has wedged the machine.
But calling panic() invokes plenty of processing that can get wedged in
other ways, so even an NMI-based watchdog isn't guaranteed to get the
machine running again.

But adding a real hardware watchdog that fires on a slightly longer
timeout than the NMI watchdog gives you the best of everything: you get
information if it's possible to produce it, and you get a real hardware
reset shortly thereafter if producing the info fails.
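
To make the ordering concrete, here is a rough sketch of the two-stage
arrangement as a plain C model. It is not driver code and does not touch any
real FreeBSD interface; the names wd_config, wd_config_valid and wd_state_at
are hypothetical, and the timeouts in the usage note are made-up numbers. It
also assumes that when the NMI path works, it finishes before the hardware
timeout expires.

```c
#include <stdbool.h>

/* Where the machine stands, given the time since the watchdog was last patted. */
enum wd_state {
    WD_HEALTHY,        /* no timeout reached, nothing has fired */
    WD_NMI_RECOVERED,  /* NMI fired, info was produced, OS restarted itself */
    WD_NMI_WEDGED,     /* NMI fired but the panic path hung; waiting on hardware */
    WD_HW_RESET        /* hardware watchdog fired and reset the machine */
};

/* Hypothetical configuration: both values are illustrative, in seconds. */
struct wd_config {
    unsigned nmi_timeout_s;  /* first stage: NMI watchdog */
    unsigned hw_timeout_s;   /* second stage: real hardware watchdog */
};

/* The hardware stage only backs up the NMI stage if it fires strictly later. */
bool wd_config_valid(const struct wd_config *c)
{
    return c->nmi_timeout_s > 0 && c->hw_timeout_s > c->nmi_timeout_s;
}

/*
 * Model the outcome after `stalled_s` seconds without a pat.
 * `nmi_path_completes` says whether the NMI handler and panic() manage to
 * produce their information and restart the system on their own.
 */
enum wd_state wd_state_at(const struct wd_config *c, unsigned stalled_s,
                          bool nmi_path_completes)
{
    if (stalled_s < c->nmi_timeout_s)
        return WD_HEALTHY;
    if (nmi_path_completes)
        return WD_NMI_RECOVERED;   /* best case: info plus a restart */
    if (stalled_s < c->hw_timeout_s)
        return WD_NMI_WEDGED;      /* still hung, hardware stage pending */
    return WD_HW_RESET;            /* fallback: reset without the info */
}
```

With an NMI timeout of 10 seconds and a hardware timeout of 15, a stall whose
panic path hangs shows up as WD_NMI_WEDGED at 12 seconds and WD_HW_RESET at
15, while a stall whose panic path completes ends in WD_NMI_RECOVERED and the
hardware stage never has to act.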

-- Ian




