Date:      Thu, 2 Apr 2009 16:16:34 -0700 (PDT)
From:      Doug Ambrisko <ambrisko@ambrisko.com>
To:        Andriy Gapon <avg@icyb.net.ua>
Cc:        freebsd-hackers@freebsd.org
Subject:   Re: watchdog: hw+sw?
Message-ID:  <200904022316.n32NGYWK015340@ambrisko.com>
In-Reply-To: <49D4A16F.6020906@icyb.net.ua>

Andriy Gapon writes:
| I have some vague thoughts on using SW_WATCHDOG and a hardware watchdog 
| together.
| I think this could be useful but I am not sure how to implement this.
| The idea is this: timeout for SW_WATCHDOG is smaller than timeout for hw 
| wd; when some freeze happens sw wd logic kicks in first, stops hw wd and 
| produces either panic or ddb prompt; if the freeze is so severe that sw 
| wd can't run (e.g. hardware is messed up badly) then hw wd performs its 
| duty. I am mostly interested in having this in unattended mode where kernel 
| dump could be useful for later analysis but the system should recover in 
| reasonable time.
| 
| Suggestions, opinions?
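The two-tier proposal quoted above can be sketched as a small user-space simulation in C. It models only the timing rule: the software timeout must be shorter than the hardware one, and the software path disarms the hardware timer before panicking. The struct and function names are hypothetical and are not the SW_WATCHDOG or watchdog(4) kernel interfaces. Time is an abstract count of seconds, and `sw_can_run` stands in for whether the kernel is still healthy enough to run the software watchdog's handler.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical outcome codes for this simulation only. */
enum wd_outcome {
    WD_OK,        /* no watchdog has fired yet */
    WD_SW_PANIC,  /* SW watchdog fired: HW wd disarmed, panic/ddb/dump */
    WD_HW_RESET   /* SW watchdog could not run: HW wd resets the box */
};

struct two_tier_wd {
    unsigned sw_timeout;  /* seconds; must be shorter than hw_timeout */
    unsigned hw_timeout;  /* seconds; enforced by the hardware timer */
    unsigned last_pat;    /* time the watched code last patted us */
    bool hw_armed;        /* is the hardware timer still counting? */
};

static void wd_init(struct two_tier_wd *wd, unsigned sw, unsigned hw,
                    unsigned now)
{
    /* SW must fire first, or the panic/dump path never gets a chance. */
    assert(sw < hw);
    wd->sw_timeout = sw;
    wd->hw_timeout = hw;
    wd->last_pat = now;
    wd->hw_armed = true;
}

/* Patting restarts both deadlines, since they share one start time. */
static void wd_pat(struct two_tier_wd *wd, unsigned now)
{
    wd->last_pat = now;
}

/*
 * Evaluate the watchdogs at time `now`.  sw_can_run is false when the
 * freeze is so severe that the SW watchdog handler itself cannot execute.
 */
static enum wd_outcome wd_check(struct two_tier_wd *wd, unsigned now,
                                bool sw_can_run)
{
    unsigned idle = now - wd->last_pat;

    if (sw_can_run && idle >= wd->sw_timeout) {
        wd->hw_armed = false;  /* stop HW wd so it cannot reset mid-dump */
        return WD_SW_PANIC;
    }
    if (wd->hw_armed && idle >= wd->hw_timeout)
        return WD_HW_RESET;
    return WD_OK;
}
```

With `wd_init(&wd, 10, 30, 0)`, a stall the kernel can still observe ends in `WD_SW_PANIC` after 10 seconds, while a hang that also stops the software handler ends in `WD_HW_RESET` after 30.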

Before watchdog(4) existed, I implemented a watchdog at a prior company
that did this.  I had the HW watchdog register with the SW watchdog.
Our SW watchdog was then ticked via a sysctl countdown.  This way we
could implement a fairly arbitrary range of time-outs; most HW
watchdogs support only a very limited range of durations, and this
way we didn't really have to worry about that.  If the SW watchdog
didn't tick within 10 seconds or so, the machine was probably dead,
so we used the HW watchdog to enforce the SW watchdog.  It's really
nice getting the panic and dump.
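A rough user-space simulation of the layered scheme described above, under stated assumptions: a once-per-second kernel tick runs the SW watchdog, and a user-space daemon periodically rewrites the countdown through a sysctl (modeled here by the hypothetical `swwd_set()`). While the countdown is nonzero, each tick pats a short HW timer; if the countdown reaches zero, the SW watchdog panics and dumps; if the tick itself stops running, the HW timer is no longer patted and resets the machine. All names are hypothetical; this is neither Doug's implementation nor the watchdog(4) interface. Disarming the HW timer on expiry is an assumption borrowed from the quoted proposal.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical event codes for this simulation only. */
enum swwd_event {
    SWWD_OK,       /* no watchdog has fired yet */
    SWWD_EXPIRED,  /* SW countdown hit 0: panic and take a dump */
    SWWD_HW_RESET  /* the tick stopped running; HW watchdog fired */
};

struct swwd {
    unsigned countdown;   /* seconds left; refreshed from user space */
    unsigned hw_timeout;  /* short period the hardware supports */
    unsigned hw_left;     /* seconds until the HW watchdog fires */
    bool hw_enabled;
};

/* Models the sysctl write a user-space daemon makes to tick the SW wd. */
static void swwd_set(struct swwd *w, unsigned seconds)
{
    w->countdown = seconds;
}

static void swwd_init(struct swwd *w, unsigned hw_timeout,
                      unsigned countdown)
{
    w->hw_timeout = hw_timeout;
    w->hw_left = hw_timeout;
    w->hw_enabled = true;
    swwd_set(w, countdown);
}

/*
 * Advance one second.  tick_ran is false when the kernel is too wedged
 * for the periodic SW watchdog callout to execute at all.
 */
static enum swwd_event swwd_second(struct swwd *w, bool tick_ran)
{
    if (tick_ran) {
        if (w->countdown > 0)
            w->countdown--;
        if (w->countdown == 0) {
            w->hw_enabled = false;  /* assumption: keep HW wd out of the dump */
            return SWWD_EXPIRED;
        }
        w->hw_left = w->hw_timeout; /* healthy tick: pat the short HW timer */
        return SWWD_OK;
    }
    /* No tick: nothing pats the HW timer, so it keeps counting down. */
    if (w->hw_enabled && w->hw_left > 0 && --w->hw_left == 0)
        return SWWD_HW_RESET;
    return SWWD_OK;
}
```

For example, `swwd_init(&w, 4, 300)` gives a 300-second software timeout on top of a hardware timer that only needs to cover 4 seconds, which is the point of the layering: the time-out range is set in software, and the HW watchdog only has to catch the case where the software tick itself dies.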

This worked well for us, so I think it is a good idea.  Also, some HW
watchdogs can be told to generate an NMI, which can also produce a
kernel dump/ddb prompt.  I've also implemented some rough code to put
a simplified back-trace into the IPMI event log in case a disk or the
disk I/O subsystem died and no dump could be written.

Doug A.


