Date:      Tue, 12 Feb 2013 17:17:56 -0800
From:      Alfred Perlstein <bright@mu.org>
To:        "arch@freebsd.org" <arch@freebsd.org>,  Poul-Henning Kamp <phk@phk.freebsd.dk>
Subject:   request for preliminary review, enhanced watchdog.
Message-ID:  <511AE9C4.4030301@mu.org>

At work we've had some issues with spurious watchdog timeouts firing.

Since we use an IPMI/external watchdog, the system is completely reset
and we are unable to gather metrics.

I investigated the issue, compared our watchdog support with what Linux
offers, and decided to crib from the Linux API so that we can benefit
from an enhanced watchdog.

I have a work-in-progress branch that I hope people can review and
offer technical suggestions on.

The branch is located here:
   svn+ssh://svn.freebsd.org/base/user/alfred/ewatchdog

The easy way to get changes:
   svn log --stop-on-copy svn+ssh://svn.freebsd.org/base/user/alfred/ewatchdog

1) Support for a pre-watchdog timeout.  So long as the kernel is at
least somewhat functional (callouts are working), we can trigger a
configurable action (panic, ddb, log) before the watchdog itself fires
if the watchdog program is hung.
2) Support for a built-in software watchdog that offers the same
actions (panic, ddb, log) when it times out.  This is useful for
prototyping.  I did this rather than use SW_WATCHDOG in kern_clock.c
because it was easier to work the code into watchdog.c than to
communicate via the EVENTHANDLER API.
3) Support for a Linux-like API (WDIOC_GETTIMELEFT, WDIOC_SETTIMEOUT,
WDIOC_GETTIMEOUT, etc.).
4) Modifications to watchdogd(8):
    - Warn if the watchdog program takes too long.
    - Option to disable activation of the system watchdog so that one
      can test the watchdogd script without potentially rebooting the
      system.
    - Ability to log to syslog when scripts begin to time out.
    - When told to measure time, do not unconditionally nap for 'sleep'
      seconds; instead, reduce the nap time by the elapsed time so as
      not to trigger the watchdog.
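
As a rough illustration of the last watchdogd(8) item, the nap-time adjustment can be sketched as below. This is not code from the ewatchdog branch: the helper name adjusted_nap() and the use of struct timespec are assumptions made for the example.

```c
#include <assert.h>
#include <time.h>

/*
 * Hypothetical helper (not from the branch): given the configured
 * interval between pats and the time the check command just took,
 * return how long to nap before the next pat.
 */
static struct timespec
adjusted_nap(struct timespec interval, struct timespec elapsed)
{
	struct timespec nap = { .tv_sec = 0, .tv_nsec = 0 };

	/* Both inputs are expected to be normalized timespecs. */
	assert(interval.tv_nsec >= 0 && interval.tv_nsec < 1000000000L);
	assert(elapsed.tv_nsec >= 0 && elapsed.tv_nsec < 1000000000L);

	/* The command used up the whole interval: do not nap at all. */
	if (elapsed.tv_sec > interval.tv_sec ||
	    (elapsed.tv_sec == interval.tv_sec &&
	    elapsed.tv_nsec >= interval.tv_nsec))
		return (nap);

	/* Otherwise nap only for what is left of the interval. */
	nap.tv_sec = interval.tv_sec - elapsed.tv_sec;
	nap.tv_nsec = interval.tv_nsec - elapsed.tv_nsec;
	if (nap.tv_nsec < 0) {
		nap.tv_sec--;
		nap.tv_nsec += 1000000000L;
	}
	return (nap);
}
```

With a 10 second interval and a check command that took 3.5 seconds, the helper returns a 6.5 second nap, so the next pat still lands on the original schedule instead of drifting toward the timeout.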

I've not yet hooked the optional pre-timeout code into watchdogd(8),
but plan to do so later in the week.
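
For readers unfamiliar with the pre-timeout idea from item 1, here is one possible way to model the two stages. It is only a sketch: the names are hypothetical, and it assumes the pre-timeout is expressed as a number of seconds before the hard timeout (the convention Linux uses), which may not match what the branch ends up doing.

```c
#include <assert.h>

/* Illustrative stages only; these names are not from the branch. */
enum wd_stage {
	WD_STAGE_OK,		/* patted recently enough */
	WD_STAGE_PRETIMEOUT,	/* run the configured action: panic, ddb, log */
	WD_STAGE_TIMEOUT	/* the watchdog proper fires */
};

/*
 * Classify the watchdog state from the seconds since the last pat.
 * Assumption: the pre-timeout fires `pretimeout` seconds before the
 * hard timeout; a pretimeout of 0 disables the early stage.
 */
static enum wd_stage
wd_classify(unsigned int since_pat, unsigned int timeout,
    unsigned int pretimeout)
{
	assert(pretimeout < timeout);

	if (since_pat >= timeout)
		return (WD_STAGE_TIMEOUT);
	if (pretimeout != 0 && since_pat >= timeout - pretimeout)
		return (WD_STAGE_PRETIMEOUT);
	return (WD_STAGE_OK);
}
```

With a 30 second timeout and a 10 second pre-timeout, the early stage begins 20 seconds after the last pat. That leaves a 10 second window in which a still-functional kernel can log, panic, or drop to ddb before an external watchdog resets the machine.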

It would be really helpful if we could decide on a way of selecting 
which watchdogs to arm/fire and how to query them.  I may adopt the 
Linux API unless someone has alternative suggestions that make a strong 
enough case to forge our own API.
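
For reference, this is roughly how the Linux ioctls named in item 3 are used from userland. The sketch targets Linux only (it needs <linux/watchdog.h>) and shows the interface being cribbed from, not the FreeBSD implementation in the branch. The wrapper names are made up for the example.

```c
#include <assert.h>
#include <sys/ioctl.h>
#include <linux/watchdog.h>

/*
 * Ask the driver for a new timeout in seconds.  Returns the timeout
 * the driver reports back (it may differ from the request), or -1 on
 * error.  Hypothetical wrapper name.
 */
static int
wd_set_timeout(int fd, int seconds)
{
	int t = seconds;

	assert(seconds > 0);
	if (ioctl(fd, WDIOC_SETTIMEOUT, &t) == -1)
		return (-1);
	return (t);
}

/*
 * Seconds left before the watchdog fires, or -1 on error (not every
 * driver supports this query).  Hypothetical wrapper name.
 */
static int
wd_get_timeleft(int fd)
{
	int left;

	if (ioctl(fd, WDIOC_GETTIMELEFT, &left) == -1)
		return (-1);
	return (left);
}
```

On Linux the descriptor would come from opening /dev/watchdog. Note that opening the device arms the watchdog, so this is not something to try casually on a machine you care about.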

thank you,
-Alfred



