Date:      Thu, 14 Nov 2019 11:19:52 -0700
From:      Ian Lepore <ian@freebsd.org>
To:        Daniel Braniss <danny@cs.huji.ac.il>
Cc:        freebsd-hackers <freebsd-hackers@freebsd.org>
Subject:   Re: can the hardware watchdog reboot a hung kernel?
Message-ID:  <ede820ea5c5f71cea2a98955d02b700b483e1899.camel@freebsd.org>
In-Reply-To: <BEC1714A-2361-4B62-BEB9-82808920C269@cs.huji.ac.il>
References:  <EC4DB495-55D0-44BB-8D6A-0301785FADC7@cs.huji.ac.il> <9cded04a-9ae1-881e-3962-7ef0322e96ed@grosbein.net> <2AD912BF-97B0-421D-B561-722D74864DC9@cs.huji.ac.il> <828605fef472e04311c83a7de0d1f4df429ae717.camel@freebsd.org> <BEC1714A-2361-4B62-BEB9-82808920C269@cs.huji.ac.il>

On Thu, 2019-11-14 at 20:10 +0200, Daniel Braniss wrote:
> > On 14 Nov 2019, at 18:02, Ian Lepore <ian@freebsd.org> wrote:
> > 
> > On Thu, 2019-11-14 at 17:35 +0200, Daniel Braniss wrote:
> > > > On 14 Nov 2019, at 17:28, Eugene Grosbein <eugen@grosbein.net>
> > > > wrote:
> > > > 
> > > > 14.11.2019 21:52, Daniel Braniss wrote:
> > > > 
> > > > > hi,
> > > > > I have several hundred Nano-pi NEO running, and sometimes they
> > > > > hang, since there is no console
> > > > > available, the only solution is to do a power cycle - not so easy
> > > > > since they are distributed in three buildings :-)
> > > > > 
> > > > > I am looking at the watchdog stuff, but it seems that what I want
> > > > > is not supported, i.e.
> > > > > 	reboot the kernel when hung 
> > > > > 
> > > > > wishful thinking?
> > > > 
> > > > It's possible if the hardware has such a watchdog and kernel
> > > > subsystem watchdog(4) supports it.
> > > > rc.conf(5) manual page describes watchdogd_enable option.
> > > > 
> > > 
> > > yes, but it relies on userland; what if the kernel is hung?
> > > 
> > 
> > It relies on the userland daemon to issue the ioctl() calls to pet the
> > dog.  If the kernel is hung, then userland code isn't going to run
> > either, and the watchdog petting won't happen, and eventually the
> > hardware reboots.
> > 
> > We use this at $work specifically to reboot if the kernel hangs, using
> > this config:
> > 
> > watchdogd_enable=YES
> > watchdogd_flags="-s 16 -t 64 -x 64"
> > 
> > That says the daemon should pet the dog every 16 seconds, and the
> > hardware is programmed to reboot if 64 seconds elapse without petting.
> > In addition, when watchdogd is shut down normally (like during a normal
> > system reboot) it doesn't disable the watchdog hardware, it sets the
> > timeout to 64s to protect against any kind of hang during the reboot. 
> > The -t and -x times can be different, 64s just happens to work well for
> > us in both cases.
> > 
> > -- Ian
> > 
> 
> ok, that is very encouraging, now a last question
> how can I hang the kernel to test that the watchdog kicks in? apart from writing a kernel module :-)
> 

One thing to be careful of here is multicore systems.  If you have a
critical app running on a multicore system, that app can hang (maybe it
tries to read from a malfunctioning device and gets stuck forever in a
device driver that doesn't implement timeouts properly).  In that case,
only one core is hung, so watchdogd
will be able to keep petting the dog to prevent a reboot, but since
your app is hung on a different core, you aren't really getting the
protection you need.

The fix for that is to either turn your app into watchdogd (have it make
the periodic ioctl() calls to pet the dog), or use the '-e cmd' option
with watchdogd, and make 'cmd' be a script that somehow verifies that
your critical application is still running properly.
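
A minimal sketch of what such a 'cmd' script might look like.  It
assumes the critical app touches a heartbeat file whenever it makes
progress.  The function name, the file path, and the two-minute window
are all made up for illustration and are not part of watchdogd.  The
idea is that the script exits 0 only while the heartbeat is fresh, so
watchdogd stops petting the dog once the app stalls.

```sh
#!/bin/sh
# Hypothetical health check for use with "watchdogd -e".
# Assumption: the critical application touches a heartbeat file
# regularly while it is making progress.

# app_healthy FILE MINUTES
# Returns 0 if FILE exists and was modified within roughly the last
# MINUTES minutes, non-zero otherwise.  find(1) implementations round
# -mmin differently, so treat the window as approximate.
app_healthy() {
    hb_file=$1
    max_age_min=$2

    # A missing heartbeat file means the app never started or died early.
    [ -f "$hb_file" ] || return 1

    # find prints the path only when the mtime is recent enough; empty
    # output means the heartbeat is stale.
    [ -n "$(find "$hb_file" -mmin "-$max_age_min" 2>/dev/null)" ]
}

# In a real script the last line would be the call itself, so that its
# status becomes the script's exit status, for example:
# app_healthy /var/run/myapp.heartbeat 2
```

With that final call uncommented and the script installed somewhere
like /usr/local/sbin/app_check.sh (again a made-up path), it would be
passed to watchdogd with '-e /usr/local/sbin/app_check.sh'.  The check
should finish well inside the -t timeout, or the check itself could
trigger the reboot.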

-- Ian




