Date:      Fri, 15 Nov 2019 10:11:02 -0700
From:      Ian Lepore <ian@freebsd.org>
To:        Daniel Braniss <danny@cs.huji.ac.il>
Cc:        freebsd-hackers <freebsd-hackers@freebsd.org>
Subject:   Re: can the hardware watchdog reboot a hung kernel?
Message-ID:  <9df4efbbbbb4fd4be81b94894f225c7ec92cc608.camel@freebsd.org>
In-Reply-To: <C1F71AE2-F9B7-4297-BA58-70F03A0E5123@cs.huji.ac.il>
References:  <EC4DB495-55D0-44BB-8D6A-0301785FADC7@cs.huji.ac.il> <9cded04a-9ae1-881e-3962-7ef0322e96ed@grosbein.net> <2AD912BF-97B0-421D-B561-722D74864DC9@cs.huji.ac.il> <828605fef472e04311c83a7de0d1f4df429ae717.camel@freebsd.org> <BEC1714A-2361-4B62-BEB9-82808920C269@cs.huji.ac.il> <ede820ea5c5f71cea2a98955d02b700b483e1899.camel@freebsd.org> <C1F71AE2-F9B7-4297-BA58-70F03A0E5123@cs.huji.ac.il>

On Fri, 2019-11-15 at 18:58 +0200, Daniel Braniss wrote:
> > On 14 Nov 2019, at 20:19, Ian Lepore <ian@freebsd.org> wrote:
> > 
> > 
[...]
> > 
> > One thing to be careful of here is multicore systems.  If you have a
> > critical app running on a multicore system, that app can hang (maybe
> > it tries to read from a device that has malfunctioned and essentially
> > gets hung forever in a device driver that doesn't implement timeouts
> > very well or something).  In that case, only one core is hung, so
> > watchdogd will be able to keep petting the dog to prevent a reboot,
> > but since your app is hung on a different core, you aren't really
> > getting the protection you need.
> > 
> > The fix for that is to either turn your app into watchdogd (have it
> > make the periodic ioctl() calls to pet the dog), or use the '-e cmd'
> > option with watchdogd, and make 'cmd' be a script that somehow
> > verifies that your critical application is still running properly.
> > 
> > -- Ian
> 
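
As one possible shape for the 'cmd' script mentioned above, here is a
minimal Python sketch.  It assumes the critical app touches a heartbeat
file on every pass through its main loop, and that watchdogd only pets
the dog when the command exits 0 (check watchdogd(8) on your release).
The function names and the path shown in the comment are hypothetical.

```python
import os
import sys
import time


def is_alive(path, max_age, now=None):
    """Return True if `path` was modified within the last `max_age` seconds."""
    now = time.time() if now is None else now
    try:
        age = now - os.stat(path).st_mtime
    except OSError:
        # A missing or unreadable heartbeat file counts as "not alive".
        return False
    return age <= max_age


def main(argv):
    """Exit status for the check: 0 = healthy, 1 = stale, 2 = usage error."""
    if len(argv) != 3:
        print("usage: check.py HEARTBEAT_FILE MAX_AGE_SECONDS", file=sys.stderr)
        return 2
    try:
        max_age = float(argv[2])
    except ValueError:
        return 2
    return 0 if is_alive(argv[1], max_age) else 1


# When saved as a standalone script (e.g. run against a hypothetical
# /var/run/myapp.heartbeat), finish the file with:
#     sys.exit(main(sys.argv))
```

A check like this only shows that the app's main loop is still turning.
It also has to finish well inside the watchdog timeout, or the check
itself will cause the reboot.
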
> in my case the kernel is hung, probably by my app, which is using 2
> i2c devices. BTW, this does not happen very often, maybe once a
> month, but it is annoying.
> 
> now the watchdog stuff:
> 1- the Allwinner/NanoPi NEO can only handle up to 8 sec timeout (the
> next is 16 sec (2^34))
>     watchdogd complains if >8 sec:
> 	aw_wdog0: Can't arm, timeout is more than 16 sec
>    and continues trying - IMHO it should exit.
> 

This basically comes down to "know your hardware and don't ask for
things it can't do".  There is a lot of variance in watchdog hardware,
and unfortunately our watchdog software interface is kinda braindead. 
It uses a power-of-2 timeout which is great if you need a large variety
of subsecond timeouts ranging from a few nanoseconds to a half second. 
But it's absolutely horrible for what the real world usually wants: 
some medium-sized integer number of seconds.  Your choices are pretty
much just 8, 16, 32, 64, 128.  Lots of hardware maxes at 16 or 32
seconds.

If aw maxes at 16 it's probably best to set it for that, with petting
at either 4 or 8 second intervals.
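
To make the power-of-2 arithmetic concrete, here is a rough Python
sketch.  It assumes the interface encodes the timeout as an exponent n
meaning 2^n nanoseconds (see watchdog(9) to confirm); the helper names
exponent_for and timeout_seconds are made up for illustration and are
not part of any FreeBSD API.

```python
import math

NS_PER_SEC = 1_000_000_000


def timeout_seconds(n):
    """Length in seconds of a timeout of 2**n nanoseconds."""
    return 2 ** n / NS_PER_SEC


def exponent_for(seconds):
    """Smallest exponent n such that 2**n nanoseconds >= `seconds`."""
    if seconds <= 0:
        raise ValueError("timeout must be positive")
    ns = math.ceil(seconds * NS_PER_SEC)
    # (ns - 1).bit_length() is ceil(log2(ns)) for any integer ns >= 1.
    return (ns - 1).bit_length()


if __name__ == "__main__":
    # Exponents 33..37 are the nominal 8, 16, 32, 64, 128 second choices.
    for n in range(33, 38):
        print(f"2^{n} ns = {timeout_seconds(n):.2f} s")
```

Under that assumption the nominal "16 second" setting is 2^34 ns, about
17.18 seconds, which might be what trips the "more than 16 sec" check
quoted above.  That has not been verified against the aw_wdog driver.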

> 2- this is a bit more annoying:
> 	entering the debugger will trigger the timeout and it will then
> perform a clean reboot (*)

In the debugger, enter "watchdog" without any parameter to disable the
watchdog.  (Or give a parameter to change the timeout.)

Some watchdog hardware cannot be disabled once you've enabled it.

> 	doing a shutdown -r leaves the watchdog in some weird state so
> the reboot hangs when starting the watchdog
> 	  win some, lose some :-)
> 

This is likely another flavor of "some watchdog hardware cannot be
disabled".  But it might just be a bug in the aw watchdog driver too.

-- Ian




