Date: Fri, 15 Nov 2019 18:58:23 +0200
From: Daniel Braniss <danny@cs.huji.ac.il>
To: Ian Lepore <ian@freebsd.org>
Cc: freebsd-hackers <freebsd-hackers@freebsd.org>
Subject: Re: can the hardware watchdog reboot a hung kernel?
Message-ID: <C1F71AE2-F9B7-4297-BA58-70F03A0E5123@cs.huji.ac.il>
In-Reply-To: <ede820ea5c5f71cea2a98955d02b700b483e1899.camel@freebsd.org>
References: <EC4DB495-55D0-44BB-8D6A-0301785FADC7@cs.huji.ac.il>
 <9cded04a-9ae1-881e-3962-7ef0322e96ed@grosbein.net>
 <2AD912BF-97B0-421D-B561-722D74864DC9@cs.huji.ac.il>
 <828605fef472e04311c83a7de0d1f4df429ae717.camel@freebsd.org>
 <BEC1714A-2361-4B62-BEB9-82808920C269@cs.huji.ac.il>
 <ede820ea5c5f71cea2a98955d02b700b483e1899.camel@freebsd.org>

> On 14 Nov 2019, at 20:19, Ian Lepore <ian@freebsd.org> wrote:
>
> On Thu, 2019-11-14 at 20:10 +0200, Daniel Braniss wrote:
>>> On 14 Nov 2019, at 18:02, Ian Lepore <ian@freebsd.org> wrote:
>>>
>>> On Thu, 2019-11-14 at 17:35 +0200, Daniel Braniss wrote:
>>>>> On 14 Nov 2019, at 17:28, Eugene Grosbein <eugen@grosbein.net> wrote:
>>>>>
>>>>> 14.11.2019 21:52, Daniel Braniss wrote:
>>>>>
>>>>>> hi,
>>>>>> I have several hundred Nano-pi NEO running, and sometimes they
>>>>>> hang. Since there is no console available, the only solution is
>>>>>> to do a power cycle - not so easy since they are distributed in
>>>>>> three buildings :-)
>>>>>>
>>>>>> I am looking at the watchdog stuff, but it seems that what I want
>>>>>> is not supported, i.e. reboot the kernel when hung
>>>>>>
>>>>>> wishful thinking?
>>>>>
>>>>> It's possible if the hardware has such a watchdog and kernel
>>>>> subsystem watchdog(4) supports it.
>>>>> rc.conf(5) manual page describes watchdogd_enable option.
>>>>>
>>>>
>>>> yes, but it relies on user land, what if the kernel is hung?
>>>>
>>>
>>> It relies on the userland daemon to issue the ioctl() calls to pet the
>>> dog.  If the kernel is hung, then userland code isn't going to run
>>> either, and the watchdog petting won't happen, and eventually the
>>> hardware reboots.
>>>
>>> We use this at $work specifically to reboot if the kernel hangs, using
>>> this config:
>>>
>>> watchdogd_enable=YES
>>> watchdogd_flags="-s 16 -t 64 -x 64"
>>>
>>> That says the daemon should pet the dog every 16 seconds, and the
>>> hardware is programmed to reboot if 64 seconds elapses without petting.
>>> In addition, when watchdogd is shut down normally (like during a normal
>>> system reboot) it doesn't disable the watchdog hardware, it sets the
>>> timeout to 64s to protect against any kind of hang during the reboot.
>>> The -t and -x times can be different, 64s just happens to work well for
>>> us in both cases.
>>>
>>> -- Ian
>>>
>>
>> ok, that is very encouraging, now a last question:
>> how can I hang the kernel to test that the watchdog kicks in? apart
>> from writing a kernel module :-)
>>
>
> One thing to be careful of here is multicore systems.  If you have a
> critical app running on a multicore system, that app can hang (maybe it
> tries to read from a device that has malfunctioned and essentially gets
> hung forever in a device driver that doesn't implement timeouts very
> well or something).  In that case, only one core is hung, so watchdogd
> will be able to keep petting the dog to prevent a reboot, but since
> your app is hung on a different core, you aren't really getting the
> protection you need.
>
> The fix for that is to either turn your app into watchdogd (have it make
> the periodic ioctl() calls to pet the dog), or use the '-e cmd' option
> with watchdogd, and make 'cmd' be a script that somehow verifies that
> your critical application is still running properly.
>
> -- Ian

in my case the kernel is hung, probably by my app, which is using 2 i2c
devices. BTW, this does not happen very often, maybe once a month, but
it is annoying.

now the watchdog stuff:

1- the Allwinner/NanoPi NEO can only handle up to an 8 sec timeout (the
   next is 16 sec (2^34)). watchdogd complains if >8 sec:
	aw_wdog0: Can't arm, timeout is more than 16 sec
   and continues trying - IMHO it should exit.
2- this is a bit more annoying:
   entering the debugger will trigger the timeout, and it will then
   perform a clean reboot (*)
   doing a shutdown -r leaves the watchdog in some weird state, so the
   reboot hangs when starting the watchdog

win some, lose some :-)

*: IMHO, entering the debugger should stop the hardware timeout - or at
   least that should be optional

cheers and thanks
	danny
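
Editor's note on the timeout limits in item 1: watchdog(4) expresses timeouts as a power-of-two number of nanoseconds, which is where the "2^34" for 16 seconds comes from. The Python sketch below is only an illustration of that scale, not code from watchdogd. The helper names pow2_exponent and fits_hardware are hypothetical, the 2^33 ns ceiling is an assumption taken from the 8-second limit reported above, and watchdogd's own rounding of the -t value may differ in detail.

```python
NS_PER_SEC = 1_000_000_000


def pow2_exponent(seconds):
    """Return the smallest n such that 2**n nanoseconds covers `seconds`.

    Hypothetical helper: it mirrors the power-of-two nanosecond scale
    used by watchdog(4), not watchdogd's exact rounding code.
    """
    ns = int(seconds * NS_PER_SEC)
    if ns < 1:
        raise ValueError("timeout must be positive")
    # (ns - 1).bit_length() is ceil(log2(ns)) for ns >= 1
    return (ns - 1).bit_length()


def fits_hardware(seconds, max_exponent=33):
    """True if the timeout fits under the assumed hardware ceiling.

    max_exponent=33 (about 8.6 s) is an assumption based on the
    8-second limit reported in this message for the NanoPi NEO.
    """
    return pow2_exponent(seconds) <= max_exponent


if __name__ == "__main__":
    for t in (8, 16, 64):
        n = pow2_exponent(t)
        print(f"{t:>2} s -> 2^{n} ns ({2**n / NS_PER_SEC:.2f} s), "
              f"fits: {fits_hardware(t)}")
```

Under those assumptions, the "-t 64 -x 64" flags quoted earlier would be out of range on this board, and something like "-t 8 -x 8" with a shorter -s interval would be the closest equivalent.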