From: Daniel Braniss <danny@cs.huji.ac.il>
Message-Id: <C1F71AE2-F9B7-4297-BA58-70F03A0E5123@cs.huji.ac.il>
Mime-Version: 1.0 (Mac OS X Mail 13.0 \(3601.0.10\))
Subject: Re: can the hardware watchdog reboot a hung kernel?
Date: Fri, 15 Nov 2019 18:58:23 +0200
In-Reply-To: <ede820ea5c5f71cea2a98955d02b700b483e1899.camel@freebsd.org>
Cc: freebsd-hackers <freebsd-hackers@freebsd.org>
To: Ian Lepore <ian@freebsd.org>
References: <EC4DB495-55D0-44BB-8D6A-0301785FADC7@cs.huji.ac.il>
 <9cded04a-9ae1-881e-3962-7ef0322e96ed@grosbein.net>
 <2AD912BF-97B0-421D-B561-722D74864DC9@cs.huji.ac.il>
 <828605fef472e04311c83a7de0d1f4df429ae717.camel@freebsd.org>
 <BEC1714A-2361-4B62-BEB9-82808920C269@cs.huji.ac.il>
 <ede820ea5c5f71cea2a98955d02b700b483e1899.camel@freebsd.org>


> On 14 Nov 2019, at 20:19, Ian Lepore <ian@freebsd.org> wrote:
>
> On Thu, 2019-11-14 at 20:10 +0200, Daniel Braniss wrote:
>>> On 14 Nov 2019, at 18:02, Ian Lepore <ian@freebsd.org> wrote:
>>>
>>> On Thu, 2019-11-14 at 17:35 +0200, Daniel Braniss wrote:
>>>>> On 14 Nov 2019, at 17:28, Eugene Grosbein <eugen@grosbein.net>
>>>>> wrote:
>>>>>
>>>>> 14.11.2019 21:52, Daniel Braniss wrote:
>>>>>
>>>>>> hi,
>>>>>> I have several hundred Nano-pi NEO running, and sometimes they
>>>>>> hang, since there is no console
>>>>>> available, the only solution is to do a power cycle - not so easy
>>>>>> since they are distributed in three buildings :-)
>>>>>>
>>>>>> I am looking at the watchdog stuff, but it seems that what I want
>>>>>> is not supported, i.e.
>>>>>> 	reboot the kernel when hung
>>>>>>
>>>>>> wishful thinking?
>>>>>
>>>>> It's possible if the hardware has such a watchdog and kernel
>>>>> subsystem watchdog(4) supports it.
>>>>> rc.conf(5) manual page describes watchdogd_enable option.
>>>>>
>>>>
>>>> yes, but it relies on userland, what if the kernel is hung?
>>>>
>>>
>>> It relies on the userland daemon to issue the ioctl() calls to pet the
>>> dog.  If the kernel is hung, then userland code isn't going to run
>>> either, and the watchdog petting won't happen, and eventually the
>>> hardware reboots.
>>>
>>> We use this at $work specifically to reboot if the kernel hangs, using
>>> this config:
>>>
>>> watchdogd_enable=YES
>>> watchdogd_flags="-s 16 -t 64 -x 64"
>>>
>>> That says the daemon should pet the dog every 16 seconds, and the
>>> hardware is programmed to reboot if 64 seconds elapse without petting.
>>> In addition, when watchdogd is shut down normally (like during a normal
>>> system reboot) it doesn't disable the watchdog hardware, it sets the
>>> timeout to 64s to protect against any kind of hang during the reboot.
>>> The -t and -x times can be different, 64s just happens to work well for
>>> us in both cases.
>>>
>>> -- Ian
>>>
>>
>> ok, that is very encouraging, now a last question
>> how can i hang the kernel to test that the watchdog kicks in? apart from writing a kernel module :-)
>>
>
> One thing to be careful of here is multicore systems.  If you have a
> critical app running on a multicore system, that app can hang (maybe it
> tries to read from a device that has malfunctioned and essentially gets
> hung forever in a device driver that doesn't implement timeouts very
> well or something).  In that case, only one core is hung, so watchdogd
> will be able to keep petting the dog to prevent a reboot, but since
> your app is hung on a different core, you aren't really getting the
> protection you need.
>
> The fix for that is to either turn your app into watchdogd (have it make
> the periodic ioctl() calls to pet the dog), or use the '-e cmd' option
> with watchdogd, and make 'cmd' be a script that somehow verifies that
> your critical application is still running properly.
>
> -- Ian

in my case the kernel is hung, probably by my app - which is using 2 i2c devices. BTW, this does not happen very often,
maybe once a month, but it is annoying.
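
For the application-level hang Ian describes, the '-e cmd' route could be sketched roughly as below. This is only an illustration, not something from this thread: it assumes the app touches a heartbeat file at regular intervals, and that watchdogd pets the dog only when the command exits with status 0. The path /var/run/myapp.heartbeat, the 30 second limit and the function names are all hypothetical.

```python
import os
import time

# Hypothetical values: the application is assumed to touch this file
# at least every few seconds while it is healthy.
HEARTBEAT = "/var/run/myapp.heartbeat"
MAX_AGE = 30.0  # seconds without a heartbeat before we report failure


def heartbeat_is_fresh(path, max_age, now=None):
    """Return True if 'path' exists and was modified within 'max_age' seconds."""
    if now is None:
        now = time.time()
    try:
        mtime = os.stat(path).st_mtime
    except OSError:
        # A missing or unreadable heartbeat counts as unhealthy.
        return False
    return (now - mtime) <= max_age


def main():
    """Exit status for the health check: 0 means healthy, 1 means stale."""
    return 0 if heartbeat_is_fresh(HEARTBEAT, MAX_AGE) else 1

# When used as the '-e' command, end the script with:
#     raise SystemExit(main())
```

If the app stops updating the file, the check starts failing, the petting stops, and the hardware timeout should then do the rest.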

now the watchdog stuff:
1- the Allwinner/NanoPi NEO can only handle up to 8 sec timeout (the next is 16 sec (2^34))
    the watchdogd complains if >8 sec:
	aw_wdog0: Can't arm, timeout is more than 16 sec
   and continues trying - IMHO it should exit.

2- this is a bit more annoying:
	entering the debugger will trigger the timeout and it will then perform a clean reboot (*)
	doing a shutdown -r leaves the watchdog in some weird state so the reboot hangs when starting the watchdog
	  win some, lose some :-)

*: IMHO, entering the debugger should stop the hardware timeout - or at least that should be optional
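
The arithmetic behind the 8 sec / 16 sec (2^34) remark in item 1 can be checked with a small sketch. It assumes, as the 2^34 figure suggests, that the requested timeout is rounded up to the next power of two in nanoseconds; the function name is made up for illustration and is not part of watchdogd.

```python
def pow2_ns_exponent(seconds):
    """Smallest exponent n such that 2**n nanoseconds >= 'seconds'.

    Assumption: the timeout is expressed as a power of two in
    nanoseconds and rounded up, as the 2^34 figure above suggests.
    """
    ns = int(seconds * 1_000_000_000)
    return max(0, (ns - 1).bit_length())


if __name__ == "__main__":
    for requested in (8, 16, 64):
        n = pow2_ns_exponent(requested)
        actual = 2 ** n / 1e9  # effective timeout in seconds
        print(f"-t {requested}: 2^{n} ns = {actual:.2f} s")
```

Under that assumption a request for 8 sec becomes 2^33 ns (about 8.59 s), while 16 sec becomes 2^34 ns (about 17.18 s), which is over 16 sec and would be consistent with the aw_wdog0 message.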


cheers and thanks

	danny