From owner-freebsd-hackers@freebsd.org  Sat Nov 16 09:09:56 2019
Content-Type: text/plain;
	charset=utf-8
Mime-Version: 1.0 (Mac OS X Mail 13.0 \(3601.0.10\))
Subject: Re: can the hardware watchdog reboot a hung kernel?
From: Daniel Braniss <danny@cs.huji.ac.il>
In-Reply-To: <9df4efbbbbb4fd4be81b94894f225c7ec92cc608.camel@freebsd.org>
Date: Sat, 16 Nov 2019 11:09:53 +0200
Cc: freebsd-hackers <freebsd-hackers@freebsd.org>
Content-Transfer-Encoding: quoted-printable
Message-Id: <8ACEB61A-E76F-4226-B2F7-5AD753457002@cs.huji.ac.il>
References: <EC4DB495-55D0-44BB-8D6A-0301785FADC7@cs.huji.ac.il>
 <9cded04a-9ae1-881e-3962-7ef0322e96ed@grosbein.net>
 <2AD912BF-97B0-421D-B561-722D74864DC9@cs.huji.ac.il>
 <828605fef472e04311c83a7de0d1f4df429ae717.camel@freebsd.org>
 <BEC1714A-2361-4B62-BEB9-82808920C269@cs.huji.ac.il>
 <ede820ea5c5f71cea2a98955d02b700b483e1899.camel@freebsd.org>
 <C1F71AE2-F9B7-4297-BA58-70F03A0E5123@cs.huji.ac.il>
 <9df4efbbbbb4fd4be81b94894f225c7ec92cc608.camel@freebsd.org>
To: Ian Lepore <ian@freebsd.org>
List-Id: Technical Discussions relating to FreeBSD
 <freebsd-hackers.freebsd.org>


> On 15 Nov 2019, at 19:11, Ian Lepore <ian@freebsd.org> wrote:
>
> On Fri, 2019-11-15 at 18:58 +0200, Daniel Braniss wrote:
>>> On 14 Nov 2019, at 20:19, Ian Lepore <ian@freebsd.org> wrote:
>>>
>>>
> [...]
>>>
>>> One thing to be careful of here is multicore systems.  If you have a
>>> critical app running on a multicore system, that app can hang (maybe it
>>> tries to read from a device that has malfunctioned and essentially gets
>>> hung forever in a device driver that doesn't implement timeouts very
>>> well or something).  In that case, only one core is hung, so watchdogd
>>> will be able to keep petting the dog to prevent a reboot, but since
>>> your app is hung on a different core, you aren't really getting the
>>> protection you need.
>>>
>>> The fix for that is to either turn your app into watchdogd (have it make
>>> the periodic ioctl() calls to pet the dog), or use the '-e cmd' option
>>> with watchdogd, and make 'cmd' be a script that somehow verifies that
>>> your critical application is still running properly.
>>>
>>> -- Ian
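
For the '-e cmd' route, here is a rough sketch of what such a check could
look like. It assumes the critical app touches a heartbeat file every few
seconds, and it relies on watchdogd petting the dog only when the command
exits with status 0 (that is how I read watchdogd(8)). The file path, the
10 second limit and the function name are made up for illustration.

```python
#!/usr/bin/env python3
"""Hypothetical health check for `watchdogd -e`: exit 0 only if the app looks alive."""
import os
import sys
import time

# Assumption: a heartbeat older than this means the app is hung.
MAX_AGE_SECONDS = 10.0


def heartbeat_is_fresh(path, max_age, now=None):
    """Return True if `path` exists and was modified within `max_age` seconds."""
    if now is None:
        now = time.time()
    try:
        mtime = os.stat(path).st_mtime
    except OSError:
        # A missing or unreadable heartbeat file counts as a failure.
        return False
    return (now - mtime) <= max_age


# Only act as a command when a heartbeat path is given on the command line.
if __name__ == "__main__" and len(sys.argv) > 1:
    # Exit status 0 lets watchdogd pet the dog; anything else withholds the pat.
    sys.exit(0 if heartbeat_is_fresh(sys.argv[1], MAX_AGE_SECONDS) else 1)
```

It would be wired up with something like
watchdogd -e "/usr/local/bin/check_app.py /var/run/myapp.heartbeat"
(both paths are placeholders). A stale file then stops the petting even
though watchdogd itself is still running happily on another core.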
>>
>> In my case the kernel is hung, probably by my app, which is using 2
>> i2c devices. BTW, this does not happen very often,
>> maybe once a month, but it is annoying.
>>
>> Now the watchdog stuff:
>> 1- The Allwinner/NanoPi NEO can only handle up to an 8 sec timeout (the
>> next is 16 sec (2^34 ns)).
>>    watchdogd complains if >8 sec:
>> 	aw_wdog0: Can't arm, timeout is more than 16 sec
>>   and continues trying - IMHO it should exit.
>>
>
> This basically comes down to "know your hardware and don't ask for
> things it can't do".  There is a lot of variance in watchdog hardware,
> and unfortunately our watchdog software interface is kinda braindead.
> It uses a power-of-2 timeout which is great if you need a large variety
> of subsecond timeouts ranging from a few nanoseconds to a half second.
> But it's absolutely horrible for what the real world usually wants:
> some medium-sized integer number of seconds.  Your choices are pretty
> much just 8, 16, 32, 64, 128.  Lots of hardware maxes at 16 or 32
> seconds.
>
> If aw maxes at 16 it's probably best to set it for that, with petting
> at either 4 or 8 second intervals.
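
To make the power-of-2 point concrete, a small sketch of the arithmetic,
assuming the timeout is encoded as an exponent n meaning 2^n nanoseconds
(which matches 2^34 being the "16 sec" setting mentioned earlier). The
function names are mine, not anything from the kernel.

```python
NS_PER_SEC = 1_000_000_000


def timeout_seconds(exponent):
    """Timeout in seconds for a power-of-2 exponent (2**exponent nanoseconds)."""
    return 2 ** exponent / NS_PER_SEC


def smallest_exponent(min_seconds):
    """Smallest exponent whose timeout is at least `min_seconds`."""
    exponent = 0
    while timeout_seconds(exponent) < min_seconds:
        exponent += 1
    return exponent


# The nominal 8, 16, 32, 64 and 128 second settings.
for n in range(33, 38):
    print(f"2^{n} ns = {timeout_seconds(n):7.2f} s")
```

The nominal values are really about 8.59, 17.18, 34.36, 68.72 and 137.44
seconds, and asking for e.g. 10 seconds has to round up to the 2^34 slot,
since smallest_exponent(10) gives 34.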
>
>> 2- This is a bit more annoying:
>> 	entering the debugger will trigger the timeout and it will then
>> perform a clean reboot (*)
>
> In the debugger, enter "watchdog" without any parameter to disable the
> watchdog.  (Or give a parameter to change the timeout.)
>
> Some watchdog hardware cannot be disabled once you've enabled it.
>
>> 	doing a shutdown -r leaves the watchdog in some weird state so
>> the reboot hangs when starting the watchdog
>> 	  win some, lose some :-)
>>
>
> This is likely another flavor of "some watchdog hardware cannot be
> disabled".  But it might just be a bug in the aw watchdog driver too.
>
> -- Ian
>
>

I have a workaround:
start watchdogd by hand (not via rc.conf); then shutdown does not
stop the watchdog, and all is OK.
I guess there must be some bug in the reset logic in aw_wdog.c.

danny