Subject: Re: can the hardware watchdog reboot a hung kernel?
From: Daniel Braniss
Date: Sat, 16 Nov 2019 11:09:53 +0200
To: Ian Lepore
Cc: freebsd-hackers

> On 15 Nov 2019, at 19:11, Ian Lepore wrote:
>
> On Fri, 2019-11-15 at 18:58 +0200, Daniel Braniss wrote:
>>> On 14 Nov 2019, at 20:19, Ian Lepore wrote:
>>>
>>>
> [...]
>>>
>>> One thing to be careful of here is multicore systems.  If you have a
>>> critical app running on a multicore system, that app can hang (maybe it
>>> tries to read from a device that has malfunctioned and essentially gets
>>> hung forever in a device driver that doesn't implement timeouts very
>>> well or something).  In that case, only one core is hung, so watchdogd
>>> will be able to keep petting the dog to prevent a reboot, but since
>>> your app is hung on a different core, you aren't really getting the
>>> protection you need.
>>>
>>> The fix for that is to either turn your app into watchdogd (have it make
>>> the periodic ioctl() calls to pet the dog), or use the '-e cmd' option
>>> with watchdogd, and make 'cmd' be a script that somehow verifies that
>>> your critical application is still running properly.
>>>
>>> -- Ian
>>
>> in my case the kernel is hung, probably by my app - which is using 2
>> i2c devices. BTW, this does not happen very often,
>> maybe once a month, but it is annoying.
>>
>> now the watchdog stuff:
>> 1- the Allwinner/NanoPi NEO can only handle up to an 8 sec timeout (the
>> next is 16 sec (2^34))
>> watchdogd complains if >8 sec:
>>     aw_wdog0: Can't arm, timeout is more than 16 sec
>> and continues trying - IMHO it should exit.
>>
>
> This basically comes down to "know your hardware and don't ask for
> things it can't do".  There is a lot of variance in watchdog hardware,
> and unfortunately our watchdog software interface is kinda braindead.
> It uses a power-of-2 timeout which is great if you need a large variety
> of subsecond timeouts ranging from a few nanoseconds to a half second.
> But it's absolutely horrible for what the real world usually wants:
> some medium-sized integer number of seconds.  Your choices are pretty
> much just 8, 16, 32, 64, 128.  Lots of hardware maxes at 16 or 32
> seconds.
>
> If aw maxes at 16 it's probably best to set it for that, with petting
> at either 4 or 8 second intervals.
>
>> 2- this is a bit more annoying:
>> entering the debugger will trigger the timeout and it will then
>> perform a clean reboot (*)
>
> In the debugger, enter "watchdog" without any parameter to disable the
> watchdog.  (Or give a parameter to change the timeout.)
>
> Some watchdog hardware cannot be disabled once you've enabled it.
>
>> doing a shutdown -r leaves the watchdog in some weird state so
>> the reboot hangs when starting the watchdog
>> win some, lose some :-)
>>
>
> This is likely another flavor of "some watchdog hardware cannot be
> disabled".  But it might just be a bug in the aw watchdog driver too.
>
> -- Ian
>
>

i have a workaround:
start watchdogd by hand (not via rc.conf); then shutdown does not stop
the watchdog, and all is ok.
I guess there must be some bug in the reset logic in aw_wdog.c

danny
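
A note on the power-of-2 timeouts discussed in this thread: the interface
encodes the timeout as an exponent n, meaning 2^n nanoseconds, so the
"8, 16, 32, 64, 128" choices are only nominal. The sketch below just does
that arithmetic. The helper names are made up for this note and are not
part of any FreeBSD API, and it says nothing about how a particular driver
such as aw_wdog rounds or rejects a value.

```python
# Illustrative arithmetic only. timeout_seconds() and
# largest_exponent_within() are hypothetical helpers, not FreeBSD APIs.

NS_PER_SEC = 1_000_000_000


def timeout_seconds(exponent: int) -> float:
    """Return the timeout in seconds for an exponent n, i.e. 2**n ns."""
    return (1 << exponent) / NS_PER_SEC


def largest_exponent_within(limit_seconds: float) -> int:
    """Return the largest n such that 2**n ns does not exceed the limit."""
    if limit_seconds < timeout_seconds(0):
        raise ValueError("limit is shorter than 1 ns")
    exponent = 0
    while timeout_seconds(exponent + 1) <= limit_seconds:
        exponent += 1
    return exponent


if __name__ == "__main__":
    # Print the range from about half a second up to the nominal 128 s.
    for n in range(29, 38):
        print(f"2^{n} ns = {timeout_seconds(n):8.3f} s")
```

Run as a script, this prints the real durations behind the nominal values:
2^33 ns is about 8.59 s and 2^34 ns is about 17.18 s, so the nominal
"16 sec" setting is in fact a little longer than 16 seconds.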