Subject: Re: can the hardware watchdog reboot a hung kernel?
From: Ian Lepore
To: Daniel Braniss
Cc: freebsd-hackers
Date: Fri, 15 Nov 2019 10:11:02 -0700

On Fri, 2019-11-15 at 18:58 +0200, Daniel Braniss wrote:
> > On 14 Nov 2019, at 20:19, Ian Lepore wrote:
> >
> > [...]
> >
> > One thing to be careful of here is multicore systems.  If you have a
> > critical app running on a multicore system, that app can hang (maybe it
> > tries to read from a device that has malfunctioned and essentially gets
> > hung forever in a device driver that doesn't implement timeouts very
> > well or something).  In that case, only one core is hung, so watchdogd
> > will be able to keep petting the dog to prevent a reboot, but since
> > your app is hung on a different core, you aren't really getting the
> > protection you need.
> >
> > The fix for that is to either turn your app into watchdogd (have it make
> > the periodic ioctl() calls to pet the dog), or use the '-e cmd' option
> > with watchdogd, and make 'cmd' be a script that somehow verifies that
> > your critical application is still running properly.
> >
> > -- Ian
>
> in my case the kernel is hung, probably by my app - which is using 2
> i2c devices. BTW, this does not happen very often,
> maybe once a month, but is annoying.
>
> now the watchdog stuff:
> 1- the Allwinner/NanoPi NEO can only handle up to 8 sec timeout (the
> next is 16 sec (2^34))
> watchdogd complains if >8 sec:
> aw_wdog0: Can't arm, timeout is more than 16 sec
> and continues trying - IMHO it should exit.
>

This basically comes down to "know your hardware and don't ask for
things it can't do".  There is a lot of variance in watchdog hardware,
and unfortunately our watchdog software interface is kinda braindead.
It uses a power-of-2 timeout, which is great if you need a large variety
of subsecond timeouts ranging from a few nanoseconds to a half second.
But it's absolutely horrible for what the real world usually wants:
some medium-sized integer number of seconds.  Your choices are pretty
much just 8, 16, 32, 64, 128.  Lots of hardware maxes out at 16 or 32
seconds.  If aw maxes out at 16, it's probably best to set it for that,
with petting at either 4 or 8 second intervals.

> 2- this is a bit more annoying:
> entering the debugger will trigger the timeout and it will then
> perform a clean reboot (*)

In the debugger, enter "watchdog" without any parameter to disable the
watchdog.  (Or give a parameter to change the timeout.)  Some watchdog
hardware cannot be disabled once you've enabled it.

> doing a shutdown -r leaves the watchdog in some weird state so
> the reboot hangs when starting the watchdog
> win some, lose some :-)
>

This is likely another flavor of "some watchdog hardware cannot be
disabled".  But it might just be a bug in the aw watchdog driver too.

-- Ian
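
To make the power-of-2 arithmetic above concrete, here is a small sketch.
It assumes the timeout is expressed as an exponent n meaning 2^n
nanoseconds (which is how I read the "2^34" next to the 16 sec step in
the quoted text) and that a request in seconds gets rounded up to the
next exponent.  The function names are made up for illustration; this is
not the watchdogd code, and the rounding policy is an assumption.

```python
import math

NS_PER_SEC = 1_000_000_000


def timeout_exponent(seconds):
    """Return the smallest n such that 2**n nanoseconds >= `seconds`.

    Hypothetical helper: assumes the interface takes a power-of-2
    exponent in nanoseconds and that requests are rounded up.
    """
    if seconds <= 0:
        raise ValueError("timeout must be positive")
    ns = math.ceil(seconds * NS_PER_SEC)
    # For ns >= 1, (ns - 1).bit_length() is ceil(log2(ns)).
    return (ns - 1).bit_length()


def exponent_to_seconds(n):
    """Actual duration, in seconds, of a 2**n nanosecond timeout."""
    return (1 << n) / NS_PER_SEC


if __name__ == "__main__":
    # Show what each requested timeout really turns into.
    for want in (8, 10, 16, 30):
        n = timeout_exponent(want)
        print(f"asked for {want} s -> exponent {n} "
              f"(about {exponent_to_seconds(n):.2f} s)")
```

Under those assumptions, anything between just over 8.59 and 17.18
seconds lands on exponent 34, so asking for 10 seconds already exceeds
hardware that tops out at the 8 second step.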