From owner-freebsd-hackers@freebsd.org Thu Nov 14 18:19:59 2019
Subject: Re: can the hardware watchdog reboot a hung kernel?
From: Ian Lepore
To: Daniel Braniss
Cc: freebsd-hackers
Date: Thu, 14 Nov 2019 11:19:52 -0700

On Thu, 2019-11-14 at 20:10 +0200, Daniel Braniss wrote:
> > On 14 Nov 2019, at 18:02, Ian Lepore wrote:
> >
> > On Thu, 2019-11-14 at 17:35 +0200, Daniel Braniss wrote:
> > > > On 14 Nov 2019, at 17:28, Eugene Grosbein
> > > > wrote:
> > > >
> > > > 14.11.2019 21:52, Daniel Braniss wrote:
> > > >
> > > > > hi,
> > > > > I have several hundred Nano-pi NEO running, and sometimes they
> > > > > hang, since there is no console
> > > > > available, the only solution is to do a power cycle - not so easy
> > > > > since they are distributed in three buildings :-)
> > > > >
> > > > > I am looking at the watchdog stuff, but it seems that what I want
> > > > > is not supported, i.e.
> > > > > reboot the kernel when hung
> > > > >
> > > > > wishful thinking?
> > > >
> > > > It's possible if the hardware has such a watchdog and kernel
> > > > subsystem watchdog(4) supports it.
> > > > rc.conf(5) manual page describes watchdogd_enable option.
> > > >
> > >
> > > yes, but it relies on user land, what if the kernel is hung?
> > >
> >
> > It relies on the userland daemon to issue the ioctl() calls to pet the
> > dog. If the kernel is hung, then userland code isn't going to run
> > either, and the watchdog petting won't happen, and eventually the
> > hardware reboots.
> >
> > We use this at $work specifically to reboot if the kernel hangs, using
> > this config:
> >
> > watchdogd_enable=YES
> > watchdogd_flags="-s 16 -t 64 -x 64"
> >
> > That says the daemon should pet the dog every 16 seconds, and the
> > hardware is programmed to reboot if 64 seconds elapses without petting.
> > In addition, when watchdogd is shutdown normally (like during a normal
> > system reboot) it doesn't disable the watchdog hardware, it sets the
> > timeout to 64s to protect against any kind of hang during the reboot.
> > The -t and -x times can be different, 64s just happens to work well for
> > us in both cases.
> >
> > -- Ian
> >
>
> ok, that is very encouraging, now a last question
> how can I hang the kernel to test that the watchdog kicks in? apart
> from writing a kernel module :-)
>

One thing to be careful of here is multicore systems. If you have a
critical app running on a multicore system, that app can hang (maybe it
tries to read from a device that has malfunctioned and essentially gets
hung forever in a device driver that doesn't implement timeouts very
well or something). In that case, only one core is hung, so watchdogd
will be able to keep petting the dog to prevent a reboot, but since
your app is hung on a different core, you aren't really getting the
protection you need.

The fix for that is to either turn your app into watchdogd (have it
make the periodic ioctl() calls to pet the dog), or use the '-e cmd'
option with watchdogd, and make 'cmd' be a script that somehow verifies
that your critical application is still running properly.

-- Ian
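
As a rough illustration of the '-e cmd' idea above, here is a minimal sketch of such a check script in Python. It is not from the thread: it assumes the critical application touches a heartbeat file each time it finishes a unit of work, and the script name, the heartbeat path and the 30 second limit are all hypothetical. It relies only on the documented behaviour that watchdogd pets the dog when the command exits with status 0.

```python
#!/usr/bin/env python3
"""Hypothetical health check meant to be run as watchdogd's '-e cmd'."""
import os
import sys
import time

# Assumption: default staleness limit, used when none is given on the command line.
DEFAULT_MAX_AGE_SECONDS = 30.0


def heartbeat_is_fresh(path, max_age_s, now=None):
    """Return True if `path` exists and was modified within `max_age_s` seconds."""
    if now is None:
        now = time.time()
    try:
        mtime = os.stat(path).st_mtime
    except OSError:
        # A missing or unreadable heartbeat file counts as unhealthy.
        return False
    return (now - mtime) <= max_age_s


def main(argv):
    """argv[1] is the heartbeat path, optional argv[2] the maximum age in seconds."""
    path = argv[1]
    max_age_s = float(argv[2]) if len(argv) > 2 else DEFAULT_MAX_AGE_SECONDS
    # Exit status 0 means "healthy, pet the dog"; anything else withholds the pat.
    return 0 if heartbeat_is_fresh(path, max_age_s) else 1


# Only act as a command when a heartbeat path was actually supplied.
if __name__ == "__main__" and len(sys.argv) > 1:
    sys.exit(main(sys.argv))
```

The script would be handed to watchdogd as the -e command, for example -e "/usr/local/bin/check_app.py /var/run/myapp.heartbeat 30" (both paths hypothetical). If the app stops updating the file, the check starts failing, the pats stop, and the hardware timeout set with -t eventually fires. The staleness limit should be chosen with the -s and -t values in mind, and the check itself has to return quickly so that it can't become the thing that hangs.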