Subject: Re: can the hardware watchdog reboot a hung kernel?
From: "O'Connor, Daniel"
Date: Fri, 15 Nov 2019 14:32:44 +1030
To: Eugene Grosbein
Cc: Ian Lepore, Daniel Braniss, freebsd-hackers
Message-Id: <92134BA3-3BB3-4377-B9A7-1B1D702824F7@dons.net.au>
List-Id: Technical Discussions relating to FreeBSD

> On 15 Nov 2019, at 14:29, Eugene Grosbein wrote:
>
> 15.11.2019 1:19, Ian Lepore wrote:
>
>> One thing to be careful of here is multicore systems. If you have a
>> critical app running on a multicore system, that app can hang (maybe it
>> tries to read from a device that has malfunctioned and essentially gets
>> hung forever in a device driver that doesn't implement timeouts very
>> well or something). In that case, only one core is hung, so watchdogd
>> will be able to keep petting the dog to prevent a reboot, but since
>> your app is hung on a different core, you aren't really getting the
>> protection you need.
>>
>> The fix for that is to either turn your app into watchdogd (have it make
>> the periodic ioctl() calls to pet the dog), or use the '-e cmd' option
>> with watchdogd, and make 'cmd' be a script that somehow verifies that
>> your critical application is still running properly.
>
> I have not tried it myself, but there may be an easier way
> if the app is single-process and single-threaded: use cpuset(1) to bind
> both the app and watchdogd to the same core.

You can get watchdogd to run a script (the '-e cmd' option Ian mentioned), so you could have it check the app for liveness somehow, and the dog will bite if the check fails.

--
Daniel O'Connor
"The nice thing about standards is that there are so many of them to choose from."
 -- Andrew Tanenbaum
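
One way the "check for liveness somehow" idea might look, as a rough sketch only. It assumes the critical app touches a heartbeat file at least every 30 seconds while it is healthy; the path, the threshold, and the function names are all made up for illustration and do not come from this thread. It relies on watchdogd only resetting the timer when the '-e' command exits with status 0.

```python
#!/usr/bin/env python3
"""Liveness check intended to be run by watchdogd via '-e cmd'.

Assumption (not from the thread): the critical app updates HEARTBEAT
at least every MAX_AGE seconds while it is working properly.
"""
import os
import time

HEARTBEAT = "/var/run/myapp.heartbeat"  # hypothetical path
MAX_AGE = 30.0                          # hypothetical threshold, in seconds


def is_alive(path, max_age, now=None):
    """Return True if 'path' exists and was modified within max_age seconds."""
    if now is None:
        now = time.time()
    try:
        mtime = os.stat(path).st_mtime
    except OSError:
        # A missing or unreadable heartbeat file counts as a dead app.
        return False
    return (now - mtime) <= max_age


def main():
    """Return the exit status watchdogd should see.

    0 means the app looks alive, so watchdogd pats the dog.
    1 means it does not, so the timer is left to expire and the dog bites.
    """
    return 0 if is_alive(HEARTBEAT, MAX_AGE) else 1
```

To use it as a real script, the file would end with `raise SystemExit(main())`, and watchdogd would be started with something like `watchdogd -e /usr/local/bin/check_myapp.py` (that path is also hypothetical). A heartbeat file only proves the app is still getting through its main loop, so the app should update it from the code path that does the real work, not from a separate timer thread.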