From: Eugene Grosbein
To: Ian Lepore, Daniel Braniss
Cc: freebsd-hackers
Date: Fri, 15 Nov 2019 10:59:13 +0700
Subject: Re: can the hardware watchdog reboot a hung kernel?

15.11.2019 1:19, Ian Lepore wrote:

> One thing to be careful of here is multicore systems.  If you have a
> critical app running on a multicore system, that app can hang (maybe it
> tries to read from a device that has malfunctioned and essentially gets
> hung forever in a device driver that doesn't implement timeouts very
> well or something).  In that case, only one core is hung, so watchdogd
> will be able to keep petting the dog to prevent a reboot, but since
> your app is hung on a different core, you aren't really getting the
> protection you need.
>
> The fix for that is to either turn your app into watchdogd (have it make
> the periodic ioctl() calls to pet the dog), or use the '-e cmd' option
> with watchdogd, and make 'cmd' be a script that somehow verifies that
> your critical application is still running properly.

I have not tried it myself, but there may be an easier way if the app is
single-process and single-threaded: use cpuset(1) to bind both the app
and watchdogd to the same core.
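
For the '-e cmd' route quoted above, a check script could look roughly like
the sketch below. It is untested on a real system and rests on assumptions
that are not from this thread: the application is assumed to touch a
heartbeat file periodically, and the path, the 30-second limit and the
function names are all made up for illustration. As far as I know,
watchdogd only pets the dog when 'cmd' exits with status zero, so the
script reports a stale or missing heartbeat as failure.

```python
#!/usr/bin/env python3
"""Hypothetical health check meant to be run via watchdogd's '-e cmd'."""
import os
import sys
import time

# Both values are illustrative assumptions, not taken from any real setup.
HEARTBEAT_PATH = "/var/run/myapp.heartbeat"
MAX_AGE_SECONDS = 30.0


def heartbeat_is_fresh(path, max_age, now=None):
    """Return True if `path` exists and was modified within `max_age` seconds."""
    if now is None:
        now = time.time()
    try:
        mtime = os.stat(path).st_mtime
    except OSError:
        # A missing or unreadable heartbeat file counts as "app is not healthy".
        return False
    return (now - mtime) <= max_age


def main():
    """Exit status for watchdogd: 0 means healthy, 1 means do not pet the dog."""
    return 0 if heartbeat_is_fresh(HEARTBEAT_PATH, MAX_AGE_SECONDS) else 1

# As a standalone script, finish with: sys.exit(main())
```

The application side would only need to update the file's modification time
from its main loop, so a thread stuck in a driver stops refreshing it and
the check starts failing regardless of which core watchdogd runs on.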