From: Konstantin Belousov
To: Paul Koch
Cc: stable@freebsd.org
Subject: Re: 10.2 - Process stuck in unkillable sleep
Date: Wed, 24 Feb 2016 15:18:18 +0200
Message-ID: <20160224131818.GO91220@kib.kiev.ua>
In-Reply-To: <20160224142619.6710b6c1@akips.com>

On Wed, Feb 24, 2016 at 02:26:19PM +1000, Paul Koch wrote:
>
> Occasionally we see a process get stuck in an unkillable state and
> the only solution is a hard reboot.
>
> Occasionally == once every two weeks across 60+ servers, which are spread
> across the globe in customer sites. We have no remote access to these boxes.
>
> The process that most often that gets stuck, but not limited to, is a large
> scale Ping/SNMP poller. It is a fairly simplistic C program that just fires
> out lots of ping (raw ICMP socket) and SNMP (UDP socket) requests
> asynchronously.
>
> We've managed to trap the problem a few times on a test server running in
> VirtualBox, but it also occurs on customer sites who run VMware, Hyper-V,
> QEMU and on bare metal.
>
>
> We raise this PR
>   https://bugs.freebsd.org/bugzilla/show_bug.cgi?id=204081
>
> but suspect it is a similar/same issue as
>   https://bugs.freebsd.org/bugzilla/show_bug.cgi?id=200992
>
> This is the info we've gathered from the most recent time it has occurred:
>
>
> # uname -a
> FreeBSD shed153.akips.com 10.2-RELEASE-p12 FreeBSD 10.2-RELEASE-p12 #0 r295070:
> Sat Jan 30 20:03:44 UTC 2016 root@shed21.akips.com:/usr/obj/usr/src/sys/GENERIC amd64
> # ps auxww | grep nm-poller
> akips 1014 0.0 2.6 871820 106540 - Ds 10Feb16 1078:59.06 nm-poller
>
>
> # procstat -k 1014
> PID TID COMM TDNAME KSTACK
> 1014 100365 nm-poller - mi_switch sleepq_timedwait_sig _cv_timedwait_sig_sbt seltdwait kern_select sys_select amd64_syscall Xfast_syscall
>

Yes, on HEAD it was reported that https://reviews.freebsd.org/D5221
fixed the problem. It is still not reviewed.

I back-ported it to stable/10. The patch below probably does not apply
to 10.2; you would need 10.3 for it. Some revisions are missing from
stable/10, but I think that the issue worked around in the patch is at
the core of the troubles many people reported.

Index: sys/kern/kern_timeout.c
===================================================================
--- sys/kern/kern_timeout.c	(revision 295966)
+++ sys/kern/kern_timeout.c	(working copy)
@@ -1127,7 +1127,7 @@ _callout_stop_safe(c, safe)
 	 * Some old subsystems don't hold Giant while running a callout_stop(),
 	 * so just discard this check for the moment.
 	 */
-	if (!safe && c->c_lock != NULL) {
+	if ((safe & CS_DRAIN) == 0 && c->c_lock != NULL) {
 		if (c->c_lock == &Giant.lock_object)
 			use_lock = mtx_owned(&Giant);
 		else {
@@ -1207,7 +1207,7 @@ again:
 			return (0);
 		}
 
-		if (safe) {
+		if ((safe & CS_DRAIN) != 0) {
 			/*
 			 * The current callout is running (or just
 			 * about to run) and blocking is allowed, so
@@ -1319,7 +1319,7 @@ again:
 		CTR3(KTR_CALLOUT, "postponing stop %p func %p arg %p",
 		    c, c->c_func, c->c_arg);
 		CC_UNLOCK(cc);
-		return (0);
+		return ((safe & CS_MIGRBLOCK) != 0);
 	}
 	CTR3(KTR_CALLOUT, "failed to stop %p func %p arg %p",
 	    c, c->c_func, c->c_arg);
Index: sys/kern/subr_sleepqueue.c
===================================================================
--- sys/kern/subr_sleepqueue.c	(revision 295966)
+++ sys/kern/subr_sleepqueue.c	(working copy)
@@ -572,7 +572,8 @@ sleepq_check_timeout(void)
 	 * another CPU, so synchronize with it to avoid having it
 	 * accidentally wake up a subsequent sleep.
 	 */
-	else if (callout_stop(&td->td_slpcallout) == 0) {
+	else if (_callout_stop_safe(&td->td_slpcallout, CS_MIGRBLOCK)
+	    == 0) {
 		td->td_flags |= TDF_TIMEOUT;
 		TD_SET_SLEEPING(td);
 		mi_switch(SW_INVOL | SWT_SLEEPQTIMO, NULL);
Index: sys/sys/callout.h
===================================================================
--- sys/sys/callout.h	(revision 295966)
+++ sys/sys/callout.h	(working copy)
@@ -62,6 +62,9 @@ struct callout_handle {
 	struct callout *callout;
 };
 
+#define	CS_DRAIN		0x0001
+#define	CS_MIGRBLOCK		0x0002
+
 #ifdef _KERNEL
 /*
  * Note the flags field is actually *two* fields.  The c_flags
@@ -81,7 +84,7 @@ struct callout_handle {
  */
 #define	callout_active(c)	((c)->c_flags & CALLOUT_ACTIVE)
 #define	callout_deactivate(c)	((c)->c_flags &= ~CALLOUT_ACTIVE)
-#define	callout_drain(c)	_callout_stop_safe(c, 1)
+#define	callout_drain(c)	_callout_stop_safe(c, CS_DRAIN)
 void	callout_init(struct callout *, int);
 void	_callout_init_lock(struct callout *, struct lock_object *, int);
 #define	callout_init_mtx(c, mtx, flags)					\