From: Michael Jung
To: Konstantin Belousov
Cc: Paul Koch, stable@freebsd.org, owner-freebsd-stable@freebsd.org
Subject: Re: 10.2 - Process stuck in unkillable sleep
Date: Wed, 24 Feb 2016 08:37:04 -0500
In-Reply-To: <20160224131818.GO91220@kib.kiev.ua>
References: <20160224142619.6710b6c1@akips.com> <20160224131818.GO91220@kib.kiev.ua>
Message-ID: <7fed68a8927e70d4d9cc6ea1a8ddd1bf@mail.mikej.com>

On 2016-02-24 08:18, Konstantin Belousov wrote:
> On Wed, Feb 24, 2016 at 02:26:19PM +1000, Paul Koch wrote:
>>
>> Occasionally we see a process get stuck in an unkillable state and
>> the only solution is a hard reboot.
>>
>> Occasionally == once every two weeks across 60+ servers, which are spread
>> across the globe in customer sites. We have no remote access to these
>> boxes.
>>
>> The process that most often that gets stuck, but not limited to, is a large
>> scale Ping/SNMP poller. It is a fairly simplistic C program that just fires
>> out lots of ping (raw ICMP socket) and SNMP (UDP socket) requests
>> asynchronously.
>>
>> We've managed to trap the problem a few times on a test server running in
>> VirtualBox, but it also occurs on customer sites who run VMware, Hyper-V,
>> QEMU and on bare metal.
>>
>>
>> We raise this PR
>>   https://bugs.freebsd.org/bugzilla/show_bug.cgi?id=204081
>>
>> but suspect it is a similar/same issue as
>>   https://bugs.freebsd.org/bugzilla/show_bug.cgi?id=200992
>>
>> This is the info we've gathered from the most recent time it has occurred:
>>
>>
>> # uname -a
>> FreeBSD shed153.akips.com 10.2-RELEASE-p12 FreeBSD 10.2-RELEASE-p12 #0 r295070:
>> Sat Jan 30 20:03:44 UTC 2016
>> root@shed21.akips.com:/usr/obj/usr/src/sys/GENERIC amd64
>
>> # ps auxww | grep nm-poller
>> akips 1014 0.0 2.6 871820 106540 - Ds 10Feb16 1078:59.06 nm-poller
>>
>>
>> # procstat -k 1014
>> PID TID COMM TDNAME KSTACK
>> 1014 100365 nm-poller - mi_switch sleepq_timedwait_sig
>> _cv_timedwait_sig_sbt seltdwait kern_select sys_select amd64_syscall
>> Xfast_syscall
>>
>
> Yes, on HEAD it was reported that the https://reviews.freebsd.org/D5221
> fixed the problem. Still not reviewed.
>
> I did back-port to stable/10, the patch below is probably not applicable
> to 10.2, you would need 10.3 for it. Some revisions are missed from
> stable/10, but I think that the issue worked around in the patch is at
> the core of troubles many people reported.
>
> Index: sys/kern/kern_timeout.c
> ===================================================================
> --- sys/kern/kern_timeout.c (revision 295966)
> +++ sys/kern/kern_timeout.c (working copy)
> @@ -1127,7 +1127,7 @@ _callout_stop_safe(c, safe)
>       * Some old subsystems don't hold Giant while running a callout_stop(),
>       * so just discard this check for the moment.
>       */
> -    if (!safe && c->c_lock != NULL) {
> +    if ((safe & CS_DRAIN) == 0 && c->c_lock != NULL) {
>          if (c->c_lock == &Giant.lock_object)
>              use_lock = mtx_owned(&Giant);
>          else {
> @@ -1207,7 +1207,7 @@ again:
>              return (0);
>          }
>
> -        if (safe) {
> +        if ((safe & CS_DRAIN) != 0) {
>              /*
>               * The current callout is running (or just
>               * about to run) and blocking is allowed, so
> @@ -1319,7 +1319,7 @@ again:
>              CTR3(KTR_CALLOUT, "postponing stop %p func %p arg %p",
>                  c, c->c_func, c->c_arg);
>              CC_UNLOCK(cc);
> -            return (0);
> +            return ((safe & CS_MIGRBLOCK) != 0);
>          }
>          CTR3(KTR_CALLOUT, "failed to stop %p func %p arg %p",
>              c, c->c_func, c->c_arg);
> Index: sys/kern/subr_sleepqueue.c
> ===================================================================
> --- sys/kern/subr_sleepqueue.c (revision 295966)
> +++ sys/kern/subr_sleepqueue.c (working copy)
> @@ -572,7 +572,8 @@ sleepq_check_timeout(void)
>       * another CPU, so synchronize with it to avoid having it
>       * accidentally wake up a subsequent sleep.
>       */
> -    else if (callout_stop(&td->td_slpcallout) == 0) {
> +    else if (_callout_stop_safe(&td->td_slpcallout, CS_MIGRBLOCK)
> +        == 0) {
>          td->td_flags |= TDF_TIMEOUT;
>          TD_SET_SLEEPING(td);
>          mi_switch(SW_INVOL | SWT_SLEEPQTIMO, NULL);
> Index: sys/sys/callout.h
> ===================================================================
> --- sys/sys/callout.h (revision 295966)
> +++ sys/sys/callout.h (working copy)
> @@ -62,6 +62,9 @@ struct callout_handle {
>      struct callout *callout;
>  };
>
> +#define CS_DRAIN 0x0001
> +#define CS_MIGRBLOCK 0x0002
> +
>  #ifdef _KERNEL
>  /*
>   * Note the flags field is actually *two* fields. The c_flags
> @@ -81,7 +84,7 @@ struct callout_handle {
>   */
>  #define callout_active(c) ((c)->c_flags & CALLOUT_ACTIVE)
>  #define callout_deactivate(c) ((c)->c_flags &= ~CALLOUT_ACTIVE)
> -#define callout_drain(c) _callout_stop_safe(c, 1)
> +#define callout_drain(c) _callout_stop_safe(c, CS_DRAIN)
>  void callout_init(struct callout *, int);
>  void _callout_init_lock(struct callout *, struct lock_object *, int);
>  #define callout_init_mtx(c, mtx, flags) \
>

I'm not sure if I have the same or a different issue. According to top,
my process is stuck in the "STOP" state.

FreeBSD firewall.mikej.com 10.2-STABLE FreeBSD 10.2-STABLE #22 r289078M:
Wed Dec 9 17:13:31 EST 2015
mikej@firewall.mikej.com:/usr/obj/usr/src/sys/GENERIC amd64

42152 emby 2 20 -20 869M 1424K STOP 4 166:22 0.00% mono-sgen

root@firewall:/usr/ports/devel # procstat -kk 42152
PID TID COMM TDNAME KSTACK
42152 101501 mono-sgen - mi_switch+0xe1 thread_suspend_switch+0x170 thread_single+0x4e5 exit1+0xbe sigexit+0x925 postsig+0x286 ast+0x427 doreti_ast+0x1f
42152 101511 mono-sgen - mi_switch+0xe1 sleepq_timedwait_sig+0x8b _sleep+0x238 umtxq_sleep+0x125 do_wait+0x387 __umtx_op_wait_uint_private+0x83 amd64_syscall+0x35d Xfast_syscall+0xfb

root@firewall:/usr/ports/devel # kill -9 42152

has no effect. I tried to stop the process with
/usr/local/etc/rc.d/emby-server stop

emby-server-3.0.5821
mono-4.2.2.10

If this is a different issue, please let me know and I will open a
separate PR.

Thank you.

--mikej