From: Ravi Murty
To: freebsd-hackers@freebsd.org
Date: Sun, 13 Mar 2011 21:41:21 -0700
Subject: SIGSTOP and SIGKILL
List-Id: Technical Discussions relating to FreeBSD

Hi everybody,

I'm using
FreeBSD 8.0 and I seem to have a race condition that is fairly reproducible. Let me try to describe it.

The basic idea is that we use SIGSTOP and SIGCONT to stop and restart the threads of a process, call it p1. A caller (call it c1) SIGSTOPs and SIGCONTs p1 until another caller (call it c2) comes along and decides to kill the process. Both callers use pfind(...) to find the process and hold proc_lock for p1 before subjecting p1 to any of these signals. What I see is that SIGKILL is somehow ignored in favor of SIGSTOP, and the process and all of its threads end up suspended.

As a side note, we changed our implementation to "post" SIGKILL to all threads of p1 because of another race we discovered. In that case the thread selected by psignal/tdsignal happened to be in thr_exit() on its way to dying. Because it was still on the process's list of threads, it was picked (FIRST_TD_IN_PROC), but because it was in thr_exit() it died, taking SIGKILL with it.

What I see in this new race is the following. We post SIGKILL to every thread of the process, and c2 leaves, releasing p1's proc_lock. As each thread returns to ring 3 via the trap handler, it sees that it has a signal to deal with and calls cursig and postsig. In the code, postsig eventually calls sigexit (the default behavior), which via exit1 calls thread_suspend_check, causing the threads to kill themselves as long as the first thread to get there calls thread_single(SINGLE_EXIT). In our case, the process (which is still on the global allproc list) is subjected to SIGSTOP, which sets the P_STOPPED_SIG flag on p1. As each thread makes its way through thread_suspend_check, it suspends itself because P_SHOULDSTOP ends up being true.

In the end I have a process whose threads have all taken SIGKILL (I can dump each thread's state and look at its siglist to see that no signals are pending) but the process hasn't died. This seems odd.
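
To make the ordering concrete, here is a minimal user-space sketch of the interleaving described above. It is a toy model, not kernel code: every `model_*` name and the thread count are hypothetical, and the only things taken from the description are the stop flag (P_STOPPED_SIG) and the order of events (SIGKILL posted to every thread, SIGSTOP arriving, then each thread reaching the suspend check on its way out).

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

#define MODEL_NTHREADS 4   /* arbitrary thread count for the illustration */

/* Hypothetical stand-in for the process state discussed above. */
struct model_proc {
    bool stopped_sig;                   /* models P_STOPPED_SIG */
    bool kill_pending[MODEL_NTHREADS];  /* models SIGKILL in each thread's siglist */
    bool suspended[MODEL_NTHREADS];
    bool exited[MODEL_NTHREADS];
};

/* c2: post SIGKILL to every thread of the process. */
void model_post_kill(struct model_proc *p)
{
    for (size_t i = 0; i < MODEL_NTHREADS; i++)
        p->kill_pending[i] = true;
}

/* c1: a SIGSTOP marks the whole process as stopped. */
void model_post_stop(struct model_proc *p)
{
    p->stopped_sig = true;
}

/*
 * One thread returning to user mode. It consumes its pending SIGKILL
 * (the cursig/postsig step) and then reaches the suspend check on the
 * exit path, where a set stop flag wins over the exit.
 */
void model_return_to_user(struct model_proc *p, size_t tid)
{
    assert(tid < MODEL_NTHREADS);
    if (!p->kill_pending[tid])
        return;
    p->kill_pending[tid] = false;   /* the signal is gone from the siglist */
    if (p->stopped_sig)
        p->suspended[tid] = true;   /* the outcome reported above */
    else
        p->exited[tid] = true;
}

/* Run the interleaving and return how many threads ended up suspended. */
size_t model_run_race(bool stop_arrives)
{
    struct model_proc p = {0};
    size_t suspended = 0;

    model_post_kill(&p);
    if (stop_arrives)
        model_post_stop(&p);
    for (size_t i = 0; i < MODEL_NTHREADS; i++) {
        model_return_to_user(&p, i);
        if (p.suspended[i])
            suspended++;
    }
    return suspended;
}
```

Calling `model_run_race(true)` leaves every modeled thread suspended with nothing pending, which matches the state seen when dumping the threads. `model_run_race(false)` lets them all exit.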

It would seem that any signals posted after the process receives a SIGKILL should be ignored, but how do we detect that, especially after SIGKILL has been cleared from the siglist because the thread is in the middle of taking the signal? Alternatively, if the signal being taken is SIGKILL, the kernel needs to avoid saying "I'll stop the process now because I've been asked to".

Any good solutions to this problem?

Thanks,
Ravi Murty
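
A minimal sketch of the idea in the last paragraph, assuming a process-level marker that records that SIGKILL has been posted and that outlives the per-thread siglist entry. The `kill_taken` field and all `model_*` names are hypothetical, invented for this illustration. This is a user-space toy, not a patch against the kernel.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

enum model_action { MODEL_CONTINUE, MODEL_SUSPEND, MODEL_EXIT };

/* Hypothetical process state for the illustration. */
struct model_proc {
    bool stop_requested;  /* models P_STOPPED_SIG */
    bool kill_taken;      /* assumed marker: set when SIGKILL is posted, never cleared */
};

/* Posting SIGKILL records the fact at process level. */
void model_post_kill(struct model_proc *p)
{
    p->kill_taken = true;
}

/* A stop request that arrives after SIGKILL is dropped. */
void model_post_stop(struct model_proc *p)
{
    if (p->kill_taken)
        return;
    p->stop_requested = true;
}

/* The suspend check lets a recorded SIGKILL override an earlier stop. */
enum model_action model_suspend_check(const struct model_proc *p)
{
    assert(p != NULL);
    if (p->kill_taken)
        return MODEL_EXIT;
    if (p->stop_requested)
        return MODEL_SUSPEND;
    return MODEL_CONTINUE;
}
```

The check is made in two places on purpose: dropping the late SIGSTOP covers the ordering described in this mail, and the override in the suspend check covers a stop that was requested before the kill.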