From owner-freebsd-arch@FreeBSD.ORG  Fri Sep  7 16:35:54 2012
Return-Path: <owner-freebsd-arch@FreeBSD.ORG>
Delivered-To: arch@freebsd.org
Received: from mx1.freebsd.org (mx1.freebsd.org [69.147.83.52])
	by hub.freebsd.org (Postfix) with ESMTP id 5DDA0106564A;
	Fri,  7 Sep 2012 16:35:54 +0000 (UTC)
	(envelope-from yanegomi@gmail.com)
Received: from mail-ob0-f182.google.com (mail-ob0-f182.google.com
	[209.85.214.182])
	by mx1.freebsd.org (Postfix) with ESMTP id 0DA808FC0A;
	Fri,  7 Sep 2012 16:35:53 +0000 (UTC)
Received: by obbun3 with SMTP id un3so6245914obb.13
	for <multiple recipients>; Fri, 07 Sep 2012 09:35:52 -0700 (PDT)
DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=gmail.com; s=20120113;
	h=mime-version:in-reply-to:references:date:message-id:subject:from:to
	:cc:content-type;
	bh=vG18D5HaLZacX9N297J/xlM4lTYwj78HmhbIIjCNiXQ=;
	b=tVmR1/DNdNr1Dsi/VN/krJdmUFFwOns6guLn0olheZ/x/fLq5jpz6VZSEMxlp3dDsW
	e7eeMIOCMfox/EMqlc20gVRYDVFOpY5YJchEg3VPTcfVWHQiTB47Wp8zwm/x/auKjvt2
	aHcwdny4wyKvo7I2LI3Bc+OZ7JE6uygghWBWp0PTjQ1jHuPd31zY2pNUpDSQugWpF2+a
	QFDQ/1KReOLrdoTXUrirGhmwE8Q30cawX3th/jNeDJLN5jagNGoj3HDgBIgpRfBKvoRC
	viShk5qAgzWXAYSp2Il9FSTlyF1UG9kXw4p9MHMjAamJhB4r0cehhItbeOXSXxUZW9ul
	RKng==
MIME-Version: 1.0
Received: by 10.60.20.69 with SMTP id l5mr6642057oee.114.1347035752616; Fri,
	07 Sep 2012 09:35:52 -0700 (PDT)
Received: by 10.76.142.201 with HTTP; Fri, 7 Sep 2012 09:35:52 -0700 (PDT)
In-Reply-To: <201209071217.47439.jhb@freebsd.org>
References: <201209071217.47439.jhb@freebsd.org>
Date: Fri, 7 Sep 2012 09:35:52 -0700
Message-ID: <CAGH67wRu91eOmB-cxWOkVDvLPZ-BpLs=GuYhwwbD9s697xpkFA@mail.gmail.com>
From: Garrett Cooper <yanegomi@gmail.com>
To: John Baldwin <jhb@freebsd.org>
Content-Type: text/plain; charset=ISO-8859-1
Cc: arch@freebsd.org
Subject: Re: [PATCH] Close a race between exit1() and SIGSTOP
X-BeenThere: freebsd-arch@freebsd.org
X-Mailman-Version: 2.1.5
Precedence: list
List-Id: Discussion related to FreeBSD architecture <freebsd-arch.freebsd.org>
List-Unsubscribe: <http://lists.freebsd.org/mailman/listinfo/freebsd-arch>,
	<mailto:freebsd-arch-request@freebsd.org?subject=unsubscribe>
List-Archive: <http://lists.freebsd.org/pipermail/freebsd-arch>
List-Post: <mailto:freebsd-arch@freebsd.org>
List-Help: <mailto:freebsd-arch-request@freebsd.org?subject=help>
List-Subscribe: <http://lists.freebsd.org/mailman/listinfo/freebsd-arch>,
	<mailto:freebsd-arch-request@freebsd.org?subject=subscribe>
X-List-Received-Date: Fri, 07 Sep 2012 16:35:54 -0000

On Fri, Sep 7, 2012 at 9:17 AM, John Baldwin <jhb@freebsd.org> wrote:
> We ran into a bug at work recently where processes were reporting an exit
> status from wait(2) of WIFSIGNALED() with WTERMSIG() of SIGSTOP.  I narrowed
> this down to a race in exit1().  Specifically, exit1() sets p->p_xstat to the
> passed in exit status and sets P_WEXIT.  It then drops the PROC_LOCK() to do
> other work such as freeing file descriptors, etc.  Later it reacquires the
> PROC_LOCK(), marks the process as a zombie, and terminates.  During the window
> where the PROC_LOCK() is not held, if a stop signal arrives (SIGSTOP, SIGTSTP,
> etc.), then the signal code will overwrite p->p_xstat with the signal that
> initiated the stop.  The end result is that setting from SIGSTOP stays in
> p->p_xstat and is reported via wait(2).
>
> My first attempt at a fix was to simply ignore all signals once P_WEXIT is
> set by adding a check for P_WEXIT to the existing check for PRS_ZOMBIE.
> However, I think this is not quite right.  For example, if an exiting process
> gets hung on an interruptible NFS mount while doing fdfree() I think we want a
> user to be able to kill the process to let it break out of NFS and finish
> exiting.
>
> The second fix I have is to explicitly clear P_STOPPED_SIGNAL to discard any
> pending SIGSTOP when setting P_WEXIT and to explicitly disallow any stop
> signals for a process that is currently exiting (that is, P_WEXIT is set).
> This fixes the race I ran into while still allowing other signals during
> process exit.  The patch to do that is below.  Below that is a test program
> that reproduces the issue.
>
> I would really like to get some feedback on which approach is best and if my
> concerns about allowing other signals during exit1() is a legitimate concern.
> (For some reason I feel like I've seen an argument for allowing that before.)

Does this case continue to function normally?

$ sh -c 'while :; do sleep 1; done'
^Z
[1]+  Stopped                 sh -c 'while :; do sleep 1; done'
$ kill %1
$ fg
bash: fg: job has terminated
[1]+  Terminated: 15          sh -c 'while :; do sleep 1; done'

$ sh -c 'while :; do sleep 1; done'
^Z
[1]+  Stopped                 sh -c 'while :; do sleep 1; done'
$ kill %1
$ kill %1
bash: kill: (3522) - No such process
[1]+  Terminated: 15          sh -c 'while :; do sleep 1; done'

In particular, a single kill results in the signal being sent after
the process wakes up, and a double kill results in the immediate death
of the process (in this case a shell job) (in part because of the
signals /bin/sh masks).

Thanks!
-Garrett