From owner-freebsd-arch@FreeBSD.ORG Fri Sep 7 16:35:54 2012 Return-Path: Delivered-To: arch@freebsd.org Received: from mx1.freebsd.org (mx1.freebsd.org [69.147.83.52]) by hub.freebsd.org (Postfix) with ESMTP id 5DDA0106564A; Fri, 7 Sep 2012 16:35:54 +0000 (UTC) (envelope-from yanegomi@gmail.com) Received: from mail-ob0-f182.google.com (mail-ob0-f182.google.com [209.85.214.182]) by mx1.freebsd.org (Postfix) with ESMTP id 0DA808FC0A; Fri, 7 Sep 2012 16:35:53 +0000 (UTC) Received: by obbun3 with SMTP id un3so6245914obb.13 for ; Fri, 07 Sep 2012 09:35:52 -0700 (PDT) DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=gmail.com; s=20120113; h=mime-version:in-reply-to:references:date:message-id:subject:from:to :cc:content-type; bh=vG18D5HaLZacX9N297J/xlM4lTYwj78HmhbIIjCNiXQ=; b=tVmR1/DNdNr1Dsi/VN/krJdmUFFwOns6guLn0olheZ/x/fLq5jpz6VZSEMxlp3dDsW e7eeMIOCMfox/EMqlc20gVRYDVFOpY5YJchEg3VPTcfVWHQiTB47Wp8zwm/x/auKjvt2 aHcwdny4wyKvo7I2LI3Bc+OZ7JE6uygghWBWp0PTjQ1jHuPd31zY2pNUpDSQugWpF2+a QFDQ/1KReOLrdoTXUrirGhmwE8Q30cawX3th/jNeDJLN5jagNGoj3HDgBIgpRfBKvoRC viShk5qAgzWXAYSp2Il9FSTlyF1UG9kXw4p9MHMjAamJhB4r0cehhItbeOXSXxUZW9ul RKng== MIME-Version: 1.0 Received: by 10.60.20.69 with SMTP id l5mr6642057oee.114.1347035752616; Fri, 07 Sep 2012 09:35:52 -0700 (PDT) Received: by 10.76.142.201 with HTTP; Fri, 7 Sep 2012 09:35:52 -0700 (PDT) In-Reply-To: <201209071217.47439.jhb@freebsd.org> References: <201209071217.47439.jhb@freebsd.org> Date: Fri, 7 Sep 2012 09:35:52 -0700 Message-ID: From: Garrett Cooper To: John Baldwin Content-Type: text/plain; charset=ISO-8859-1 Cc: arch@freebsd.org Subject: Re: [PATCH] Close a race between exit1() and SIGSTOP X-BeenThere: freebsd-arch@freebsd.org X-Mailman-Version: 2.1.5 Precedence: list List-Id: Discussion related to FreeBSD architecture List-Unsubscribe: , List-Archive: List-Post: List-Help: List-Subscribe: , X-List-Received-Date: Fri, 07 Sep 2012 16:35:54 -0000 On Fri, Sep 7, 2012 at 9:17 AM, John Baldwin wrote: > We ran into a bug at work recently where processes were reporting an exit > status from wait(2) of WIFSIGNALED() with WTERMSIG() of SIGSTOP. I narrowed > this down to a race in exit1(). Specifically, exit1() sets p->p_xstat to the > passed in exit status and sets P_WEXIT. It then drops the PROC_LOCK() to do > other work such as freeing file descriptors, etc. Later it reacquires the > PROC_LOCK(), marks the process as a zombie, and terminates. During the window > where the PROC_LOCK() is not held, if a stop signal arrives (SIGSTOP, SIGTSTP, > etc.), then the signal code will overwrite p->p_xstat with the signal that > initiated the stop. The end result is that setting from SIGSTOP stays in > p->p_xstat and is reported via wait(2). > > My first attempt at a fix was to simply ignore all signals once P_WEXIT is > set by adding a check for P_WEXIT to the existing check for PRS_ZOMBIE. > However, I think this is not quite right. For example, if an exiting process > gets hung on an interruptible NFS mount while doing fdfree() I think we want a > user to be able to kill the process to let it break out of NFS and finish > exiting. > > The second fix I have is to explicitly clear P_STOPPED_SIGNAL to discard any > pending SIGSTOP when setting P_WEXIT and to explicitly disallow any stop > signals for a process that is currently exiting (that is, P_WEXIT is set). > This fixes the race I ran into while still allowing other signals during > process exit. The patch to do that is below. Below that is a test program > that reproduces the issue. > > I would really like to get some feedback on which approach is best and if my > concerns about allowing other signals during exit1() is a legitimate concern. > (For some reason I feel like I've seen an argument for allowing that before.) Does this case continue to function normally? $ sh -c 'while :; do sleep 1; done' ^Z [1]+ Stopped sh -c 'while :; do sleep 1; done' $ kill %1 $ fg bash: fg: job has terminated [1]+ Terminated: 15 sh -c 'while :; do sleep 1; done' $ sh -c 'while :; do sleep 1; done' ^Z [1]+ Stopped sh -c 'while :; do sleep 1; done' $ kill %1 $ kill %1 bash: kill: (3522) - No such process [1]+ Terminated: 15 sh -c 'while :; do sleep 1; done' In particular, a single kill results in the signal being sent after the process wakes up, and a double kill results in the immediate death of the process (in this case a shell job) (in part because of the signals /bin/sh masks). Thanks! -Garrett