Date: Fri, 7 Sep 2012 09:35:52 -0700 From: Garrett Cooper <yanegomi@gmail.com> To: John Baldwin <jhb@freebsd.org> Cc: arch@freebsd.org Subject: Re: [PATCH] Close a race between exit1() and SIGSTOP Message-ID: <CAGH67wRu91eOmB-cxWOkVDvLPZ-BpLs=GuYhwwbD9s697xpkFA@mail.gmail.com> In-Reply-To: <201209071217.47439.jhb@freebsd.org> References: <201209071217.47439.jhb@freebsd.org>
next in thread | previous in thread | raw e-mail | index | archive | help
On Fri, Sep 7, 2012 at 9:17 AM, John Baldwin <jhb@freebsd.org> wrote: > We ran into a bug at work recently where processes were reporting an exit > status from wait(2) of WIFSIGNALED() with WTERMSIG() of SIGSTOP. I narrowed > this down to a race in exit1(). Specifically, exit1() sets p->p_xstat to the > passed in exit status and sets P_WEXIT. It then drops the PROC_LOCK() to do > other work such as freeing file descriptors, etc. Later it reacquires the > PROC_LOCK(), marks the process as a zombie, and terminates. During the window > where the PROC_LOCK() is not held, if a stop signal arrives (SIGSTOP, SIGTSTP, > etc.), then the signal code will overwrite p->p_xstat with the signal that > initiated the stop. The end result is that setting from SIGSTOP stays in > p->p_xstat and is reported via wait(2). > > My first attempt at a fix was to simply ignore all signals once P_WEXIT is > set by adding a check for P_WEXIT to the existing check for PRS_ZOMBIE. > However, I think this is not quite right. For example, if an exiting process > gets hung on an interruptible NFS mount while doing fdfree() I think we want a > user to be able to kill the process to let it break out of NFS and finish > exiting. > > The second fix I have is to explicitly clear P_STOPPED_SIGNAL to discard any > pending SIGSTOP when setting P_WEXIT and to explicitly disallow any stop > signals for a process that is currently exiting (that is, P_WEXIT is set). > This fixes the race I ran into while still allowing other signals during > process exit. The patch to do that is below. Below that is a test program > that reproduces the issue. > > I would really like to get some feedback on which approach is best and if my > concerns about allowing other signals during exit1() is a legitimate concern. > (For some reason I feel like I've seen an argument for allowing that before.) Does this case continue to function normally? $ sh -c 'while :; do sleep 1; done' ^Z [1]+ Stopped sh -c 'while :; do sleep 1; done' $ kill %1 $ fg bash: fg: job has terminated [1]+ Terminated: 15 sh -c 'while :; do sleep 1; done' $ sh -c 'while :; do sleep 1; done' ^Z [1]+ Stopped sh -c 'while :; do sleep 1; done' $ kill %1 $ kill %1 bash: kill: (3522) - No such process [1]+ Terminated: 15 sh -c 'while :; do sleep 1; done' In particular, a single kill results in the signal being sent after the process wakes up, and a double kill results in the immediate death of the process (in this case a shell job) (in part because of the signals /bin/sh masks). Thanks! -Garrett
Want to link to this message? Use this URL: <https://mail-archive.FreeBSD.org/cgi/mid.cgi?CAGH67wRu91eOmB-cxWOkVDvLPZ-BpLs=GuYhwwbD9s697xpkFA>