Date: Tue, 19 Apr 2005 12:38:24 +0800
From: David Xu <davidxu@freebsd.org>
To: Peter Edwards <peadar.edwards@gmail.com>
Cc: FreeBSD Current <current@freebsd.org>
Subject: Re: Race condition in debugger?
Message-ID: <42648B40.6040701@freebsd.org>
In-Reply-To: <34cb7c8405041717342891f2@mail.gmail.com>
References: <20050214014217.GB85932@wantadilla.lemis.com> <34cb7c8405041717342891f2@mail.gmail.com>
Peter Edwards wrote:

>[Very late response: I just experienced the same problem and
>remembered the issue had been brought up before]
>
>On 2/14/05, Greg 'groggy' Lehey <grog@freebsd.org> wrote:
>
>>I'm having some problems with userland gdb on recent -CURRENT builds:
>>at some point it hangs.
>>
>>Specifically, I'm setting a conditional breakpoint like this:
>>
>>  b Minsert_blockletpointer if I->inode_num == 0x1f0bb
>>
>>inode_num increments for 1, so I hit this breakpoint about 100,000
>>times.  Or I should.  What happens is that the debugger hangs at some
>>point on the way.  ktrace shows multiple copies of:
>>
>> 12325 gdb      CALL  ptrace(12,0x3026,0xbfbfd5e0,0)
>> 12325 gdb      RET   ptrace 0
>> 12325 gdb      CALL  ptrace(PT_STEP,0x3026,0x1,0)
>> 12325 gdb      RET   ptrace 0
>> 12325 gdb      CALL  wait4(0xffffffff,0xbfbfd808,0,0)    <-- stops here
>> 12325 gdb      RET   wait4 12326/0x3026
>> 12325 gdb      CALL  kill(0x3026,0)
>> 12325 gdb      RET   kill 0
>> 12325 gdb      CALL  ptrace(PT_GETREGS,0x3026,0xbfbfd5c0,0)
>>
>>When it hangs, it's at the call to wait4, as shown.  It looks like the
>>completion of the ptrace request isn't being reported back.
>>
>
>I think I know what's going on with this, and I have a feeling that
>there's a couple of other wait()-related issues that were left open on
>the lists that might be explained by the issue.
>
>Here's my hypothesis: kern_wait() checks each child of the current
>process to see if they have exited, or should otherwise report status
>to wait/wait3/wait4/waitpid. If it finds that all candidate children
>have nothing to report, it goes to sleep, waiting to be awoken by the/a
>child reporting status, and repeats the process. It looks a bit like
>this:
>
>kern_wait()
>{
>loop:
>    foreach child of self {
>        if (child has status to report)
>            return status;
>    }
>    lock self
>    msleep(on "self")
>    unlock self
>    goto loop;
>}
>
>Problem is, there's no lock guaranteeing that the conditions checked in
>the inner loop still hold by the time the current process locks its own
>"struct proc" and invokes msleep(). (The race is probably most likely
>to happen on an SMP machine or with PREEMPTION, but the acquisition of
>curproc's lock could possibly cause the issue if it needed to sleep.)
>I.e., you can miss the wakeup generated by a particular child between
>checking the process in the inner loop and going to sleep.
>
>I can at least reproduce this for the ptrace/gdb case, but AFAICT, it
>could happen for the standard wait()/exit() path, too. I worked up a
>patch to fix the problem by having those parts of the kernel that wake
>the process up flag the fact in the parent's flags and do the
>wakeup while holding the parent process lock, and by noticing if this
>flag has been set before sleeping. (A simpler solution would be to
>hold the parent lock across the bulk of kern_wait, but from what I can
>gather this will lead to at least one LOR.)
>
>I've been unable to reproduce the problem with a kernel with this
>patch, and a nice sprinkling of printfs shows that when GDB
>hangs, the race has just occurred.
>
>Anyone got opinions on this?
>Cheers,
>Peadar.
>

I just found another case: if the parent masks SIGCHLD, we get the
race too. I have tested the patch and it works. I will tweak it and
commit it soon.

David Xu
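
For readers following the thread, the flag-plus-lock idea described above
can be sketched in userland with POSIX threads. This is only an analogy
under stated assumptions, not the patch itself: the mutex stands in for
the parent's process lock, the condition variable for the msleep()
channel, and every name here (struct parent, status_pending,
child_report, parent_wait, wait_demo) is hypothetical. It also folds the
scan of children into a single flag check, which the kernel code does not.

```c
#include <assert.h>
#include <pthread.h>
#include <stdbool.h>

/* Hypothetical stand-in for the parent's "struct proc"; not kernel code. */
struct parent {
    pthread_mutex_t lock;           /* plays the role of the parent process lock */
    pthread_cond_t  wakeup;         /* plays the role of the msleep() channel */
    bool            status_pending; /* "a child has reported status" flag */
    int             status;         /* the status the child reports */
};

/*
 * Child side: record the status, set the flag, and issue the wakeup,
 * all while holding the parent's lock.
 */
static void *child_report(void *arg)
{
    struct parent *p = arg;

    pthread_mutex_lock(&p->lock);
    p->status = 42;
    p->status_pending = true;
    pthread_cond_signal(&p->wakeup);
    pthread_mutex_unlock(&p->lock);
    return NULL;
}

/*
 * Parent side: check the flag under the lock before sleeping. If the
 * child reported before we got here, the flag is already set and we
 * never sleep, so the wakeup cannot be lost.
 */
static int parent_wait(struct parent *p)
{
    int status;

    pthread_mutex_lock(&p->lock);
    while (!p->status_pending)
        pthread_cond_wait(&p->wakeup, &p->lock); /* drops the lock while asleep */
    p->status_pending = false;
    status = p->status;
    pthread_mutex_unlock(&p->lock);
    return status;
}

/* Run one parent/child round and return the status the parent saw. */
int wait_demo(void)
{
    struct parent p;
    pthread_t child;
    int rc, status;

    rc = pthread_mutex_init(&p.lock, NULL);
    assert(rc == 0);
    rc = pthread_cond_init(&p.wakeup, NULL);
    assert(rc == 0);
    p.status_pending = false;
    p.status = 0;

    rc = pthread_create(&child, NULL, child_report, &p);
    assert(rc == 0);
    status = parent_wait(&p);
    pthread_join(child, NULL);

    pthread_cond_destroy(&p.wakeup);
    pthread_mutex_destroy(&p.lock);
    return status;
}
```

wait_demo() returns 42 regardless of whether the child runs before or
after the parent reaches parent_wait(). Removing the status_pending check
and sleeping unconditionally would reproduce the hang whenever the child
signals first.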
Want to link to this message? Use this URL: <https://mail-archive.FreeBSD.org/cgi/mid.cgi?42648B40.6040701>