Date: Thu, 21 Oct 2004 16:15:08 -0500
From: "Jeremy Messenger" <mezz7@cox.net>
To: "John Baldwin" <jhb@freebsd.org>
Cc: threads@freebsd.org
Subject: Re: Infinite loop bug in libc_r on 4.x with condition variables and signals
Message-ID: <opsf8nzizy9aq2h7@mezz.mezzweb.com>
In-Reply-To: <200410211254.22805.jhb@FreeBSD.org>
References: <Pine.GSO.4.43.0410201733150.18398-100000@sea.ntplx.net> <200410211254.22805.jhb@FreeBSD.org>
On Thu, 21 Oct 2004 12:54:22 -0400, John Baldwin <jhb@FreeBSD.org> wrote:

> On Wednesday 20 October 2004 05:39 pm, Daniel Eischen wrote:
>> On Wed, 20 Oct 2004, John Baldwin wrote:
>> > We are trying to run mono on 4.x and are having problems with the
>> > process getting stuck spinning in an infinite loop.  After some
>> > debugging, we determined that the problem is that the condition
>> > variable thread queues are getting corrupted due to threads being
>> > added to a queue while they are already queued on another queue.
>> > For example, if a thread is somehow on c1's queue but runs and
>> > blocks on c2, later when c1 tries to do a broadcast, it tries to
>> > remove all the waiters to wake them up doing something like:
>> >
>> >     while ((head = TAILQ_FIRST(&c1->c_queue)) != NULL) {
>> >     }
>> >
>> > The problem is that since the thread was last added to c2's queue,
>> > his tqe_prev pointer in his sqe TAILQ_ENTRY points to an item on
>> > c2's list, and thus the c_queue.tqe_next pointer doesn't get
>> > updated by TAILQ_REMOVE, so the thread just "sticks" on c1's head
>> > pointer and it spins forever.
>> >
>> > We seem to have tracked this down to some sort of bug related to
>> > signals and condition variables.  It seems that we try to go handle
>> > a signal while we are on a condition variable queue, but not in
>> > PS_COND_WAIT, so _cond_wait_backout() is not called to remove the
>> > thread from the queue.  I tried deferring signals around the cond
>> > queue manipulations in cond_wait() and cond_timedwait() but we are
>> > still seeing the problem.  The patches we currently are using
>> > (including debug cruft) are below.  Right now we see the assertion
>> > in _thread_sig_wrapper() firing, but if I remove that, one of the
>> > assertions in the condition variable code that check for threads
>> > not being on the right condition variable queue triggers instead.
>> > Does anyone have any other ideas of how a thread could catch a
>> > signal while PS_RUNNING and on a condition variable queue?  (I'm
>> > also worried that the wait() functions assume that if the thread is
>> > interrupted, it's always not on the queue, but that doesn't seem to
>> > be the case for pthread_cancel() for example.)
>>
>> I'm not sure what's going on, but I do know that you can't call
>> pthread_cond_wait() from a signal handler.  If a thread is blocked
>> on (taking your example) condition variable c1, then a signal
>> interrupts it and it again blocks on condition variable c2, that
>> behavior is undefined (by POSIX).
>
> The behavior seems more to be this:
>
> - thread does pthread_cond_wait*(c1)
> - thread enqueued on c1
> - thread interrupted by a signal while on c1 but still in PS_RUNNING
> - thread saves state which excludes the PTHREAD_FLAGS_IN_CONDQ flag
>   (among others)
> - thread calls _cond_wait_backout() if state is PS_COND_WAIT (but it's
>   not in this case; this is the normal case though, which is why it's
>   ok to not save the CONDQ flag in the saved state above)
> - thread executes signal handler
> - thread restores state
> - pthread_cond_wait*() sees that interrupted is 0, so doesn't try to
>   remove the thread from the condition variable (also,
>   PTHREAD_FLAGS_IN_CONDQ isn't set either, so we can't detect this
>   case that way)
> - thread returns from pthread_cond_wait() (maybe due to timeout, etc.)
> - thread calls pthread_cond_wait*(c2)
> - thread enqueued on c2
> - another thread does pthread_cond_broadcast(c2), and bewm
>
> My question is: is it possible for the thread to get interrupted and
> chosen to run a signal while it is on c1 somehow, given my patch to
> defer signals around the wait loops (and is that patch correct btw
> given the above scenario?)
>
>> Another thing to watch out for is longjmps out of signal handlers
>> after being interrupted while waiting on a condition variable.
>> I think libc_r should handle this, but there could be a bug
>> lurking in that respect.
>
> The thing to note is that my assertion in _thread_sig_wrapper() about
> being on a condition variable queue and executing a handler is that it
> is placed after _cond_wait_backout() could be called (but won't be for
> PS_RUNNING), and before the signal handler itself is called.
>
>> I'll take a look at libc_r and see if I can spot anything obvious.
>
> Ok, thanks.  FWIW, it seems that on 5.3 with KSE, mono does much
> better, but we still see rare hangs, so it may be that if this bug is
> fixed it might be present in libpthread on 5 as well.

You can check this thread if you are interested. It's not about
libc_r, but about Mono running on FreeBSD 5.3, where the threads get
corrupted if you run 'mono -pkg:foopkg foo.cs'.

http://lists.freebsd.org/pipermail/freebsd-threads/2004-October/thread.html#2540

If you know of other fixes, secrets, etc., it would be nice if you
could pass the info on to the bsd-sharp project[1]. Tom has more or
less taken it over for now while the maintainer of lang/mono is busy
or has disappeared. Mono works better in bsd-sharp's lang/mono than in
FreeBSD's lang/mono.

[1] http://forge.novell.com/modules/xfmod/project/?bsd-sharp

Cheers,
Mezz

-- 
mezz7@cox.net  -  mezz@FreeBSD.org
FreeBSD GNOME Team
http://www.FreeBSD.org/gnome/  -  gnome@FreeBSD.org
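
P.S. For anyone following along, the "sticks on c1's head" failure quoted above can be reproduced outside libc_r with nothing but <sys/queue.h>. This is only a rough sketch under my own assumptions: struct thread, struct cond and drain_spins() are hypothetical stand-ins, not the real libc_r types or code, and the loop is capped at max_spins so that it terminates instead of spinning forever.

```c
#include <assert.h>
#include <stddef.h>
#include <sys/queue.h>

/* Hypothetical stand-ins for libc_r's thread and condition variable types. */
struct thread {
	TAILQ_ENTRY(thread) sqe;	/* one linkage shared by every wait queue */
};
TAILQ_HEAD(waitq, thread);

struct cond {
	struct waitq c_queue;
};

/*
 * Queue one thread on c1, queue it again on c2 without removing it from
 * c1, then run a broadcast-style drain loop on c1.  Returns the number of
 * loop iterations, giving up after max_spins.
 */
int
drain_spins(int max_spins)
{
	struct cond c1, c2;
	struct thread t;
	struct thread *head;
	int spins = 0;

	TAILQ_INIT(&c1.c_queue);
	TAILQ_INIT(&c2.c_queue);

	/* The thread waits on c1 ... */
	TAILQ_INSERT_TAIL(&c1.c_queue, &t, sqe);
	/* ... and later lands on c2 while c1 still points at it. */
	TAILQ_INSERT_TAIL(&c2.c_queue, &t, sqe);

	/* The entry's back pointer now refers to c2's head, not c1's. */
	assert(t.sqe.tqe_prev == &c2.c_queue.tqh_first);

	/*
	 * TAILQ_REMOVE writes through tqe_prev, so it clears c2's first
	 * pointer and leaves c1's untouched: c1's head never changes.
	 */
	while ((head = TAILQ_FIRST(&c1.c_queue)) != NULL &&
	    spins < max_spins) {
		TAILQ_REMOVE(&c1.c_queue, head, sqe);
		spins++;
	}
	return (spins);
}
```

Calling drain_spins(1000) should run the loop the full 1000 times, since every pass finds the same thread at the head of c1; without the cap, that is the infinite loop from the original report.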
Want to link to this message? Use this URL: <https://mail-archive.FreeBSD.org/cgi/mid.cgi?opsf8nzizy9aq2h7>