From: "Jeremy Messenger"
Date: Thu, 21 Oct 2004 16:15:08 -0500
To: "John Baldwin"
cc: Daniel Eischen
cc: threads@freebsd.org
Subject: Re: Infinite loop bug in libc_r on 4.x with condition variables and signals

On Thu, 21 Oct 2004 12:54:22 -0400, John Baldwin wrote:

> On Wednesday 20 October 2004 05:39 pm, Daniel Eischen wrote:
>> On Wed, 20 Oct 2004, John Baldwin wrote:
>> > We are trying to run mono on 4.x and are having problems with the process
>> > getting stuck spinning in an infinite loop. After some debugging, we
>> > determined that the problem is that the condition variable thread queues
>> > are getting corrupted due to threads being added to a queue while they
>> > are already queued on another queue.
>> > For example, if a thread is somehow
>> > on c1's queue but runs and blocks on c2, later when c1 tries to do a
>> > broadcast, it tries to remove all the waiters to wake them up doing
>> > something like:
>> >
>> > while ((head = TAILQ_FIRST(&c1->c_queue)) != NULL) {
>> > }
>> >
>> > The problem is that since the thread was last added to c2's queue, his
>> > tqe_prev pointer in his sqe TAILQ_ENTRY points to an item on c2's list,
>> > and thus the c_queue.tqe_next pointer doesn't get updated by
>> > TAILQ_REMOVE, so the thread just "sticks" on c1's head pointer and it
>> > spins forever.
>> >
>> > We seemed to have tracked this down to some sort of bug related to
>> > signals and condition variables. It seems that we try to go handle a
>> > signal while we are on a condition variable queue, but not in
>> > PS_COND_WAIT, so
>> > _cond_wait_backout() is not called to remove the thread from the queue.
>> > I tried deferring signals around the cond queue manipulations in
>> > cond_wait() and cond_timedwait() but we are still seeing the problem.
>> > The patches we currently are using (including debug cruft) are below.
>> > Right now we see the assertion in _thread_sig_wrapper() firing, but if I
>> > remove that, one of the assertions in the condition variable code that
>> > check for threads not being on the right condition variable queue trigger
>> > instead. Does anyone have any other ideas of how a thread could catch a
>> > signal while PS_RUNNING and on a condition variable queue? (I'm also
>> > worried that the wait() functions assume that if the thread is
>> > interrupted, its always not on the queue, but that doesn't seem to be the
>> > case for pthread_cancel() for example.)
>>
>> I'm not sure what's going on, but I do know that you can't call
>> pthread_cond_wait() from a signal handler.
>> If a thread is blocked
>> on (taking your example) condition variable c1, then a signal
>> interrupts it and it again blocks on condition variable c2, that
>> behavior is undefined (by POSIX).
>
> The behavior seems more to be this:
>
> - thread does pthread_cond_wait*(c1)
> - thread enqueued on c1
> - thread interrupted by a signal while on c1 but still in PS_RUNNING
> - thread saves state which excludes the PTHREAD_FLAGS_IN_CONDQ flag (among
>   others)
> - thread calls _cond_wait_backout() if state is PS_COND_WAIT (but it's not in
>   this case, this is the normal case though, which is why it's ok to not save
>   the CONDQ flag in the saved state above)
> - thread executes signal handler
> - thread restores state
> - pthread_condwait*() see that interrupted is 0, so don't try to remove the
>   thread from the condition variable (also, PTHREAD_FLAGS_IN_CONDQ isn't set
>   either, so we can't detect this case that way)
> - thread returns from pthread_cond_wait() (maybe due to timeout, etc.)
> - thread calls pthread_cond_wait*(c2)
> - thread enqueued on c2
> - another thread does pthread_cond_broadcast(c2), and bewm
>
> My question is is it possible for the thread to get interrupted and chosen to
> run a signal while it is on c1 somehow given my patch to defer signals around
> the wait loops (and is that patch correct btw given the above scenario?)
>
>> Another thing to watch out for is longjmps out of signal handlers
>> after being interrupted while waiting on a condition variable.
>> I think libc_r should handle this, but there could be a bug
>> lurking in that respect.
>
> The thing to note is that my assertion in _thread_sig_wrapper() about being on
> a condition variable queue and executing a handler is that it is placed after
> _cond_wait_backout() could be called (but won't be for PS_RUNNING), and
> before the signal handler itself is called.
>
>> I'll take a look at libc_r and see if I can spot anything obvious.
>
> Ok, thanks.
> FWIW, it seems that on 5.3 with KSE, mono does much better, but
> we still see rare hangs, so it maybe that if this bug is fixed it might be
> present in libpthread on 5 as well.

You can check this thread if you are interested. It's not about libc_r, but
about Mono running on FreeBSD 5.3, where the threads get corrupted if you run
'mono -pkg:foopkg foo.cs'.

http://lists.freebsd.org/pipermail/freebsd-threads/2004-October/thread.html#2540

If you know of other fixes, tricks, etc., it would be nice if you could pass
them on to the bsd-sharp project[1]. Tom has more or less taken it over for now
while the maintainer of lang/mono is busy or has disappeared. Mono works
better in bsd-sharp's lang/mono than in FreeBSD's lang/mono.

[1] http://forge.novell.com/modules/xfmod/project/?bsd-sharp

Cheers,
Mezz

--
mezz7@cox.net - mezz@FreeBSD.org
FreeBSD GNOME Team
http://www.FreeBSD.org/gnome/ - gnome@FreeBSD.org
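
P.S. For anyone following along, the queue corruption described in the quoted
message can be reproduced outside libc_r. What follows is only a minimal
sketch, assuming nothing beyond the TAILQ macros from <sys/queue.h>. The names
struct waiter, link, drain_capped() and removals_before_empty() are
hypothetical and made up for illustration; none of this is libc_r code. It
puts one element on q1, optionally puts it on q2 as well without removing it
from q1 first, and then drains q1 with a loop that gives up after a fixed
number of removals instead of spinning.

```c
#include <assert.h>
#include <stddef.h>
#include <sys/queue.h>

/* Hypothetical stand-in for a thread with a single queue linkage (like sqe). */
struct waiter {
	TAILQ_ENTRY(waiter) link;
};

TAILQ_HEAD(waitq, waiter);

/*
 * Drain q the way a broadcast would, but stop after `limit` removals.
 * Returns the number of TAILQ_REMOVE calls made before q looked empty,
 * or `limit` if it never did.
 */
static int
drain_capped(struct waitq *q, int limit)
{
	struct waiter *head;
	int n = 0;

	assert(limit > 0);
	while (n < limit && (head = TAILQ_FIRST(q)) != NULL) {
		TAILQ_REMOVE(q, head, link);
		n++;
	}
	return (n);
}

/*
 * Enqueue one waiter on q1, then (if double_enqueue is set) on q2 as well
 * without removing it from q1, and finally drain q1.
 */
static int
removals_before_empty(int double_enqueue, int limit)
{
	struct waitq q1, q2;
	struct waiter w;

	TAILQ_INIT(&q1);
	TAILQ_INIT(&q2);
	TAILQ_INSERT_TAIL(&q1, &w, link);
	if (double_enqueue) {
		/* Overwrites w.link, so its tqe_prev now points into q2. */
		TAILQ_INSERT_TAIL(&q2, &w, link);
	}
	return (drain_capped(&q1, limit));
}
```

With removals_before_empty(0, 5) the drain stops after a single removal. With
removals_before_empty(1, 5) it runs into the cap, because TAILQ_REMOVE writes
through the stale tqe_prev into q2 and never clears q1's first pointer. Without
the cap, that is the spin described above.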