From owner-freebsd-threads@FreeBSD.ORG  Thu Oct 21 20:58:40 2004
From: John Baldwin <jhb@FreeBSD.org>
To: Daniel Eischen <deischen@FreeBSD.org>
Date: Thu, 21 Oct 2004 12:54:22 -0400
User-Agent: KMail/1.6.2
References: <Pine.GSO.4.43.0410201733150.18398-100000@sea.ntplx.net>
In-Reply-To: <Pine.GSO.4.43.0410201733150.18398-100000@sea.ntplx.net>
MIME-Version: 1.0
Content-Disposition: inline
Content-Type: text/plain;
  charset="iso-8859-1"
Content-Transfer-Encoding: 7bit
Message-Id: <200410211254.22805.jhb@FreeBSD.org>
X-Spam-Checker-Version: SpamAssassin 2.63 (2004-01-11) on server.baldwin.cx
cc: threads@FreeBSD.org
Subject: Re: Infinite loop bug in libc_r on 4.x with condition variables and
	signals

On Wednesday 20 October 2004 05:39 pm, Daniel Eischen wrote:
> On Wed, 20 Oct 2004, John Baldwin wrote:
> > We are trying to run mono on 4.x and are having problems with the process
> > getting stuck spinning in an infinite loop.  After some debugging, we
> > determined that the problem is that the condition variable thread queues
> > are getting corrupted due to threads being added to a queue while they
> > are already queued on another queue.  For example, if a thread is somehow
> > on c1's queue but runs and blocks on c2, later when c1 tries to do a
> > broadcast, it tries to remove all the waiters to wake them up doing
> > something like:
> >
> > 	while ((head = TAILQ_FIRST(&c1->c_queue)) != NULL) {
> > 		TAILQ_REMOVE(&c1->c_queue, head, sqe);
> > 		/* ... wake up head ... */
> > 	}
> >
> > The problem is that since the thread was last added to c2's queue, the
> > tqe_prev pointer in its sqe TAILQ_ENTRY points into c2's list, and
> > thus c1's c_queue.tqh_first pointer doesn't get updated by
> > TAILQ_REMOVE, so the thread just "sticks" on c1's head pointer and it
> > spins forever.
> >
> > We seem to have tracked this down to some sort of bug related to
> > signals and condition variables.  It seems that we try to go handle a
> > signal while we are on a condition variable queue, but not in
> > PS_COND_WAIT, so _cond_wait_backout() is not called to remove the
> > thread from the queue.  I tried deferring signals around the cond
> > queue manipulations in cond_wait() and cond_timedwait() but we are
> > still seeing the problem.  The patches we currently are using
> > (including debug cruft) are below.  Right now we see the assertion in
> > _thread_sig_wrapper() firing, but if I remove that, one of the
> > assertions in the condition variable code that check for threads not
> > being on the right condition variable queue triggers instead.  Does
> > anyone have any other ideas of how a thread could catch a signal while
> > PS_RUNNING and on a condition variable queue?  (I'm also worried that
> > the wait() functions assume that if the thread is interrupted, it's
> > never on the queue, but that doesn't seem to be the case for
> > pthread_cancel() for example.)
>
> I'm not sure what's going on, but I do know that you can't call
> pthread_cond_wait() from a signal handler.  If a thread is blocked
> on (taking your example) condition variable c1, then a signal
> interrupts it and it again blocks on condition variable c2, that
> behavior is undefined (by POSIX).

The behavior seems more to be this:

- thread does pthread_cond_wait*(c1)
- thread enqueued on c1
- thread interrupted by a signal while on c1 but still in PS_RUNNING
- thread saves state which excludes the PTHREAD_FLAGS_IN_CONDQ flag (among
  others)
- thread calls _cond_wait_backout() if state is PS_COND_WAIT (but it's not in
  this case; this is the normal case though, which is why it's ok to not save
  the CONDQ flag in the saved state above)
- thread executes signal handler
- thread restores state
- pthread_cond_wait*() sees that interrupted is 0, so it doesn't try to remove
  the thread from the condition variable's queue (also,
  PTHREAD_FLAGS_IN_CONDQ isn't set either, so we can't detect this case that
  way)
- thread returns from pthread_cond_wait() (maybe due to timeout, etc.)
- thread calls pthread_cond_wait*(c2)
- thread enqueued on c2
- another thread does pthread_cond_broadcast(c1), and bewm
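
A minimal sketch of the end state of that sequence, not libc_r code: one link
field gets inserted on two queues, and a broadcast-style drain of the queue the
thread was first on then never makes progress.  The names waiter, waitq,
drain() and double_enqueue_iterations() are made up for illustration; only the
TAILQ macros from <sys/queue.h> and the sqe field name come from the
discussion.  The loop is capped at `limit` so the demonstration terminates.

```c
#include <assert.h>
#include <stddef.h>
#include <sys/queue.h>

/* Hypothetical stand-in for a thread: one link field for every queue. */
struct waiter {
	TAILQ_ENTRY(waiter) sqe;
};
TAILQ_HEAD(waitq, waiter);

/*
 * Dequeue everything on q the way a broadcast would, but give up after
 * `limit` passes.  Returns the number of passes performed.
 */
static int
drain(struct waitq *q, int limit)
{
	struct waiter *w;
	int n = 0;

	assert(limit > 0);
	while (n < limit && (w = TAILQ_FIRST(q)) != NULL) {
		TAILQ_REMOVE(q, w, sqe);
		n++;
	}
	return (n);
}

/* Put one waiter on c1, then on c2 without removing it, then drain c1. */
static int
double_enqueue_iterations(int limit)
{
	struct waitq c1, c2;
	struct waiter t;

	TAILQ_INIT(&c1);
	TAILQ_INIT(&c2);
	TAILQ_INSERT_TAIL(&c1, &t, sqe);
	/* The bug: t is still linked on c1, and this overwrites t.sqe. */
	TAILQ_INSERT_TAIL(&c2, &t, sqe);
	return (drain(&c1, limit));
}
```

With a limit of 5, double_enqueue_iterations() returns 5: each TAILQ_REMOVE
writes through the entry's tqe_prev, which points at c2's head, so c1's
tqh_first is never cleared.  A waiter that was enqueued only once drains in a
single pass.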

My question is: is it possible for the thread to get interrupted and chosen to
run a signal handler while it is on c1's queue, given my patch to defer signals
around the wait loops?  (And is that patch correct, btw, given the above
scenario?)
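
For reference, the general shape of deferring a signal around a critical
section, sketched with plain POSIX calls rather than libc_r's internal
deferral (which is done in userland and is not shown here).  This is an
analogy only, and it assumes a single-threaded process with SIGUSR1 unblocked
on entry; the names on_usr1(), handled and deferred_demo() are hypothetical.

```c
#define _POSIX_C_SOURCE 200809L
#include <assert.h>
#include <signal.h>
#include <stddef.h>
#include <string.h>

static volatile sig_atomic_t handled;

static void
on_usr1(int sig)
{
	(void)sig;
	handled = 1;
}

/*
 * Raise SIGUSR1 while it is blocked and report whether the handler ran
 * before the mask was restored.  Returns -1 on setup failure.
 */
static int
deferred_demo(void)
{
	struct sigaction sa;
	sigset_t block, old;
	int ran_inside, rc;

	handled = 0;
	memset(&sa, 0, sizeof(sa));
	sa.sa_handler = on_usr1;
	sigemptyset(&sa.sa_mask);
	if (sigaction(SIGUSR1, &sa, NULL) != 0)
		return (-1);

	sigemptyset(&block);
	sigaddset(&block, SIGUSR1);
	if (sigprocmask(SIG_BLOCK, &block, &old) != 0)
		return (-1);

	/* Critical section: the queue manipulation would go here. */
	raise(SIGUSR1);			/* stays pending while blocked */
	ran_inside = handled;

	/* Restoring the old mask lets the pending signal be delivered. */
	rc = sigprocmask(SIG_SETMASK, &old, NULL);
	assert(rc == 0);
	(void)rc;
	return (ran_inside);
}
```

The point of the analogy: the handler can only run at the unblock, so whatever
the critical section did to the queues is complete by then.  Whether the same
holds for the patched wait loops is the open question.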

> Another thing to watch out for is longjmps out of signal handlers
> after being interrupted while waiting on a condition variable.
> I think libc_r should handle this, but there could be a bug
> lurking in that respect.

The thing to note about my assertion in _thread_sig_wrapper() (the one that
fires if we are about to execute a handler while on a condition variable queue)
is that it is placed after the point where _cond_wait_backout() could be called
(but won't be for PS_RUNNING), and before the signal handler itself is called.
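
One way to make that kind of check independent of thread state, sketched under
assumptions: give each thread a back-pointer to the condition variable whose
queue it is linked on, and back out of any stale membership before enqueueing
again.  The structures and the on_cond field below are hypothetical stand-ins,
not libc_r's real struct pthread or pthread_cond, and locking and signal
deferral are left out.

```c
#include <assert.h>
#include <stddef.h>
#include <sys/queue.h>

struct cond;

/* Hypothetical thread: on_cond names the queue we are linked on, or NULL. */
struct thread {
	TAILQ_ENTRY(thread) sqe;
	struct cond *on_cond;
};
TAILQ_HEAD(threadq, thread);

/* Hypothetical condition variable holding only its wait queue. */
struct cond {
	struct threadq c_queue;
};

static void
cond_init(struct cond *c)
{
	TAILQ_INIT(&c->c_queue);
}

static void
thread_init(struct thread *t)
{
	t->on_cond = NULL;
}

/* Unlink t from whatever queue it is on.  Returns 1 if it was queued. */
static int
cond_backout(struct thread *t)
{
	if (t->on_cond == NULL)
		return (0);
	TAILQ_REMOVE(&t->on_cond->c_queue, t, sqe);
	t->on_cond = NULL;
	return (1);
}

/*
 * Enqueue t on c, first backing out of any stale membership.
 * Returns 1 if a stale membership had to be cleaned up.
 */
static int
cond_enqueue(struct cond *c, struct thread *t)
{
	int stale;

	assert(c != NULL && t != NULL);
	stale = cond_backout(t);
	TAILQ_INSERT_TAIL(&c->c_queue, t, sqe);
	t->on_cond = c;
	return (stale);
}
```

Enqueueing a thread on c2 while it is still linked on c1 then unlinks it from
c1 first and reports it, instead of leaving c1's head pointing at a thread
whose sqe belongs to c2.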

> I'll take a look at libc_r and see if I can spot anything obvious.

Ok, thanks.  FWIW, it seems that on 5.3 with KSE, mono does much better, but
we still see rare hangs, so it may be that this bug, once it is tracked down,
turns out to be present in libpthread on 5 as well.

-- 
John Baldwin <jhb@FreeBSD.org>  <><  http://www.FreeBSD.org/~jhb/
"Power Users Use the Power to Serve"  =  http://www.FreeBSD.org