From owner-freebsd-threads@FreeBSD.ORG  Thu Oct 21 21:15:04 2004
Return-Path: <owner-freebsd-threads@FreeBSD.ORG>
Delivered-To: freebsd-threads@freebsd.org
Received: from mx1.FreeBSD.org (mx1.freebsd.org [216.136.204.125])
	by hub.freebsd.org (Postfix) with ESMTP
	id 5FE5C16A4CE; Thu, 21 Oct 2004 21:15:04 +0000 (GMT)
Received: from lakermmtao12.cox.net (lakermmtao12.cox.net [68.230.240.27])
	by mx1.FreeBSD.org (Postfix) with ESMTP
	id B377F43D1F; Thu, 21 Oct 2004 21:15:03 +0000 (GMT)
	(envelope-from mezz7@cox.net)
Received: from mezz.mezzweb.com ([68.103.32.140]) by lakermmtao12.cox.net
          (InterMail vM.6.01.03.04 201-2131-111-106-20040729) with ESMTP
          id <20041021211502.HSRZ13338.lakermmtao12.cox.net@mezz.mezzweb.com>;
          Thu, 21 Oct 2004 17:15:02 -0400
Date: Thu, 21 Oct 2004 16:15:08 -0500
To: "John Baldwin" <jhb@freebsd.org>
References: <Pine.GSO.4.43.0410201733150.18398-100000@sea.ntplx.net>
	<200410211254.22805.jhb@FreeBSD.org>
From: "Jeremy Messenger" <mezz7@cox.net>
Content-Type: text/plain; format=flowed; delsp=yes; charset=us-ascii
MIME-Version: 1.0
Content-Transfer-Encoding: 8bit
Message-ID: <opsf8nzizy9aq2h7@mezz.mezzweb.com>
In-Reply-To: <200410211254.22805.jhb@FreeBSD.org>
User-Agent: Opera M2/7.54 (Linux, build 751)
cc: Daniel Eischen <deischen@freebsd.org>
cc: threads@freebsd.org
Subject: Re: Infinite loop bug in libc_r on 4.x with condition variables and
	signals
X-BeenThere: freebsd-threads@freebsd.org
X-Mailman-Version: 2.1.1
Precedence: list
List-Id: Threading on FreeBSD <freebsd-threads.freebsd.org>
List-Unsubscribe: <http://lists.freebsd.org/mailman/listinfo/freebsd-threads>,
	<mailto:freebsd-threads-request@freebsd.org?subject=unsubscribe>
List-Archive: <http://lists.freebsd.org/pipermail/freebsd-threads>
List-Post: <mailto:freebsd-threads@freebsd.org>
List-Help: <mailto:freebsd-threads-request@freebsd.org?subject=help>
List-Subscribe: <http://lists.freebsd.org/mailman/listinfo/freebsd-threads>,
	<mailto:freebsd-threads-request@freebsd.org?subject=subscribe>
X-List-Received-Date: Thu, 21 Oct 2004 21:15:04 -0000

On Thu, 21 Oct 2004 12:54:22 -0400, John Baldwin <jhb@FreeBSD.org> wrote:

> On Wednesday 20 October 2004 05:39 pm, Daniel Eischen wrote:
>> On Wed, 20 Oct 2004, John Baldwin wrote:
>> > We are trying to run mono on 4.x and are having problems with the
>> > process getting stuck spinning in an infinite loop.  After some
>> > debugging, we determined that the problem is that the condition
>> > variable thread queues are getting corrupted due to threads being
>> > added to a queue while they are already queued on another queue.  For
>> > example, if a thread is somehow on c1's queue but runs and blocks on
>> > c2, later when c1 tries to do a broadcast, it tries to remove all the
>> > waiters to wake them up doing something like:
>> >
>> > 	while ((head = TAILQ_FIRST(&c1->c_queue)) != NULL) {
>> > 		TAILQ_REMOVE(&c1->c_queue, head, sqe);
>> > 	}
>> >
>> > The problem is that since the thread was last added to c2's queue, its
>> > tqe_prev pointer in its sqe TAILQ_ENTRY points to an item on c2's
>> > list, and thus the c_queue.tqh_first pointer doesn't get updated by
>> > TAILQ_REMOVE, so the thread just "sticks" on c1's head pointer and it
>> > spins forever.
>> >
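
To make the stuck head pointer concrete, here is a minimal sketch using
the <sys/queue.h> TAILQ macros.  The toy_* names are made up for
illustration and this is not the libc_r code; only the sqe and c_queue
names follow the description above.  The broadcast loop is capped at
max_iter passes so the demo terminates instead of spinning.

```c
#include <stddef.h>
#include <sys/queue.h>

/* Hypothetical stand-ins for a libc_r thread and condition variable. */
struct toy_thread {
	TAILQ_ENTRY(toy_thread) sqe;	/* one linkage shared by all wait queues */
};
TAILQ_HEAD(toy_queue, toy_thread);

struct toy_cond {
	struct toy_queue c_queue;
};

/*
 * Drain cv's wait queue the way a broadcast would, but give up after
 * max_iter passes.  Returns the number of passes performed.
 */
static int
toy_broadcast(struct toy_cond *cv, int max_iter)
{
	struct toy_thread *head;
	int n = 0;

	while (n < max_iter && (head = TAILQ_FIRST(&cv->c_queue)) != NULL) {
		TAILQ_REMOVE(&cv->c_queue, head, sqe);
		n++;
	}
	return (n);
}

/* Thread queued on c1, then on c2 without leaving c1; broadcast on c1. */
static int
toy_stuck_passes(int max_iter)
{
	struct toy_cond c1, c2;
	struct toy_thread t;

	TAILQ_INIT(&c1.c_queue);
	TAILQ_INIT(&c2.c_queue);
	TAILQ_INSERT_TAIL(&c1.c_queue, &t, sqe);
	TAILQ_INSERT_TAIL(&c2.c_queue, &t, sqe);	/* the bug: still on c1 */
	return (toy_broadcast(&c1, max_iter));
}

/* Same broadcast, but the thread is only ever queued on c1. */
static int
toy_clean_passes(int max_iter)
{
	struct toy_cond c1;
	struct toy_thread t;

	TAILQ_INIT(&c1.c_queue);
	TAILQ_INSERT_TAIL(&c1.c_queue, &t, sqe);
	return (toy_broadcast(&c1, max_iter));
}
```

With a cap of 5, toy_stuck_passes(5) uses all 5 passes, because
TAILQ_REMOVE writes through tqe_prev into c2's head and never touches
c1's, while toy_clean_passes(5) finishes after 1 pass.
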
>> > We seem to have tracked this down to some sort of bug related to
>> > signals and condition variables.  It seems that we try to go handle a
>> > signal while we are on a condition variable queue, but not in
>> > PS_COND_WAIT, so _cond_wait_backout() is not called to remove the
>> > thread from the queue.  I tried deferring signals around the cond
>> > queue manipulations in cond_wait() and cond_timedwait() but we are
>> > still seeing the problem.  The patches we currently are using
>> > (including debug cruft) are below.  Right now we see the assertion in
>> > _thread_sig_wrapper() firing, but if I remove that, one of the
>> > assertions in the condition variable code that check for threads not
>> > being on the right condition variable queue triggers instead.  Does
>> > anyone have any other ideas of how a thread could catch a signal while
>> > PS_RUNNING and on a condition variable queue?  (I'm also worried that
>> > the wait() functions assume that if the thread is interrupted, it's
>> > always not on the queue, but that doesn't seem to be the case for
>> > pthread_cancel() for example.)
>>
>> I'm not sure what's going on, but I do know that you can't call
>> pthread_cond_wait() from a signal handler.  If a thread is blocked
>> on (taking your example) condition variable c1, then a signal
>> interrupts it and it again blocks on condition variable c2, that
>> behavior is undefined (by POSIX).
>
> The behavior seems more to be this:
>
> - thread does pthread_cond_wait*(c1)
> - thread enqueued on c1
> - thread interrupted by a signal while on c1 but still in PS_RUNNING
> - thread saves state which excludes the PTHREAD_FLAGS_IN_CONDQ flag
>   (among others)
> - thread calls _cond_wait_backout() if state is PS_COND_WAIT (but it's
>   not in this case; this is the normal case though, which is why it's ok
>   to not save the CONDQ flag in the saved state above)
> - thread executes signal handler
> - thread restores state
> - pthread_cond_wait*() sees that interrupted is 0, so doesn't try to
>   remove the thread from the condition variable (also,
>   PTHREAD_FLAGS_IN_CONDQ isn't set either, so we can't detect this case
>   that way)
> - thread returns from pthread_cond_wait() (maybe due to timeout, etc.)
> - thread calls pthread_cond_wait*(c2)
> - thread enqueued on c2
> - another thread does pthread_cond_broadcast(c2), and bewm
>
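
The sequence above can be modeled as a small state sketch.  PS_RUNNING,
PS_COND_WAIT and PTHREAD_FLAGS_IN_CONDQ are the names from the list; the
flag value, the struct and the toy_* functions are hypothetical and only
mirror the steps as described, not the actual libc_r sources.

```c
#include <stdbool.h>

enum toy_state { PS_RUNNING, PS_COND_WAIT };
#define PTHREAD_FLAGS_IN_CONDQ	0x1	/* value is made up for the sketch */

struct toy_thread {
	enum toy_state state;
	int flags;
	bool interrupted;
	bool linked_on_c1;	/* does c1's queue still reference us? */
};

/* Stand-in for _cond_wait_backout(): unlink and mark as interrupted. */
static void
toy_backout(struct toy_thread *t)
{
	t->linked_on_c1 = false;
	t->flags &= ~PTHREAD_FLAGS_IN_CONDQ;
	t->interrupted = true;
}

/* Stand-in for the signal wrapper path described in the list. */
static void
toy_deliver_signal(struct toy_thread *t)
{
	/* Saved state leaves out the CONDQ flag. */
	int saved = t->flags & ~PTHREAD_FLAGS_IN_CONDQ;

	if (t->state == PS_COND_WAIT)
		toy_backout(t);
	/* ... the signal handler would run here ... */
	t->flags = saved;	/* restore state */
}

/*
 * Deliver a signal to a thread queued on c1 in the given state.  Returns
 * true if the thread ends up still linked on c1 with neither the
 * interrupted marker nor the CONDQ flag left to reveal it.
 */
static bool
toy_leaks_queue_entry(enum toy_state state_at_signal)
{
	struct toy_thread t = {
		state_at_signal, PTHREAD_FLAGS_IN_CONDQ, false, true
	};

	toy_deliver_signal(&t);
	return (t.linked_on_c1 && !t.interrupted &&
	    (t.flags & PTHREAD_FLAGS_IN_CONDQ) == 0);
}
```

In this model toy_leaks_queue_entry(PS_RUNNING) is true and
toy_leaks_queue_entry(PS_COND_WAIT) is false, which matches the claim
that only the PS_RUNNING case leaves a stale entry behind.
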
> My question is: is it possible for the thread to get interrupted and
> chosen to run a signal handler while it is on c1 somehow, given my patch
> to defer signals around the wait loops (and is that patch correct, btw,
> given the above scenario)?
>
>> Another thing to watch out for is longjmps out of signal handlers
>> after being interrupted while waiting on a condition variable.
>> I think libc_r should handle this, but there could be a bug
>> lurking in that respect.
>
> The thing to note about my assertion in _thread_sig_wrapper() about
> being on a condition variable queue and executing a handler is that it
> is placed after _cond_wait_backout() could be called (but won't be for
> PS_RUNNING), and before the signal handler itself is called.
>
>> I'll take a look at libc_r and see if I can spot anything obvious.
>
> Ok, thanks.  FWIW, it seems that on 5.3 with KSE, mono does much better,
> but we still see rare hangs, so it may be that this bug, once fixed in
> libc_r, turns out to be present in libpthread on 5 as well.

You can check this thread if you are interested. It's not about libc_r,
but about Mono running on FreeBSD 5.3, where the threads get corrupted if
you run 'mono -pkg:foopkg foo.cs'.

http://lists.freebsd.org/pipermail/freebsd-threads/2004-October/thread.html#2540

If you know of other fixes, tricks, etc., it would be nice if you could
pass them on to the bsd-sharp project[1]. Tom has more or less taken it
over for now while the maintainer of lang/mono is busy or has disappeared.
Mono works better in bsd-sharp's lang/mono than in FreeBSD's lang/mono.

[1] http://forge.novell.com/modules/xfmod/project/?bsd-sharp

Cheers,
Mezz


-- 
mezz7@cox.net  -  mezz@FreeBSD.org
FreeBSD GNOME Team
http://www.FreeBSD.org/gnome/  -  gnome@FreeBSD.org