From owner-freebsd-threads@FreeBSD.ORG Thu Oct 28 20:41:44 2004 Return-Path: Delivered-To: freebsd-threads@freebsd.org Received: from mx1.FreeBSD.org (mx1.freebsd.org [216.136.204.125]) by hub.freebsd.org (Postfix) with ESMTP id 0BEBB16A4CE; Thu, 28 Oct 2004 20:41:44 +0000 (GMT) Received: from lakermmtao07.cox.net (lakermmtao07.cox.net [68.230.240.32]) by mx1.FreeBSD.org (Postfix) with ESMTP id 47BD243D31; Thu, 28 Oct 2004 20:41:43 +0000 (GMT) (envelope-from mezz7@cox.net) Received: from mezz.mezzweb.com ([68.103.32.140]) by lakermmtao07.cox.net (InterMail vM.6.01.03.04 201-2131-111-106-20040729) with ESMTP id <20041028204136.KIBN22771.lakermmtao07.cox.net@mezz.mezzweb.com>; Thu, 28 Oct 2004 16:41:36 -0400 Date: Thu, 28 Oct 2004 15:42:00 -0500 To: "Daniel Eischen" References: From: "Jeremy Messenger" Content-Type: text/plain; format=flowed; delsp=yes; charset=us-ascii MIME-Version: 1.0 Content-Transfer-Encoding: 8bit Message-ID: In-Reply-To: User-Agent: Opera M2/7.54 (Linux, build 751) cc: threads@freebsd.org cc: John Baldwin Subject: Re: Infinite loop bug in libc_r on 4.x with condition variables and signals X-BeenThere: freebsd-threads@freebsd.org X-Mailman-Version: 2.1.1 Precedence: list List-Id: Threading on FreeBSD List-Unsubscribe: , List-Archive: List-Post: List-Help: List-Subscribe: , X-List-Received-Date: Thu, 28 Oct 2004 20:41:44 -0000 On Wed, 27 Oct 2004 18:30:15 -0400 (EDT), Daniel Eischen wrote: > On Wed, 27 Oct 2004, John Baldwin wrote: > >> On Thursday 21 October 2004 07:04 pm, Daniel Eischen wrote: >> > On Thu, 21 Oct 2004, John Baldwin wrote: >> > > The behavior seems more to be this: >> > > >> > > - thread does pthread_cond_wait*(c1) >> > > - thread enqueued on c1 >> > > - thread interrupted by a signal while on c1 but still in PS_RUNNING >> > >> > This shouldn't happen when signals are deferred. It should >> > only happen when the state is PS_COND_WAIT after we've >> > context switched to the scheduler. >> > >> > > - thread saves state which excludes the PTHREAD_FLAGS_IN_CONDQ flag >> > > (among others) >> > >> > Right, because it assumes that the thread will be backed out of >> > any mutex or CV queues prior to invoking the signal handler. >> > >> > > - thread calls _cond_wait_backout() if state is PS_COND_WAIT (but >> it's >> > > not in - this case, this is the normal case though, which is why >> it's ok >> > > to not save the CONDQ flag in the saved state above) >> > >> > Right. The problem is, how is the thread getting setup for a signal >> > while signals are deferred and the state has not yet been changed >> > from PS_RUNNING to PS_COND_WAIT? >> > >> > > - thread executes signal handler >> > > - thread restores state >> > > - pthread_condwait*() see that interrupted is 0, so don't try to >> remove >> > > the thread from the condition variable (also, PTHREAD_FLAGS_IN_CONDQ >> > > isn't set either, so we can't detect this case that way) >> > > - thread returns from pthread_cond_wait() (maybe due to timeout, >> etc.) >> > > - thread calls pthread_cond_wait*(c2) >> > > - thread enqueued on c2 >> > > - another thread does pthread_cond_broadcast(c2), and bewm >> > > >> > > My question is is it possible for the thread to get interrupted and >> > > chosen to run a signal while it is on c1 somehow given my patch to >> defer >> > > signals around the wait loops (and is that patch correct btw given >> the >> > > above scenario?) >> > >> > Yes (and yes I think). Defering signals just means that the signal >> handler >> > won't try to install a signal frame on the current thread; instead it >> just >> > queues the signal and the scheduler will pick it up and send it to the >> > correct thread. >> > >> > I do think signals should be deferred for condition variables so >> > that setting the thread state (to PS_COND_WAIT) is atomic. >> > >> > It's not obvious to be where the bug is. If you had a simple >> > test case to reproduce it that would help. >> >> FWIW, we are having (I think) the same problem on 5.3 with libpthread. >> The >> panic there is in the mutex code about an assertion failing because a >> thread >> is on a syncq when it is not supposed to be. > > David and I recently fixed some races in pthread_join() and > pthread_exit() in -current libpthread. Don't know if those > were responsible... > > Here's a test program that shows correct behavior with both > libc_r and libpthread in -current. I am not a programmer, but wondering if this races fixed has to do with the SIGUSR1? Asking, because in lang/mono/files/patch-libgc_include_private_gcconfig.h is what fix the Mono build/install hang. If I remove this file and that races fixed should fix this? Maybe, I can backport the fixes in -CURRENT to RELENG_5 by myself and try to test it. What lang/mono/files/patch-libgc_include_private_gcconfig.h has look like this: ===================================== -# define SIG_SUSPEND SIGUSR1 -# define SIG_THR_RESTART SIGUSR2 +# define SIG_SUSPEND SIGTSTP +# define SIG_THR_RESTART SIGCONT ===================================== Cheers, Mezz -- mezz7@cox.net - mezz@FreeBSD.org FreeBSD GNOME Team http://www.FreeBSD.org/gnome/ - gnome@FreeBSD.org