From owner-freebsd-hackers  Thu Nov  8 18:50:20 2001
Delivered-To: freebsd-hackers@freebsd.org
Received: from motgate.mot.com (motgate.mot.com [129.188.136.100])
	by hub.freebsd.org (Postfix) with ESMTP id 77CE637B405
	for <freebsd-hackers@freebsd.org>; Thu,  8 Nov 2001 18:50:09 -0800 (PST)
Received: [from pobox4.mot.com (pobox4.mot.com [10.64.251.243]) by motgate.mot.com (motgate 2.1) with ESMTP id TAA09476 for <freebsd-hackers@freebsd.org>; Thu, 8 Nov 2001 19:50:08 -0700 (MST)]
Received: [from latour.rsch.comm.mot.com (latour.rsch.comm.mot.com [145.1.80.116]) by pobox4.mot.com (MOT-pobox4 2.0) with ESMTP id TAA05476 for <freebsd-hackers@freebsd.org>; Thu, 8 Nov 2001 19:50:08 -0700 (MST)]
Received: (from rittle@localhost)
	by latour.rsch.comm.mot.com (8.11.6/8.11.4) id fA92o7h55180;
	Thu, 8 Nov 2001 20:50:07 -0600 (CST)
	(envelope-from rittle)
Date: Thu, 8 Nov 2001 20:50:07 -0600 (CST)
From: Loren James Rittle <rittle@latour.rsch.comm.mot.com>
Message-Id: <200111090250.fA92o7h55180@latour.rsch.comm.mot.com>
To: freebsd-hackers@freebsd.org
Subject: Report on FreeBSD 4.4 pthread implementation versus boehm-gc
Reply-To: rittle@labs.mot.com
Sender: owner-freebsd-hackers@FreeBSD.ORG
Precedence: bulk
List-ID: <freebsd-hackers.FreeBSD.ORG>
List-Archive: <http://docs.freebsd.org/mail/> (Web Archive)
List-Help: <mailto:majordomo@FreeBSD.ORG?subject=help> (List Instructions)
List-Subscribe: <mailto:majordomo@FreeBSD.ORG?subject=subscribe%20freebsd-hackers>
List-Unsubscribe: <mailto:majordomo@FreeBSD.ORG?subject=unsubscribe%20freebsd-hackers>
X-Loop: FreeBSD.ORG

Hello all,

I have ported the most recent version of boehm-gc (6.1-alpha) to
FreeBSD/i386 under the auspices of the gcc project (it will be in Hans'
6.1 release and it is on the gcc mainline).  I got one notable thing
fully configured beyond what is in the ports tree (which is based on
6.0): threaded GC is now supported.  However, this work has uncovered
either a rare race condition in the 4.X pthread implementation (also
seen on a current 5.0 system) or a bad assumption in the GC signal
code (abstracted below).  Either way, the result seen is an undetected
deadlock.  With the following new assertion, I can at least force the
condition to be detectable in many cases where it would have locked up.

Two questions come to mind: Is there any condition under which my new
assumption should not be true?  Is there any obvious mistake that a
threaded application can make (perhaps related to its signal use) that
could cause the new assumption to ever be violated?

Index: uthread/uthread_exit.c
===================================================================
RCS file: /home/ncvs/src/lib/libc_r/uthread/uthread_exit.c,v
retrieving revision 1.16.2.3
diff -c -r1.16.2.3 uthread_exit.c
*** uthread/uthread_exit.c	12 Jul 2001 21:03:38 -0000	1.16.2.3
--- uthread/uthread_exit.c	7 Nov 2001 04:18:51 -0000
***************
*** 217,222 ****
--- 217,224 ----
  			pthread->suspended = SUSP_NO;
  			break;
  		case SUSP_NO:
+ 			PTHREAD_ASSERT ((pthread->state == PS_JOIN),
+ 					"Target of join has wrong state");
  			/* Make the joining thread runnable: */
  			PTHREAD_NEW_STATE(pthread, PS_RUNNING);
  			break;

I have also seen what I thought was a less important issue, but I now
see that it is probably related.  After reviewing the FreeBSD uthread
source code, the issue appears to be a race between the pthread_exit()
code running in one thread and the pthread_join() code running in
another thread in conjunction with a sigsuspend() call occurring in a
signal handler of that second thread.  Under some conditions, an
errant EINTR would be returned to the pthread_join() caller instead of
the exit code from the terminated thread.  Under other timing
conditions, you get the deadlock spotted with the above new assertion.

This test program displays the problem (I only know how to make the
deadlock/assertion failure reproducible, not the errant return code):

/* This code is an abstraction of that which is found in both
   _Programming with POSIX Threads_ and boehm-gc (taken from 6.1-alpha
   but other versions appear similar). */
#include <unistd.h>
#include <pthread.h>
#include <signal.h>

void handler1 (int s)
{
  sigset_t mask;

  /* boehm-gc code uses a sem_post() and nominally blocks SIGUSR2
     inside this handler instead of the luck method, but that detail
     is not required to see the primary issue at hand. */

  sigfillset (&mask);
  sigdelset (&mask, SIGUSR2);
  sigsuspend (&mask);
}

void handler2 (int s)
{
  /* Do nothing.  Must exist to allow sigsuspend() to work properly. */
}

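void* worker (void* arg)
{
  /* Put the target thread into handler1(), where it waits in sigsuspend(). */
  pthread_kill (*(pthread_t*)arg, SIGUSR1);
  sleep (1);
  /* Release the target thread from sigsuspend() via handler2(). */
  pthread_kill (*(pthread_t*)arg, SIGUSR2);
  return NULL;
}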

int
main (void)
{
  pthread_t w1;
  pthread_t w2;
  pthread_t m = pthread_self ();

  signal (SIGUSR1, handler1);
  signal (SIGUSR2, handler2);

  pthread_create (&w2, NULL, worker, &m);

  return pthread_join (w2, NULL);
}

Comments?  Workaround for the GC code (other than switching to the _np
interface points to stop/start threads, which was the whole point of
the signal tomfoolery)?  Best case: Does anyone see how to better
support this test case in the 4.X uthread implementation?

Regards,
Loren
