From: Loren James Rittle
Date: Thu, 8 Nov 2001 20:50:07 -0600 (CST)
To: freebsd-hackers@freebsd.org
Subject: Report on FreeBSD 4.4 pthread implementation verses boehm-gc
Reply-To: rittle@labs.mot.com
Message-Id: <200111090250.fA92o7h55180@latour.rsch.comm.mot.com>

Hello all,

I have ported the most recent version of boehm-gc (6.1-alpha) to
FreeBSD/i386 under the auspices of the gcc project (the port will be in
Hans' 6.1 release and it is already on the gcc mainline). One notable
thing is now fully configured beyond what is in the ports tree (which
is based on 6.0): threaded GC is supported.

However, this work has uncovered either a rare race condition in the
4.X pthread implementation (also seen on a current 5.0 system) or a bad
assumption in the GC signal code (abstracted below). Either way, the
observed result is an undetected deadlock. With the following new
assertion, I can at least make the condition detectable in many cases
where the program would otherwise have locked up.

Two questions come to mind: Is there any condition under which my new
assumption should not hold? Is there any obvious mistake that a
threaded application can make (perhaps related to its signal use) that
could cause the new assumption to be violated?

Index: uthread/uthread_exit.c
===================================================================
RCS file: /home/ncvs/src/lib/libc_r/uthread/uthread_exit.c,v
retrieving revision 1.16.2.3
diff -c -r1.16.2.3 uthread_exit.c
*** uthread/uthread_exit.c	12 Jul 2001 21:03:38 -0000	1.16.2.3
--- uthread/uthread_exit.c	7 Nov 2001 04:18:51 -0000
***************
*** 217,222 ****
--- 217,224 ----
  				pthread->suspended = SUSP_NO;
  				break;
  			case SUSP_NO:
+ 				PTHREAD_ASSERT ((pthread->state == PS_JOIN),
+ 				    "Target of join has wrong state");
  				/* Make the joining thread runnable: */
  				PTHREAD_NEW_STATE(pthread, PS_RUNNING);
  				break;

I have also seen what I thought was a less important issue, but I now
see that it is probably related. After reviewing the FreeBSD uthread
source code, the issue appears to be a race between the pthread_exit()
code running in one thread and the pthread_join() code running in
another thread, in conjunction with a sigsuspend() call made from a
signal handler running in that second thread. Under some timing
conditions, an errant EINTR is returned to the pthread_join() caller
instead of the exit code from the terminated thread. Under other
timing conditions, you get the deadlock caught by the new assertion
above.

This test program displays the problem (I only know how to make the
deadlock/assertion failure reproducible, not the errant return code):

/* This code is an abstraction of that which is found in both
   _Programming with POSIX Threads_ and boehm-gc (taken from 6.1-alpha
   but other versions appear similar).
*/

#include <pthread.h>
#include <signal.h>
#include <unistd.h>

void
handler1 (int s)
{
  sigset_t mask;

  /* boehm-gc code uses a sem_post() and nominally blocks SIGUSR2
     inside this handler instead of the luck method, but that detail
     is not required to see the primary issue at hand.  */
  sigfillset (&mask);
  sigdelset (&mask, SIGUSR2);
  sigsuspend (&mask);
}

void
handler2 (int s)
{
  /* Do nothing.  Must exist to allow sigsuspend() to work properly.  */
}

void*
worker (void* arg)
{
  /* arg points at the pthread_t of the main thread.  */
  pthread_kill (*(pthread_t*)arg, SIGUSR1);
  sleep (1);
  pthread_kill (*(pthread_t*)arg, SIGUSR2);
  return NULL;
}

int
main (void)
{
  pthread_t w1;
  pthread_t w2;
  pthread_t m = pthread_self ();

  signal (SIGUSR1, handler1);
  signal (SIGUSR2, handler2);
  pthread_create (&w2, NULL, worker, &m);
  return pthread_join (w2, NULL);
}

Comments? Is there a workaround for the GC code (other than switching
to the _np interface points to stop/start threads, which was the whole
point of the signal tomfoolery)? Best case: does anyone see how to
better support this test case in the 4.X uthread implementation?

Regards,
Loren
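
The errant EINTR described above is easy to miss, because the test
program hands the pthread_join() result straight back as the process
exit status. A minimal sketch of a checking wrapper follows. The
helper names classify_join_result(), join_and_report(), and
echo_worker() are hypothetical and appear in neither libc_r nor
boehm-gc. The sketch only reports what pthread_join() returned; it is
not claimed to work around the race.

```c
#include <assert.h>
#include <errno.h>
#include <pthread.h>
#include <stdio.h>
#include <string.h>

/* Hypothetical helper: turn a pthread_join() return value into a label.
   POSIX does not list EINTR as a pthread_join() error, so seeing it
   here would match the errant return described in the report.  */
const char *
classify_join_result (int rc)
{
  if (rc == 0)
    return "joined";
  if (rc == EINTR)
    return "errant EINTR";
  return "other error";
}

/* Hypothetical helper: join THREAD, log the outcome on stderr, and
   return the raw pthread_join() result unchanged.  */
int
join_and_report (pthread_t thread, void **exit_value)
{
  int rc;

  /* Joining oneself would be a caller bug, not the race at hand.  */
  assert (!pthread_equal (thread, pthread_self ()));

  rc = pthread_join (thread, exit_value);
  fprintf (stderr, "pthread_join: %s (%d: %s)\n",
           classify_join_result (rc), rc, strerror (rc));
  return rc;
}

/* Trivial thread body, used only to exercise the helper.  */
void *
echo_worker (void *arg)
{
  return arg;
}
```

In the test program, main() could then end with
`return join_and_report (w2, NULL) != 0;` so that a nonzero exit status
is accompanied by a message naming the return code.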