From owner-freebsd-threads@FreeBSD.ORG Fri Jun 11 06:10:15 2004 Return-Path: Delivered-To: freebsd-threads@freebsd.org Received: from mx1.FreeBSD.org (mx1.freebsd.org [216.136.204.125]) by hub.freebsd.org (Postfix) with ESMTP id AC4B416A4CE; Fri, 11 Jun 2004 06:10:15 +0000 (GMT) Received: from mail.mcneil.com (rrcs-west-24-199-45-54.biz.rr.com [24.199.45.54]) by mx1.FreeBSD.org (Postfix) with ESMTP id 6150C43D41; Fri, 11 Jun 2004 06:10:15 +0000 (GMT) (envelope-from sean@mcneil.com) Received: from localhost (localhost.mcneil.com [127.0.0.1]) by mail.mcneil.com (Postfix) with ESMTP id 3BF8DFD059; Thu, 10 Jun 2004 23:09:52 -0700 (PDT) Received: from mail.mcneil.com ([127.0.0.1]) by localhost (server.mcneil.com [127.0.0.1]) (amavisd-new, port 10024) with ESMTP id 38633-06; Thu, 10 Jun 2004 23:09:51 -0700 (PDT) Received: from [24.199.45.54] (mcneil.com [24.199.45.54]) by mail.mcneil.com (Postfix) with ESMTP id 7A8CAFD01A; Thu, 10 Jun 2004 23:09:51 -0700 (PDT) From: Sean McNeil To: Daniel Eischen In-Reply-To: References: Content-Type: text/plain Message-Id: <1086934191.85039.1.camel@server.mcneil.com> Mime-Version: 1.0 X-Mailer: Ximian Evolution 1.4.6 Date: Thu, 10 Jun 2004 23:09:51 -0700 Content-Transfer-Encoding: 7bit X-Virus-Scanned: by amavisd-new at mcneil.com cc: David Xu cc: freebsd-threads@freebsd.org Subject: Re: signal handler priority issue X-BeenThere: freebsd-threads@freebsd.org X-Mailman-Version: 2.1.1 Precedence: list List-Id: Threading on FreeBSD List-Unsubscribe: , List-Archive: List-Post: List-Help: List-Subscribe: , X-List-Received-Date: Fri, 11 Jun 2004 06:10:15 -0000 On Thu, 2004-06-10 at 23:02, Daniel Eischen wrote: > On Thu, 10 Jun 2004, Sean McNeil wrote: > > > On Thu, 2004-06-10 at 22:04, David Xu wrote: > > > Can you provide some code to demostrate the problem ? I have interest. > > > > > > David Xu > > > > Yes, here is some code snippets from boehm-gc: > > > > master thread doing: > > > > int GC_suspend_all() > > { > > int n_live_threads = 0; > > int i; > > GC_thread p; > > int result; > > pthread_t my_thread = pthread_self(); > > > > GC_stopping_thread = my_thread; /* debugging only. */ > > GC_stopping_pid = getpid(); /* debugging only. */ > > for (i = 0; i < THREAD_TABLE_SZ; i++) { > > for (p = GC_threads[i]; p != 0; p = p -> next) { > > if (p -> id != my_thread) { > > if (p -> flags & FINISHED) continue; > > if (p -> stop_info.last_stop_count == GC_stop_count) continue; > > if (p -> thread_blocked) /* Will wait */ continue; > > n_live_threads++; > > #if DEBUG_THREADS > > GC_printf1("Sending suspend signal to 0x%lx\n", p -> id); > > #endif > > > > result = pthread_kill(p -> id, SIG_SUSPEND); > > switch(result) { > > case ESRCH: > > /* Not really there anymore. Possible? */ > > n_live_threads--; > > break; > > case 0: > > break; > > default: > > ABORT("pthread_kill failed"); > > } > > } > > } > > } > > return n_live_threads; > > } > > Am I blind, or is there no sem_wait() above after pthread_kill(). No, you aren't blind. It is done by the calling routine of above: n_live_threads = GC_suspend_all(); if (GC_retry_signals) { unsigned long wait_usecs = 0; /* Total wait since retry. */ # define WAIT_UNIT 3000 # define RETRY_INTERVAL 100000 for (;;) { int ack_count; sem_getvalue(&GC_suspend_ack_sem, &ack_count); if (ack_count == n_live_threads) break; if (wait_usecs > RETRY_INTERVAL) { int newly_sent = GC_suspend_all(); # ifdef CONDPRINT if (GC_print_stats) { GC_printf1("Resent %ld signals after timeout\n", newly_sent); } # endif sem_getvalue(&GC_suspend_ack_sem, &ack_count); if (newly_sent < n_live_threads - ack_count) { WARN("Lost some threads during GC_stop_world?!\n",0); n_live_threads = ack_count + newly_sent; } wait_usecs = 0; } usleep(WAIT_UNIT); wait_usecs += WAIT_UNIT; } } for (i = 0; i < n_live_threads; i++) { if (0 != (code = sem_wait(&GC_suspend_ack_sem))) { GC_err_printf1("Sem_wait returned %ld\n", (unsigned long)code); ABORT("sem_wait for handler failed"); } } > > and slave threads doing: > > > > void GC_suspend_handler(int sig) > > { > > int dummy; > > pthread_t my_thread = pthread_self(); > > GC_thread me; > > sigset_t mask; > > # ifdef PARALLEL_MARK > > word my_mark_no = GC_mark_no; > > /* Marker can't proceed until we acknowledge. Thus this is */ > > /* guaranteed to be the mark_no correspending to our */ > > /* suspension, i.e. the marker can't have incremented it yet. */ > > # endif > > word my_stop_count = GC_stop_count; > > > > if (sig != SIG_SUSPEND) ABORT("Bad signal in suspend_handler"); > > > > #if DEBUG_THREADS > > GC_printf1("Suspending 0x%lx\n", my_thread); > > #endif > > > > me = GC_lookup_thread(my_thread); > > /* The lookup here is safe, since I'm doing this on behalf */ > > /* of a thread which holds the allocation lock in order */ > > /* to stop the world. Thus concurrent modification of the */ > > /* data structure is impossible. */ > > if (me -> stop_info.last_stop_count == my_stop_count) { > > /* Duplicate signal. OK if we are retrying. */ > > if (!GC_retry_signals) { > > WARN("Duplicate suspend signal in thread %lx\n", > > pthread_self()); > > } > > return; > > } > > # ifdef SPARC > > me -> stop_info.stack_ptr = (ptr_t)GC_save_regs_in_stack(); > > # else > > me -> stop_info.stack_ptr = (ptr_t)(&dummy); > > # endif > > # ifdef IA64 > > me -> backing_store_ptr = (ptr_t)GC_save_regs_in_stack(); > > # endif > > > > /* Tell the thread that wants to stop the world that this */ > > /* thread has been stopped. Note that sem_post() is */ > > /* the only async-signal-safe primitive in LinuxThreads. */ > > sem_post(&GC_suspend_ack_sem); > > me -> stop_info.last_stop_count = my_stop_count; > > > > #if DEBUG_THREADS > > GC_printf2("Waiting for restart #%d of 0x%lx\n", my_stop_count, my_thread); > > #endif > > > > /* Wait until that thread tells us to restart by sending */ > > /* this thread a SIG_THR_RESTART signal. */ > > /* SIG_THR_RESTART should be masked at this point. Thus there */ > > /* is no race. */ > > You also need to make sure SIGUSR1 is masked while in this > function. It could possibly get another one after the sem_post() > and before the sigsuspend() below. > > > if (sigfillset(&mask) != 0) ABORT("sigfillset() failed"); > > if (sigdelset(&mask, SIG_THR_RESTART) != 0) ABORT("sigdelset() failed"); > > # ifdef NO_SIGNALS > > if (sigdelset(&mask, SIGINT) != 0) ABORT("sigdelset() failed"); > > if (sigdelset(&mask, SIGQUIT) != 0) ABORT("sigdelset() failed"); > > if (sigdelset(&mask, SIGTERM) != 0) ABORT("sigdelset() failed"); > > if (sigdelset(&mask, SIGABRT) != 0) ABORT("sigdelset() failed"); > > # endif > > do { > > me->stop_info.signal = 0; > > sigsuspend(&mask); /* Wait for signal */ > > } while (me->stop_info.signal != SIG_THR_RESTART); > > /* If the RESTART signal gets lost, we can still lose. That should be */ > > /* less likely than losing the SUSPEND signal, since we don't do much */ > > /* between the sem_post and sigsuspend. */ > > /* We'd need more handshaking to work around that, since we don't want */ > > /* to accidentally leave a RESTART signal pending, thus causing us to */ > > /* continue prematurely in a future round. */ > > > > #if DEBUG_THREADS > > GC_printf1("Continuing 0x%lx\n", my_thread); > > #endif > > } > > > > and here is the output with debug messages: > > > > Stopping the world from 0x50d000 > > Sending suspend signal to 0x9d1400 > > Suspending 0x9d1400 > > World stopped from 0x50d000 > > Pushing stacks from thread 0x50d000 > > Stack for thread 0x9d1400 = [7fffffeed95c,7fffffeee000) > > Stack for thread 0x50d000 = [7fffffffcf00,800000000000) > > World starting > > Sending restart signal to 0x9d1400 > > World started > > Buildfile: build.xml > > Stopping the world from 0x50d000 > > Sending suspend signal to 0x9d1400 > > Waiting for restart #2 of 0x9d1400 > > > > There are other things causing output, but you can see that the slave > > isn't getting to the sigsuspend before it is evoked again from the > > master thread. > > Something is weird. It's not getting the first restart, but > yet it's past the sem_post() in the second restart. Try displaying > the return from sem_post(). And where's the missing sem_wait()?