From owner-freebsd-threads@FreeBSD.ORG Wed Aug 17 16:18:10 2005
Message-ID: <43036330.9000501@palisadesys.com>
Date: Wed, 17 Aug 2005 11:17:52 -0500
From: Guy Helmer
To: Julian Elischer
Cc: freebsd-threads@freebsd.org
References: <42D691F2.3030201@palisadesys.com> <42D6BA3E.1000306@elischer.org> <42D7BBB8.9050207@palisadesys.com> <42D8199E.1060702@elischer.org>
In-Reply-To: <42D8199E.1060702@elischer.org>
Subject: Re: system scope threads entering STOP state
List-Id: Threading on FreeBSD

Julian Elischer wrote:

> Guy Helmer wrote:
>
>> Julian Elischer wrote:
>>
>>> Guy Helmer wrote:
>>>
>>>> I have a long-running multithreaded process on FreeBSD 5.4 (SMP,
>>>> PREEMPTION, SCHED_4BSD) linked with libpthread and I'm creating the
>>>> threads with attribute PTHREAD_SCOPE_SYSTEM. The threads need to
>>>> be processing input in near-real-time or its input buffers overflow.
>>>>
>>>> I've modified the program so that a thread can fork/execl/waitpid
>>>> (without WNOHANG) to use an external program for further processing
>>>> on a batch of input (sometimes via a pipe, other times via writing
>>>> to a file). However, even under a light input load, the program is
>>>> now dropping input. While running top(1) in thread mode, I
>>>> occasionally find all the program's threads are in the STOP state
>>>> for several consecutive seconds. Is there anything related to the
>>>> frequent use of fork, execve, or wait4 that would be likely to
>>>> cause such a situation? I'm not seeing anything obvious in my
>>>> reading of the kernel sources.
>>>
>>> During a fork the parent process is in a variant of the "STOPPED"
>>> state, or, rather, if you look at top -H you should see that all
>>> the threads except for the one doing the fork are in the STOPPED
>>> state.
>>>
>>> This is because while a thread is forking the process needs to be
>>> single threaded so that there is a consistent image to be copied
>>> to the child.
>>>
>>> The single threaded state is also entered for exit() and execve(),
>>> though that should not affect your program.
>>>
>>> I can't imagine why the state would persist for any length of time,
>>> unless there is another thread that is in an uninterruptible wait.
>>> In that case the other threads have to wait for it to complete
>>> what it is doing and come back. I have considered whether such a
>>> thread should not be considered "already suspended" and in fact
>>> some earlier versions of the code did that, however it leads to
>>> some inconsistencies and the danger that such a thread will be
>>> suspended holding some resource that it should not hold for any
>>> length of time.
>>
>> Thanks for the explanation. I was [aware] that the other threads
>> would be stopped during a fork(2) but it looked to me like the STOP
>> would be brief.
>> Would an "uninterruptible wait" include system calls like a write(2)
>> of a large buffer? That would explain it...
>
> it's hard to say.. Possibly yes, if it had to allocate buffer space.
> However this is a question for others..
>
> Is it possible to duplicate this on request?

[where did the past month go?]

I think I found the culprit - I think the process in question was
actually dumping core and it is a large process - between 50MB and
100MB - so that would explain the 10+ seconds all the threads were in
the STOP state. It was difficult to notice while running top(1) since a
watchdog process immediately restarts the multi-threaded process if it
exits due to things like segfaults, and I was paying attention to the
state column, not the PID column.

Sorry for what was a bit of a wild-goose chase,
Guy

-- 
Guy Helmer, Ph.D.
Principal System Architect
Palisade Systems, Inc.
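
For anyone who hits the same symptom: below is a minimal sketch, in C, of the fork/execl/waitpid (without WNOHANG) pattern described in this thread, extended to report whether the child was killed by a signal and whether it dumped core. The names run_child and struct child_result are hypothetical and do not come from the program discussed above, and running the command through /bin/sh is an assumption made only to keep the example short. WCOREDUMP is not part of POSIX, although FreeBSD provides it, so the sketch guards its use.

```c
#include <errno.h>
#include <signal.h>     /* signal numbers such as SIGSEGV, for callers */
#include <stdbool.h>
#include <string.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

/* Hypothetical result type: how the child process ended. */
struct child_result {
    bool exited;       /* child terminated normally via exit() */
    int  exit_code;    /* valid only when exited is true */
    bool signaled;     /* child was terminated by a signal */
    int  signal_no;    /* valid only when signaled is true */
    bool dumped_core;  /* valid only when signaled is true */
};

/*
 * Run "sh -c command", block until it finishes, and classify the
 * wait status. Returns 0 on success, -1 if fork() or waitpid() fails.
 */
int run_child(const char *command, struct child_result *out)
{
    int status;
    pid_t pid;

    memset(out, 0, sizeof(*out));

    pid = fork();
    if (pid < 0)
        return -1;

    if (pid == 0) {
        /* Child: replace the process image with the shell. */
        execl("/bin/sh", "sh", "-c", command, (char *)NULL);
        _exit(127);    /* reached only if execl() failed */
    }

    /* Parent: blocking wait (no WNOHANG), retried if interrupted. */
    while (waitpid(pid, &status, 0) < 0) {
        if (errno != EINTR)
            return -1;
    }

    if (WIFEXITED(status)) {
        out->exited = true;
        out->exit_code = WEXITSTATUS(status);
    } else if (WIFSIGNALED(status)) {
        out->signaled = true;
        out->signal_no = WTERMSIG(status);
#ifdef WCOREDUMP
        /* Non-POSIX extension; left false where it is unavailable. */
        out->dumped_core = WCOREDUMP(status) ? true : false;
#endif
    }
    return 0;
}
```

A watchdog that logs signal_no and dumped_core before restarting its child would make a crash-and-restart cycle visible without having to watch the PID column in top(1).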