Message-ID: <43062603.5050206@nbux.com>
Date: Fri, 19 Aug 2005 20:33:39 +0200
From: Christophe Yayon
Organization: nbux.com
To: freebsd-hackers@freebsd.org
Subject: nagios and freebsd threads issue : help please ...
List-Id: Technical Discussions relating to FreeBSD

Hi all,

You probably know about the threading issues between FreeBSD and Nagios 2.0b (a forked process using 100% CPU, lost check results, and the main Nagios process pausing under certain obscure conditions).

Some Nagios developers say the problem is in FreeBSD, and others say it is in the way Nagios uses pthreads. Here is a summary of our discussions:

-------
The thread I started is here:
http://marc.theaimsgroup.com/?t=111930118000001&r=1&w=2

There are some very interesting replies; a few in particular note that Nagios may be breaking the POSIX spec in how it spawns and destroys threads:
http://marc.theaimsgroup.com/?l=freebsd-hackers&m=111944526323754&w=2
http://marc.theaimsgroup.com/?l=freebsd-hackers&m=111945035012258&w=2

Anyhow, I'm sure if Ethan were to post some more specific info to freebsd-hackers@fr... (it's an open list, no need to sub), this issue could get banged out pretty quickly.

Shortly after this thread, I found another where the issue was brought up by another curious poster, and he was using 5.4, which uses a newer threading library:
http://marc.theaimsgroup.com/?t=112119712600002&r=1&w=2

This post again brings up the "fork without exec or exit" possibly not following spec:
http://marc.theaimsgroup.com/?l=freebsd-hackers&m=112125883804481&w=2

"I don't know what Nagios does just after fork(2); it would be worth checking. It appears that fork(2)ing without exec(2)ing or _exit(2)ing in a pthreaded program is not a "valid" behaviour according to SUSv3 [1]. I don't want to avoid admitting there is a problem in the FreeBSD threading library, and I don't know how other OSes handle this, but Nagios folks should really avoid doing what is explicitly discouraged in SUSv3."
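
To make the "fork, then exec or _exit right away" advice in the post above concrete, here is a minimal sketch in C. It is not Nagios code: run_check() is a hypothetical helper name, and the assumption is that a check is an external program given by an absolute path. The child calls only execv() and _exit() after fork(), both of which are async-signal-safe.

```c
#include <assert.h>
#include <errno.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

/* Hypothetical helper, not taken from Nagios: run the program at the
 * absolute path argv[0] and return its exit status (-1 on error). */
int run_check(char *const argv[])
{
    assert(argv != NULL && argv[0] != NULL);

    pid_t pid = fork();
    if (pid < 0)
        return -1;                      /* fork() failed */

    if (pid == 0) {
        /* Child: only the thread that called fork() exists here, so a
         * lock held by any other thread would never be released. Call
         * nothing but async-signal-safe functions before the exec. */
        execv(argv[0], argv);
        _exit(127);                     /* reached only if execv() failed */
    }

    /* Parent: reap the child, retrying if a signal interrupts the wait. */
    int status;
    while (waitpid(pid, &status, 0) < 0) {
        if (errno != EINTR)
            return -1;
    }
    return WIFEXITED(status) ? WEXITSTATUS(status) : -1;
}
```

A caller would pass a NULL-terminated argument vector, for example {"/bin/sh", "-c", "exit 3", NULL}, and get back the child's exit status. The child does no logging, allocation, or locking between fork() and execv(), which is the behaviour the quoted post asks for.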
--------

--------
As the problem isn't in Nagios and no one seems to have an authoritative answer on what exactly is causing it, I'd say you would be better off switching to a GNU/Linux system, with at least Linux 2.4.29 and glibc-2.3 (a lot of work was put into thread safety in glibc-2.3).
--------

--------
From http://www.opengroup.org/onlinepubs/009695399/functions/pthread_atfork.html

"It is suggested that programs that use fork() call an exec function very soon afterwards in the child process, thus resetting all states. In the meantime, only a short list of async-signal-safe library routines are promised to be available."

Note *suggested*. This is a recommendation to protect against a shoddy pthread implementation. The thread specifications rule that only the thread calling fork() is duplicated, which is what leads to the recommendation in the first place (other threads holding locks aren't around to release them in the new execution context).

That said, Nagios would most likely benefit greatly from a different means of checking things than fork()ing twice and sending the results through several tiers of FIFOs. Several different methods have already been benchmarked. For server machines (or at least machines with a lot of memory and, quite regularly, multiple CPUs), the best way seems to be to create a new thread for each check to run. popen() causes a fork() and execve(), so that should be safe enough.

What limits this imposes I don't know, but the NPTL library in use on most modern Linux systems today handles 10,000 threads without barfing, so the limit would probably be sysconf(_SC_OPEN_MAX), or ulimit -n, which is required by POSIX to be at least 256. Note that half this value (give or take 5 or so for stdin and such) represents the number of checks that can run simultaneously at any given time. When one of them completes, another can kick in.
--------

What do you think about this? Should we have a FreeBSD-specific threads patch for Nagios? Is this a Nagios problem or a FreeBSD problem?
Should we switch our Nagios systems to Linux (which would be psychologically very difficult for me...)?

Thanks in advance for your help. I hope we will find a solution.