From owner-freebsd-stable@FreeBSD.ORG  Thu Oct  2 12:08:22 2008
Return-Path: <owner-freebsd-stable@FreeBSD.ORG>
Delivered-To: freebsd-stable@FreeBSD.org
Received: from mx1.freebsd.org (mx1.freebsd.org [IPv6:2001:4f8:fff6::34])
	by hub.freebsd.org (Postfix) with ESMTP id D862210656AB;
	Thu,  2 Oct 2008 12:08:22 +0000 (UTC)
	(envelope-from rwatson@FreeBSD.org)
Received: from cyrus.watson.org (cyrus.watson.org [65.122.17.42])
	by mx1.freebsd.org (Postfix) with ESMTP id AF1178FC1A;
	Thu,  2 Oct 2008 12:08:22 +0000 (UTC)
	(envelope-from rwatson@FreeBSD.org)
Received: from fledge.watson.org (fledge.watson.org [65.122.17.41])
	by cyrus.watson.org (Postfix) with ESMTP id 5A69446B06;
	Thu,  2 Oct 2008 08:08:22 -0400 (EDT)
Date: Thu, 2 Oct 2008 13:08:22 +0100 (BST)
From: Robert Watson <rwatson@FreeBSD.org>
X-X-Sender: robert@fledge.watson.org
To: Stephen Clark <sclark46@earthlink.net>
In-Reply-To: <48E3DF5E.6040607@earthlink.net>
Message-ID: <alpine.BSF.1.10.0810021306370.9076@fledge.watson.org>
References: <48E36204.5090108@earthlink.net>
	<20081001115046.GA20384@icarus.home.lan>
	<20081001164856.GA6478@in-addr.com>
	<alpine.BSF.1.10.0810011854350.9076@fledge.watson.org>
	<48E3DF5E.6040607@earthlink.net>
User-Agent: Alpine 1.10 (BSF 962 2008-03-14)
MIME-Version: 1.0
Content-Type: TEXT/PLAIN; charset=US-ASCII; format=flowed
Cc: Gary Palmer <gpalmer@FreeBSD.org>, Jeremy Chadwick <koitsu@FreeBSD.org>,
	FreeBSD Stable <freebsd-stable@FreeBSD.org>
Subject: Re: resource leak
X-BeenThere: freebsd-stable@freebsd.org
X-Mailman-Version: 2.1.5
Precedence: list
List-Id: Production branch of FreeBSD source code <freebsd-stable.freebsd.org>
List-Unsubscribe: <http://lists.freebsd.org/mailman/listinfo/freebsd-stable>, 
	<mailto:freebsd-stable-request@freebsd.org?subject=unsubscribe>
List-Archive: <http://lists.freebsd.org/pipermail/freebsd-stable>
List-Post: <mailto:freebsd-stable@freebsd.org>
List-Help: <mailto:freebsd-stable-request@freebsd.org?subject=help>
List-Subscribe: <http://lists.freebsd.org/mailman/listinfo/freebsd-stable>,
	<mailto:freebsd-stable-request@freebsd.org?subject=subscribe>
X-List-Received-Date: Thu, 02 Oct 2008 12:08:22 -0000


On Wed, 1 Oct 2008, Stephen Clark wrote:

> A big part of the problem is that this seems to take about 100 days of 
> uptime to occur. We have some in-house test boxes but have never seen the 
> problem, probably because none of them have been up more than about 45 days. 
> The units in the field, of which there are about 300, are headless and none 
> are physically close.
>
> When the boxes are rebooted there are no error messages in any of the log 
> files, only the absence of information that would normally be logged by new 
> processes that would be spawned. We are getting ready to install a patch 
> that will try to gather more information.
>
> I thought about writing an app that would try to fork a child periodically 
> and record in a log file if there was an error. But EAGAIN is nonspecific as 
> to the real reason the fork failed. I was looking for some way to 
> periodically log the resources that would cause the fork failure.

The narrowness of the UNIX errno space is, at times, fairly unhelpful.

As far as I'm aware, the two main causes of EAGAIN out of fork() are 
exhaustion of the system-wide process limit (kern.maxproc) or exhaustion of 
the per-user process limit.  This 
suggests one or more run-away applications or services, or a gradual leak of 
processes from a service (perhaps a failure to GC dead children, or a gradual 
increase but never decrease of worker processes?).
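
A minimal sketch of the periodic probe Stephen describes, assuming Python 3 
and a ps(1) that accepts "-A -o uid=" are available on the boxes.  The 
function names and the log path are hypothetical.  It logs the per-user limit 
(RLIMIT_NPROC) and the current process counts on every pass, so that the trend 
leading up to a failure is on record; once fork() starts failing, ps(1) cannot 
be spawned either, and the counts are logged as None.

```python
import errno
import os
import resource
import subprocess
import time


def try_fork():
    """Fork a child that exits at once; return None on success, else the errno."""
    try:
        pid = os.fork()
    except OSError as e:
        return e.errno
    if pid == 0:
        os._exit(0)        # child: exit immediately without running cleanup
    os.waitpid(pid, 0)     # parent: reap the child so the probe leaks nothing
    return None


def count_processes():
    """Return (total processes, processes owned by our uid) using ps(1).

    Running ps itself needs a fork, so this returns (None, None) once the
    process limit has been reached.
    """
    try:
        out = subprocess.check_output(["ps", "-A", "-o", "uid="])
    except (OSError, subprocess.CalledProcessError):
        return None, None
    uids = out.split()
    me = str(os.getuid()).encode()
    return len(uids), sum(1 for u in uids if u == me)


def snapshot():
    """Collect the limit and usage figures relevant to EAGAIN from fork()."""
    soft, hard = resource.getrlimit(resource.RLIMIT_NPROC)
    total, mine = count_processes()
    return {"nproc_soft": soft, "nproc_hard": hard,
            "procs_total": total, "procs_uid": mine}


def probe(log_path):
    """Attempt one fork and append the outcome plus a snapshot to log_path."""
    err = try_fork()
    status = "ok" if err is None else errno.errorcode.get(err, str(err))
    stamp = time.strftime("%Y-%m-%d %H:%M:%S")
    with open(log_path, "a") as f:
        f.write("%s fork=%s %r\n" % (stamp, status, snapshot()))
    return err


def run_forever(log_path, interval=300):
    """Probe every `interval` seconds until killed."""
    while True:
        probe(log_path)
        time.sleep(interval)
```

Calling run_forever("/var/tmp/forkprobe.log") from a long-lived process would 
append one line every five minutes.  The system-wide limits, kern.maxproc and 
kern.maxprocperuid, rarely change at run time and could be recorded once at 
startup with sysctl(8).  A steadily rising procs_total or procs_uid over weeks 
would point to the kind of gradual process leak described above.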

Robert N M Watson
Computer Laboratory
University of Cambridge

>
> procstat -k looks like it would have been a good candidate but unfortunately 
> we are running 6.1.
>
> Thanks for the response.
> Steve