Date:      Thu, 02 Oct 2008 16:55:25 +0930
From:      Wayne Sierke <ws@au.dyndns.ws>
To:        sclark46@earthlink.net
Cc:        FreeBSD Stable <freebsd-stable@freebsd.org>, Jeremy Chadwick <koitsu@freebsd.org>, Robert Watson <rwatson@freebsd.org>
Subject:   Re: resource leak
Message-ID:  <1222932325.2581.277.camel@predator-ii.buffyverse>
In-Reply-To: <48E3DF5E.6040607@earthlink.net>
References:  <48E36204.5090108@earthlink.net> <20081001115046.GA20384@icarus.home.lan> <20081001164856.GA6478@in-addr.com> <alpine.BSF.1.10.0810011854350.9076@fledge.watson.org> <48E3DF5E.6040607@earthlink.net>

On Wed, 2008-10-01 at 16:36 -0400, Stephen Clark wrote: 
> Robert Watson wrote:
> > On Wed, 1 Oct 2008, Gary Palmer wrote:
> > 
> >> "ps alxw" may be of interest in addition to "ps auxw" as it displays 
> >> what the processes are waiting on.  It could conceivably be a problem 
> >> of some kind at the filesystem level.  I've seen situations before 
> >> where a problem escalates to the point where "ls /" hangs, and at that 
> >> point you're stuck with an unresponsive box.
> > 
> > If you want an even greater level of detail than ps -l, you can use 
> > procstat -k to generate kernel stack traces for all user/kernel 
> > threads.  Wait channels are very useful, but they only tell you what the 
> > code that invoked the wait thinks it is for, not how that code was 
> > reached.  A classic example is waiting on an exhausted UMA zone -- you 
> > get a uma wait channel, but no indication of what subsystem performed 
> > the memory allocation...  This requires FreeBSD 7.1 or higher, 
> > however.  (Obviously, the same can be done easily using DDB, but that's 
> > hard on a box without a serial console, and requires interrupting the 
> > flow of the operating system, compiling with DDB, etc).
> > 
> > Robert N M Watson
> > Computer Laboratory
> > University of Cambridge
> > 
> A big part of the problem is that this seems to take about 100 days of uptime
> to occur. We have some in-house test boxes but have never seen the problem on
> them, probably because none of them have been up more than about 45 days. The
> units in the field, of which there are about 300, are headless and none are
> physically close.
> 
> When the boxes are rebooted there are no error messages in any of the log
> files, only the absence of information that would normally have been logged
> by newly spawned processes. We are getting ready to install a patch that will
> try to gather more information.
> 
> I thought about writing an app that would try to fork a child periodically
> and record in a log file if there was an error. But EAGAIN is nonspecific as
> to the real reason the fork failed. I was looking for some way to
> periodically log the resources whose exhaustion would cause the fork failure.
> 
> procstat -k looks like it would have been a good candidate, but unfortunately
> we are running 6.1.
> 
> Thanks for the response.
> Steve


I have a VIA EPIA-based system that was rebooting without leaving behind
any diagnosable evidence that I could find. Attaching a serial console
revealed a kernel trap that was double-faulting as it went to write the
kernel dump. I've not yet had the opportunity to investigate further,
except that out of desperation I threw in an additional 64M of RAM (all
I had to hand) on top of its 256M, and I haven't seen it fault again in
the 37 days since; before that it would often stay up for less than a
day.
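
(For what it's worth, on 6.x I believe getting a serial console is just
a matter of something like the following, plus turning on the ttyd0
getty line in /etc/ttys if you also want a login prompt; the exact knobs
may vary with your hardware:

    # /boot/loader.conf
    console="comconsole"
)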

I wonder whether it would be worth your while running a bench unit with
limited RAM, either physically or via the hw.physmem tunable. I would
probably try to identify the amount of RAM that just allows it to run
"normally", ideally subjecting it to a typical workload if possible. If
it bombs after running for a reasonable length of time, add back a
fraction of the unused memory and see whether it then stays up
proportionally longer, which would be indicative of a memory starvation
issue.
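
For the tunable route, I believe a single line in /boot/loader.conf is
enough (the value here is purely illustrative; adjust to taste):

    # cap the kernel's view of physical memory
    hw.physmem="128M"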

If you can get it to bomb in the above scenario, then you can probably
get some insight into where it's failing. Having said that, I should
point out that I've not actually used the above technique, so I may well
be overlooking something that would prevent it from being useful or
indeed from "working" at all.
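
As for periodically logging the resources behind a failed fork: a small
probe along these lines might do it. This is an untested sketch; the
sysctl names are from memory, so check them against sysctl -a on 6.1
before trusting the output.

    /*
     * Fork a throwaway child; if fork() fails, log errno together
     * with a few counters that commonly sit behind EAGAIN.
     */
    #include <sys/types.h>
    #include <sys/sysctl.h>
    #include <sys/wait.h>

    #include <errno.h>
    #include <stdio.h>
    #include <string.h>
    #include <time.h>
    #include <unistd.h>

    static int
    geti(const char *name)
    {
        int v = -1;
        size_t len = sizeof(v);

        if (sysctlbyname(name, &v, &len, NULL, 0) == -1)
            v = -1;
        return (v);
    }

    int
    main(void)
    {
        pid_t pid = fork();

        if (pid == 0)
            _exit(0);               /* child: vanish immediately */
        if (pid > 0) {
            waitpid(pid, NULL, 0);  /* parent: reap, all is well */
            return (0);
        }

        /* Save errno before further syscalls can clobber it. */
        int err = errno;

        /* fork() failed: record why, plus the usual suspects. */
        printf("%ld fork: %s kern.openfiles=%d kern.maxfiles=%d "
            "vm.stats.vm.v_free_count=%d\n",
            (long)time(NULL), strerror(err),
            geti("kern.openfiles"), geti("kern.maxfiles"),
            geti("vm.stats.vm.v_free_count"));
        return (1);
    }

Run it from cron every few minutes with output appended to a log, and if
a box wedges the way you describe, the tail of the log should at least
say which counter (if any) was pinned when fork started failing.
kern.maxproc and friends could be added in the same way.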


Wayne
