Date:      Sun, 20 Apr 2008 09:53:49 -0700
From:      Chris Pratt <eagletree@hughes.net>
To:        Robert Watson <rwatson@FreeBSD.org>
Cc:        net@freebsd.org
Subject:   Re: zonelimit issues...
Message-ID:  <33AC96BF-B9AC-4303-9597-80BC341B7309@hughes.net>
In-Reply-To: <20080420103258.D67663@fledge.watson.org>
References:  <m2hcdztsx2.wl%gnn@neville-neil.com> <48087C98.8060600@delphij.net> <382258DB-13B8-4108-B8F4-157F247A7E4B@hughes.net> <20080420103258.D67663@fledge.watson.org>

On Apr 20, 2008, at 2:43 AM, Robert Watson wrote:

>
> On Fri, 18 Apr 2008, Chris Pratt wrote:
>
>> Doesn't 7.0 fix this? I'd like to see an official definitive  
>> answer and all I've been going on is that the problem description  
>> is no longer in the errata.
>
> Unfortunately, bugs of this sort don't really "work" that way --  
> specific bugs are a property of a problem in code (or a problem in  
> design), but what we have right now is a report of a symptom that  
> might reflect zero or more specific bugs.  It's unclear that the  
> problem described in errata is the problem you've been  
> experiencing, or that the (at least one) fixed bug with the same  
> symptoms is the one you've been experiencing.  For better or
> worse, the only way to really tell if a generic class of hang or
> wedging is fixed is to try out the new version and see.  In most  
> cases, "zonelimit" wedging reflects one of two things:
>
> (1) Inadequate resource allocation to the network stack or some other
>     component, try tuning up the memory tunable for clusters (for
>     example).
>
For several months I did quite a bit of tuning. I never increased
nmbclusters beyond the 32768 shown in the docs because man
tuning doesn't define its use of "arbitrarily high". Inability to boot
could mean travel. Kris Kennaway had provided instructions to
get a dump. I set up for that but have never had a dump. The
only respite came from adding another circuit, another NIC and
spreading traffic. That stretched the time between lockups from
every couple of days during the heavy bot period of late 2006 to
about once a month now, or even every two months during
traditionally slow months. For example, we ran a record 72 days
last summer. It was a very dead summer traffic-wise.

I will try to increase nmbclusters dramatically if I can figure
out what a safe upper limit is, but it sounds like the jump to
7.0-RELEASE may be worth the effort. I would want to wait
until this issue with TCP, Windows and certain routers is well
past. I had not seen a fix for that applied to 7_0_0 yet, and that
would be a show stopper. Is there a way to know what is safe for
nmbclusters given a system with 8 GB of RAM?
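
For a sense of scale, the memory behind a given nmbclusters value
can be estimated with simple arithmetic. The sketch below is Python
and assumes the standard 2048-byte mbuf cluster (MCLBYTES). The
candidate values other than 32768 are illustrative only, and the
helper names are made up for this example. It does not answer what
is actually safe, since kernel address space and other limits also
apply.

```python
# Rough estimate of the memory a full mbuf cluster zone could hold.
# Assumption: standard clusters of 2048 bytes (MCLBYTES).
MCLBYTES = 2048


def cluster_pool_bytes(nmbclusters, cluster_size=MCLBYTES):
    """Upper bound on cluster memory if every cluster were allocated."""
    return nmbclusters * cluster_size


def as_mib(nbytes):
    """Convert bytes to MiB."""
    return nbytes / (1024 * 1024)


RAM_BYTES = 8 * 1024 ** 3  # the 8 GB machine mentioned above

# 32768 is the documented value; the larger ones are hypothetical.
for n in (32768, 65536, 131072, 262144):
    pool = cluster_pool_bytes(n)
    share = 100 * pool / RAM_BYTES
    print(f"{n:>7} clusters -> {as_mib(pool):6.0f} MiB "
          f"({share:.2f}% of 8 GiB)")
```

This counts only the clusters themselves, not the mbufs that point
at them, so it is a lower bound on what the network stack could use
at that setting.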

I collected vmstat data for a couple of months when things
were at their worst. The results were hard for me to interpret
because I don't know the code. All I actually found was that a
certain counter would drop to 0 and never recover. I didn't
know if it was meaningful and received no replies when I
asked on FreeBSD-Questions. It was 128-Bucket or something
like that.
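
One way to make that kind of collection easier to read is to parse
each saved snapshot and flag zones that have no free items or that
have reached their limit. The Python sketch below assumes the
snapshots came from vmstat -z and that each zone line has the form
"NAME: SIZE, LIMIT, USED, FREE, REQUESTS" (later releases add more
columns, which this ignores). The sample text and all of its numbers
are invented for illustration, and the function names are
hypothetical.

```python
# Made-up snapshot in the assumed `vmstat -z` layout; not real data.
SAMPLE = """\
ITEM                     SIZE     LIMIT      USED      FREE  REQUESTS

UMA Kegs:                 140,        0,       66,        6,       66
128 Bucket:               524,        0,     2487,        0,    91022
mbuf_cluster:            2048,    32768,    32768,        0,  5120331
mbuf:                     256,        0,     1045,      770,  9912004
"""


def parse_vmstat_z(text):
    """Return {zone name: [SIZE, LIMIT, USED, FREE, REQUESTS, ...]}."""
    zones = {}
    for line in text.splitlines():
        name, sep, rest = line.partition(":")
        if not sep:
            continue  # header or blank line, no zone name present
        try:
            values = [int(field) for field in rest.split(",")]
        except ValueError:
            continue  # not a numeric zone line
        if len(values) >= 4:
            zones[name.strip()] = values
    return zones


def zero_free(zones):
    """Zones whose FREE column is 0."""
    return [name for name, v in zones.items() if v[3] == 0]


def at_limit(zones):
    """Zones with a nonzero LIMIT where USED has reached it."""
    return [name for name, v in zones.items() if v[1] > 0 and v[2] >= v[1]]


zones = parse_vmstat_z(SAMPLE)
print("no free items:", zero_free(zones))
print("at limit:     ", at_limit(zones))
```

A zone with no free items is not necessarily a problem by itself. A
zone that sits at its limit across many snapshots is the more
interesting signal for a "zonelimit" wedge.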

> (2) A memory leak in a network device driver or other network part,
>     which needs to be debugged and fixed.
>

Initially I thought there might be something related to the bge
driver and moved the high-traffic apps onto an em. This didn't
seem to help much, nor did polling.

I am most willing to collect data if I could figure out how to
collect something meaningful. I gather from what you say
that 7.0 would provide this.

I really appreciate both of your responses. Just based on
this one problem, 6.x has been a bad experience after
years of seemingly impossible uptime on FreeBSD 4.x
and 5.x.

> On at least one prior occasion, there has been a bug in UMA itself  
> that led to getting stuck in zonelimit, and it's not impossible
> there's a scheduler sleep/wakeup bug that would lead to a similar  
> symptom but for a different reason.
>
> In FreeBSD 7-STABLE, you can now use procstat -k to print kernel  
> stack traces of user threads blocked in kernel, which may make  
> diagnosing the general class of problem a bit easier without using  
> a kernel debugger.  "zonelimit" is the generic wait channel across  
> all memory type and allocation paths, so doesn't reveal a lot about  
> *which* limit is being hit.  Using a kernel stack trace, we can see  
> which specific memory type and allocation context is involved.
>
> Robert N M Watson
> Computer Laboratory
> University of Cambridge
> _______________________________________________
> freebsd-net@freebsd.org mailing list
> http://lists.freebsd.org/mailman/listinfo/freebsd-net
> To unsubscribe, send any mail to "freebsd-net-unsubscribe@freebsd.org"



