Date:      Thu, 29 Nov 2012 08:37:24 -0700
From:      Ian Lepore <freebsd@damnhippie.dyndns.org>
To:        "Robert N. M. Watson" <rwatson@freebsd.org>
Cc:        freebsd-arch@freebsd.org
Subject:   Re: Print a (rate-limited) warning when UMA zone is full.
Message-ID:  <1354203444.69940.205.camel@revolution.hippie.lan>
In-Reply-To: <98FCA89B-D1DF-4002-B44F-A59DCB5ED020@FreeBSD.org>
References:  <20121129090147.GB1370@garage.freebsd.pl> <alpine.BSF.2.00.1211291027430.59662@fledge.watson.org> <20121129103752.GD1370@garage.freebsd.pl> <D7657157-0791-486D-8EF5-99488023E7ED@FreeBSD.org> <20121129105306.GE1370@garage.freebsd.pl> <0D8E588B-6FCB-4B01-9786-B5D42F16C3F0@FreeBSD.org> <20121129110518.GF1370@garage.freebsd.pl> <98FCA89B-D1DF-4002-B44F-A59DCB5ED020@FreeBSD.org>

On Thu, 2012-11-29 at 11:42 +0000, Robert N. M. Watson wrote:
> On 29 Nov 2012, at 11:05, Pawel Jakub Dawidek wrote:
> 
> > On Thu, Nov 29, 2012 at 10:56:32AM +0000, Robert N. M. Watson wrote:
> >> On 29 Nov 2012, at 10:53, Pawel Jakub Dawidek wrote:
> >>> Agreed, especially if reaching those limits is expected by the
> >>> administrator and he is not going to increase them. But in this case it
> >>> would be even better to provide a way to turn them off.
> >> 
> >> I wonder if each instance of a 'ratecheck' should come with an associated tunable/sysctl pair to allow suppression to be easily configured. I almost find myself wondering if we want something that looks a bit like our static SYSCTL/VFS_SET/etc declarations:
> >> 
> >> 	static RATECHECK(..., "foo.bar.baz", ...);
> >> 
> >> Unfortunately, the tunable/sysctl mismatch makes it slightly awkward since you'd need to declare both, but I think probably worthwhile.
> > 
> > I'm afraid you lost me here. Tunable/sysctl name is not related in any
> > way with the warning we are printing. How can you tell
> > kern.ipc.maxsockets affects limits of eight different UMA zones?
> > Also rate-limiting is not only used to print warnings, current
> > ppsratecheck() function just answer the question if the limit should be
> > enforced (something is happening too frequently) or not.
> 
> I meant something a bit different -- I was concerned with a per-instance sysctl/tunable to silence the warnings. For embedded systems or systems running at peak load, occasional allocation failures may be the steady state, in which case tuning down the rate of message printing, or simply disabling them rather than spamming logs, may be desirable.
> 
> Robert

I've been pondering the problem of rate-limiting error spewage and
automatic retry attempts on embedded systems for a long time.  Recently
I got around to writing something to deal with it.  What I wrote is
implemented as a C++ class, but it would translate to C code using a
struct and a couple related functions without too much trouble if
there's interest.

Extracted from the comment block describing it in the header file...

        RateLimiter     (See MTRateLimiter below for a thread-safe
        version).
        
        This helps you limit some action to "once every N seconds" with
        the option to specify multiple values of N so that the interval
        between actions increases over time.  It's especially useful to
        throttle error logging and error recovery attempts.  
        
        If you specify a single interval, this class functions as a
        simple "once every N seconds" filter, where RateCheck() returns
        true if it has been N seconds or more since it last returned
        true.  Boring.  
        
        The interesting part is that you can provide a series of
        increasing interval values so that the action becomes less
        frequent over time.  This can be especially useful for logging
        error conditions that are sometimes transient and sometimes go
        on for a long time.  
        
        For example, if you have code that polls hardware status once a
        second and logs any problems it finds, then you probably don't
        want to log once per second for as long as the problem lasts.
        On the other hand, you don't want to log it just once and hope
        someone sees that one crucial line in the log amongst all the
        other spewage that may be happening when there's some sort of
        problem going on.  When a problem first occurs it's worth
        mentioning often, maybe once a minute.  If the problem is still
        happening 15 minutes later then maybe once every 15 minutes is
        often enough.  But if the problem is still there after 12 hours,
        then mentioning it just a couple times a day from that point on
        is good enough.  This class lets you manage that by setting the
        intervals to 60, 900, 43200.  
        
        To use this critter to throttle UTDiag spewage, create an
        instance and set the interval(s), then call instance.RateCheck()
        every time you detect the error condition but only call UTDiag()
        when RateCheck() returns true.  A static instance at/near the
        point of use works well.  A typical usage scenario is often
        something like this: 
        
            while (!Stopped())
            {
                sleep(1);
                int statusBits = PollHardware();
                if (statusBits != 0)
                {
                    static TSC::RateLimiter throttle(60, 900, 7200, 86400);
                    if (throttle.RateCheck())
                    {
                        UTDiag("Franistan failure, status %#x", statusBits);
                        // could do other things here too.
                    }
                }
            }
        
        RateCheck() will internally manage advancement to the
        next-longer interval based on how long it has been since the
        prior actions.  The RateCheck() function is fairly lightweight
        and can be called often (it just gets the current time and does
        a bit of adding and subtracting).  
        
        If no call to RateCheck() is made for longer than the
        currently-active interval, the next call automatically resets
        the internal state back to the shortest interval.  Say what?
        Well, let's say the hardware problem is ongoing and you've been
        calling RateCheck() once per second for a while and now it's
        only returning true once every 15 minutes.  Magically, the
        hardware problem goes away and you don't call RateCheck() for a
        couple hours.  Then the problem comes back so you call
        RateCheck() again and it notices that the time elapsed since the
        last call is longer than the last active interval (15 minutes), so
        this isn't really a continuation of the previous situation, it's
        a whole new situation that starts over with the smallest
        interval and decays again from there over time.  
        
        If you set multiple intervals, they should increase in value; if
        any given interval is shorter than the one before it, strange
        things are likely to happen.  A zero ends the list (the zeroes
        are provided by the default args in the functions).  If the
        first interval is zero, RateCheck() will always return false.  
        
        You can force the internal state back to the shortest interval
        at any time by calling Reset().  After a reset, the next call to
        RateCheck() will return true (unless the shortest interval is
        zero).  The ctor calls Reset() internally, so the first call
        ever made to RateCheck() always returns true.  
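
The class itself isn't included in this message, so what follows is only a rough sketch of how the behavior described above might be implemented, not the original code. The name RateLimiterSketch, the RateCheckAt() helper that takes the time as an argument, the four-interval limit, and the use of a steady clock are all assumptions made for illustration. The rule for stepping to the next interval is also a guess: the description only says advancement depends on how long it has been since the prior actions, so this sketch steps up once the current episode has lasted as long as the next interval in the list.

```cpp
#include <array>
#include <cassert>
#include <chrono>
#include <cstddef>

// Hypothetical stand-in for the RateLimiter described above; not the original.
class RateLimiterSketch {
public:
    // Intervals are in seconds; a zero ends the list (supplied by the defaults).
    explicit RateLimiterSketch(long i0, long i1 = 0, long i2 = 0, long i3 = 0)
        : intervals_{{i0, i1, i2, i3}}
    {
        assert(i0 >= 0 && i1 >= 0 && i2 >= 0 && i3 >= 0);
        Reset();
    }

    // Back to the shortest interval; the next check starts a new episode.
    void Reset() { index_ = 0; fresh_ = true; }

    // Check against a monotonic clock.
    bool RateCheck()
    {
        using namespace std::chrono;
        auto now = duration_cast<seconds>(steady_clock::now().time_since_epoch());
        return RateCheckAt(static_cast<long>(now.count()));
    }

    // Same check with the time (in seconds) passed in, which keeps it testable.
    bool RateCheckAt(long now)
    {
        if (intervals_[0] == 0)
            return false;                   // first interval zero: never fires
        if (!fresh_ && now - lastCall_ > intervals_[index_])
            Reset();                        // idle longer than the active interval
        lastCall_ = now;
        if (fresh_) {                       // first check of an episode fires
            fresh_ = false;
            episodeStart_ = lastTrue_ = now;
            return true;
        }
        // Assumed advancement rule: step up once the episode has lasted as
        // long as the next interval in the list.
        while (index_ + 1 < intervals_.size() && intervals_[index_ + 1] != 0 &&
               now - episodeStart_ >= intervals_[index_ + 1])
            ++index_;
        if (now - lastTrue_ >= intervals_[index_]) {
            lastTrue_ = now;
            return true;
        }
        return false;
    }

private:
    std::array<long, 4> intervals_;
    std::size_t index_ = 0;
    bool fresh_ = true;
    long lastCall_ = 0, episodeStart_ = 0, lastTrue_ = 0;
};
```

Under these assumptions, an instance built with intervals 60 and 900 and checked once a second fires at t=0 and then every 60 seconds through t=840; from t=900 on the 15-minute interval applies, so the next report comes at t=1740. A gap of more than 15 minutes between checks at that point starts over at the 60-second interval.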
        
-- Ian




