From owner-freebsd-hackers Thu Jan 28 10:04:51 1999
Return-Path:
Received: (from majordom@localhost) by hub.freebsd.org (8.8.8/8.8.8) id KAA19772 for freebsd-hackers-outgoing; Thu, 28 Jan 1999 10:04:51 -0800 (PST) (envelope-from owner-freebsd-hackers@FreeBSD.ORG)
Received: from apollo.backplane.com (apollo.backplane.com [209.157.86.2]) by hub.freebsd.org (8.8.8/8.8.8) with ESMTP id KAA19757 for ; Thu, 28 Jan 1999 10:04:45 -0800 (PST) (envelope-from dillon@apollo.backplane.com)
Received: (from dillon@localhost) by apollo.backplane.com (8.9.2/8.9.1) id KAA09766; Thu, 28 Jan 1999 10:04:41 -0800 (PST) (envelope-from dillon)
Date: Thu, 28 Jan 1999 10:04:41 -0800 (PST)
From: Matthew Dillon
Message-Id: <199901281804.KAA09766@apollo.backplane.com>
To: "John S. Dyson"
Cc: wes@softweyr.com, toasty@home.dragondata.com, hackers@FreeBSD.ORG
Subject: Re: High Load cron patches - comments?
References: <199901281734.MAA21561@y.dyson.net>
Sender: owner-freebsd-hackers@FreeBSD.ORG
Precedence: bulk
X-Loop: FreeBSD.ORG

:Throttling fork rate is also a valuable tool, and maybe a hard limit is good
:also. It is all about how creative you are (or want to be) in your solution
:-).

Throttling the fork rate immediately leads to complaints. The perception of load is easily as important as the reality. We had put fork rate limits on both sendmail and popper, and the result was hundreds of calls to tech support :-(. I even had load-based feedback mechanisms. It was a disaster.

The issue is that the load is an interactive load, not a batch load -- it is not acceptable to accept a connection and then pause for 5 minutes before yielding a shell prompt, processing a popper request, or even responding with an SMTP HELO. Or handling a web request. The machine *must* be able to handle a temporary overload. Even *mail delivery* is an interactive load -- users have come to expect their email to propagate in 5 minutes or less, and if it doesn't, we get complaints.

While BEST is certainly not indicative of all situations, we cover the spectrum pretty well for general-use server installations: there are shell/web machines, mail servers, mail frontend and backend boxes, mailing list servers, news feed boxes, DNS boxes, radius boxes, etc etc etc. Each one operates under load differently and requires different hard limits. In BEST's earlier days, functions were combined ( we didn't have the money to buy lots of machines ). For example, the mail machines are tuned with fork limits such that sendmail is able to eat around 90% of the machine's resources worst case, but sendmail on the shell/web servers is tuned with fork limits such that it can't eat more than 50%.

The only thing that ever worked reliably was absolute limits. The internet is so bursty that a machine *must* be able to accept a high load or overload situation for upwards of 10 or 15 minutes *without* slapping limits on processes. That is, it must allow the processes to build up in such burst-load situations. An absolute limit works extremely well for this sort of response requirement. It says, in effect, "I will allow you to overload the machine to a point, as long as you can recover from it eventually". i.e. even though you are not allowing any one subsystem to overload the machine on its own, summing all the hard limits together yields a number > 100%, so, in effect, you are allowing a subsystem to take the machine over the top.
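To make the distinction concrete, here is a minimal C sketch of the two policies being contrasted: an absolute cap on concurrent children versus a per-second fork throttle. The bare accept loop and the MAXCHILD / MAXFORKS_PER_SEC knobs are hypothetical illustrations only, not the actual tuning used on BEST's machines.

    /*
     * Sketch of a forking server's master loop. MAXCHILD and
     * MAXFORKS_PER_SEC are made-up numbers for illustration.
     */
    #include <sys/types.h>
    #include <sys/wait.h>
    #include <signal.h>
    #include <time.h>
    #include <unistd.h>

    #define MAXCHILD         200    /* absolute cap on live children */
    #define MAXFORKS_PER_SEC 10     /* rate throttle, for contrast   */

    static volatile sig_atomic_t nchildren;

    static void
    reap(int sig)
    {
        (void)sig;
        while (waitpid(-1, NULL, WNOHANG) > 0)
            --nchildren;
    }

    /*
     * Absolute-limit policy: a burst may pile children up to MAXCHILD
     * and the box may run heavily loaded for a while, but it can never
     * be pushed past the point it cannot recover from.  Below the cap
     * every request is served immediately, so responsiveness survives
     * the burst.
     */
    static int
    can_fork_absolute(void)
    {
        return (nchildren < MAXCHILD);
    }

    /*
     * Rate-throttle policy: forks are spread out in time regardless of
     * how many children are actually running, so during a burst new
     * connections sit and wait even though the machine could still
     * absorb them -- the "pause before the shell prompt / HELO"
     * problem described above.
     */
    static int
    can_fork_ratelimited(void)
    {
        static time_t window;
        static int    forks_this_sec;
        time_t now = time(NULL);

        if (now != window) {
            window = now;
            forks_this_sec = 0;
        }
        if (forks_this_sec >= MAXFORKS_PER_SEC)
            return (0);
        ++forks_this_sec;
        return (1);
    }

    int
    main(void)
    {
        signal(SIGCHLD, reap);

        for (;;) {
            /* accept_connection() would block for the next client here. */
            if (!can_fork_absolute()) {     /* or can_fork_ratelimited() */
                sleep(1);                   /* back off until children exit */
                continue;
            }
            switch (fork()) {
            case -1:
                sleep(1);                   /* fork failed; try again */
                break;
            case 0:
                /* child: handle_request(); */
                _exit(0);
            default:
                ++nchildren;
            }
        }
    }

Summing a MAXCHILD-style cap across sendmail, popper, httpd, and the rest can deliberately exceed what the box can sustain at once, which is the "> 100%" effect just described.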
In effect, allowing a machine's load to pass 25 for a few minutes is perfectly acceptable so long as the machine can recover, but slapping load-limiting restrictions on, say, forks ( this being different from the absolute limit ) simply creates a cascade-failure situation earlier, one that might have been avoided if you had let the machine run with it a little longer. The absolute limit in effect allows the machine to temporarily overload while still maintaining responsiveness, and operates on the assumption that the 'burst' will not last forever. Since the burst is already generating a higher load than you would nominally allow, this temporary overloading will do a better job for a short period of time.

If the temporary overloading becomes more permanent, *both* the absolute-limit methodology and the dynamic-feedback limiting methodology have the same problem: You've run out of cpu, or memory, or disk I/O, or all three... and no matter what you do you will piss some customer off. Both methodologies can prevent a machine from going poof, but I firmly believe that the absolute-limit methodology will allow a machine to recover much more quickly from an overload than a dynamic balancing methodology. That is my experience, anyway.

						-Matt

					Matthew Dillon

:--
:John                  | Never try to teach a pig to sing,
:dyson@iquest.net      | it makes one look stupid
:jdyson@nc.com         | and it irritates the pig.
:

To Unsubscribe: send mail to majordomo@FreeBSD.org
with "unsubscribe freebsd-hackers" in the body of the message