From: Terry Lambert <tlambert2@mindspring.com>
Date: Fri, 13 Jul 2001 13:03:59 -0700
To: Leo Bicknell
Cc: freebsd-hackers@FreeBSD.ORG
Subject: Re: Network performance roadmap.

Leo Bicknell wrote:
>
> > > B) When the system runs out of MBUF's, really bad things happen.  It
> > >    would be nice to make the system handle MBUF exhaustion in a nicer
> > >    way, or avoid it.
> >
> > The easiest way to do this is to know ahead of time how many
> > you _really_ have.  Then bad things don't happen.
>
> Clearly not true.  The system knows how many it has today, at compile
> time in fact, and takes no steps to keep them from being exhausted.
> You'll notice I proposed a mechanism to keep them from being exhausted,
> a mechanism that degrades performance in a very gentle manner when the
> limit is reached.

I run a system where I not only allocate the page mappings, I also
allocate the mbufs, at boot time.  When I run out of mbufs, I do not
have "bad things happen".

The "bad things" are an artifact of memory overcommit; if you remove
the overcommit, so that the backing pages always exist, you never have
problems.  It is only when you have a page mapping, with no pages to
back it, that you have a problem.

In other words, the problem does not exist because of the number of
mbufs; it exists because you have a mapping for the page for the
allocation, and you do not ensure that there is a page backing it
before you give out an allocation from the map.

You could fix this in FreeBSD by making the zone allocator verify that
a backing page is mapped _before_ it gives out the allocation to the
caller, for interrupt zones, where the allocation is not permitted to
sleep by virtue of having been called in an interrupt handler.  It
could then fail the allocation, instead of returning an allocation in
a mapping for which there was not yet a backing page and, under
conditions of memory exhaustion (e.g. when your kernel config claims
more mbufs than you will have backing pages available to satisfy),
leaving it to the caller to touch the page and panic its little brains
out when it gets an unsatisfiable fault from memory which it believes
was successfully allocated.
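A rough sketch of the kind of check I mean, in the interrupt-time
("cannot sleep") allocation path.  The names and structure here are
made up for illustration; this is not the actual zone allocator code:

    #include <stddef.h>
    #include <stdbool.h>

    /*
     * Toy model only: never hand out an item unless a physical page
     * is already behind it; fail cleanly instead, since we may not
     * sleep to get a page at interrupt time.
     */
    struct zone_item {
            bool    resident;       /* backing page already wired? */
            bool    in_use;
            char    storage[256];
    };

    #define ZONE_NITEMS     128
    static struct zone_item zone[ZONE_NITEMS];

    void *
    zalloc_intr(void)
    {
            int i;

            for (i = 0; i < ZONE_NITEMS; i++) {
                    if (zone[i].in_use || !zone[i].resident)
                            continue;       /* skip unbacked items */
                    zone[i].in_use = true;
                    return (zone[i].storage);
            }
            /*
             * Zone exhausted, or no free item has a backing page:
             * return failure rather than memory the caller will
             * fault on later.
             */
            return (NULL);
    }

The caller then sees an ordinary allocation failure it can check for,
instead of a fault it cannot recover from in interrupt context.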
Alternately, you could configure your kernel with a small enough
number of mbufs that that situation never arises: don't lie to it
about how much RAM it really has by picking a number of mbufs so large
that, under the maximum dirty page load user processes can create, you
are going to be unable to satisfy the request.

The only other alternative is to force swap of dirty pages; to do
this, you would have to suspend the network interrupt, and not
re-enable incoming network interrupts (which will all require mbuf
allocations in the driver to refill the receive ring) until you've
recovered some pages.  This screws up (per my previous post) when you
are swapping over the network.  It also screws up when you have no
more local swap (e.g. both swap and memory have been overcommitted,
not just swap).

> > Socket buffers are set at boot time.  Read the code.  Same for
> > maximum number of connections: you can hop around until you
> > are blue in the face from typing "sysctl", but it will not
> > change the number of tcpcb's and inpcb's, etc..  This is an
> > artifact of the allocator.
>
> Right, and as I said before, these are not a limiting resource.
> The problem is not even a lack of MBUF's (i.e., we don't really need
> more); we just need to be more intelligent about how we use them
> per connection.  I'm curious where you got the impression that
> other things need to be changed.  None of the papers, including
> the ones you mention, suggest that other items need to be changed
> to support high bandwidth data connections.

By changing them.  I have servers that can support 1,000,000
concurrent connections.  They are based on FreeBSD running on 4GB
memory systems with two 1Gbit NICs.

This is why all the hand-waving and the suggestions for substantial
(and, from empirical practice, unnecessary) changes in the FreeBSD
stack are making me so leery.  This is also why I'm suggesting that it
be done in a research setting first, rather than applying the changes
to the mainline FreeBSD source tree and just assuming that they'll
work.

> > Having larger transmit windows is really dependent on the
> > type of traffic you expect to serve; in the HTTP case, the
> > studies indicate that the majority of objects served are
> > less than 8k in size.  Most browsers (except Opera) do
> > not support PIPELINING.
>
> So we should optimize for HTTP, and tell the people running
> FTP servers, or news servers, or home desktops sharing files
> with friends that "tough, we like big web servers"?

No.  I'm saying that you can't get away from tuning for expected load
without a hell of a lot of work, work that is not even being addressed
in the context of this discussion.

> Let's find a solution that works for all of the above.

That would be nice; first of all, you will need to get over your
aversion to working on kernel memory allocators (;-)), since the only
way to set things up for variable loads is to take away the fixed
nature of the allocations which are needed to tune for those loads.

You can't apply hysteresis when your allocations are type-stable, and
they "freeze" your memory in a given state for all time.  That's like
making a bunch of clay pots, throwing them, firing them, and then
deciding that what you really wanted was coffee mugs or a statue: once
the clay is fired, you are stuck with the pots.
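To make the contrast concrete, here is a rough sketch of what
hysteresis on a pool looks like; the names, watermarks, and numbers
are invented for illustration, not taken from the tree.  A type-stable
zone simply has no shrink path: once a page goes into the zone, it
holds that object type forever.

    /*
     * Illustration only: a pool that grows when free items fall below
     * a low watermark and gives pages back when free items climb above
     * a high watermark.  The "give back" branch is the step a
     * type-stable allocator can never take.
     */
    #define POOL_LOW_WATER   64     /* grow below this many free items */
    #define POOL_HIGH_WATER 512     /* shrink above this many free items */
    #define POOL_CHUNK       64     /* items added/removed per adjustment */

    struct pool {
            int     nfree;          /* items currently free */
            int     nitems;         /* items currently backed by pages */
    };

    static void
    pool_adjust(struct pool *p)
    {
            if (p->nfree < POOL_LOW_WATER) {
                    /* Demand is up: commit more pages to this pool. */
                    p->nitems += POOL_CHUNK;
                    p->nfree += POOL_CHUNK;
            } else if (p->nfree > POOL_HIGH_WATER) {
                    /*
                     * Demand fell off: release pages so a different
                     * load (a different object type) can use them.
                     */
                    p->nitems -= POOL_CHUNK;
                    p->nfree -= POOL_CHUNK;
            }
    }

With something like that in place, memory can move between object
types as the load changes, instead of staying frozen in whatever shape
it was tuned for at boot.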
> > Only after you have proven that some significant fraction
> > of traffic actually ends up hitting the window size limits,
> > should you make this change to FreeBSD proper.
>
> "Significant fraction" will change with the server you monitor.
> I'll bet, for instance, most all hub news servers hit the per
> window limit on every connection, as they are sending large
> streaming amounts of bulk data.  I bet FTP sites hit the problem
> for well more than 10% of their clients, as the people likely
> to download the 100 Meg demo of XYZ Shoot-Em-Up are unlikely
> to be on a modem.

Well, I never said to run it on one server type.  You are getting to
the point of needing empirical data on tuning parameters.  This is no
good.  You need the empirical data, but it should not be applied to
tuning parameters globally; it should be applied to them on a
case-by-case basis, per server installation.

The only way around this is to bite the bullet, and do the right
thing.  Failure to do that means that you are subject to denial of
service attacks based on your tuning parameters, so while you may run
OK in the case of needing a lot of HTTP connections with small
windows, someone can panic your system by advertising very large
windows and then giving you many 2MB HTTP requests.  Normal HTTP
requests are not that large, but your approach means that I can push
the window size up beyond what is normal, if I wish to beat up your
server to get it to run out of mbufs and crash.

> Again, there's a solution here that works for everyone.

If everyone on the internet plays nice, I will agree.

> > One good way to prevent this is to not unreasonably set
> > your window size... 8-p.
>
> Ah, I see, so to prevent MBUF exhaustion I should not let
> my socket buffers get large.  Sort of like to prevent serious
> injury in a car crash I should drive at 10MPH on the freeway.

Or 55MPH.  Or 65MPH.  Whatever your local limit is, it is also
administrative, and quite arbitrary.  Many cars are safe at much, much
faster speeds, as long as someone doesn't decide to drive at 50MPH in
the fast lane, so that your rate of closure is 70MPH+.

> Performance limits to save a system from crashing should be
> a last resort.

It should be the last resort.  But you will need to change things so
that it is _physically impossible_ for someone to drive 50MPH in the
fast lane, or physically impossible for a 70MPH collision with a
stationary object to cause damage.

-- Terry