From owner-freebsd-hackers  Wed Jun 20 23:14:59 2001
Delivered-To: freebsd-hackers@freebsd.org
Received: from scaup.mail.pas.earthlink.net (scaup.mail.pas.earthlink.net [207.217.121.49])
	by hub.freebsd.org (Postfix) with ESMTP id 31DEA37B401
	for <freebsd-hackers@FreeBSD.ORG>; Wed, 20 Jun 2001 23:14:53 -0700 (PDT)
	(envelope-from tlambert2@mindspring.com)
Received: from mindspring.com (dialup-209.247.140.53.Dial1.SanJose1.Level3.net [209.247.140.53])
	by scaup.mail.pas.earthlink.net (EL-8_9_3_3/8.9.3) with ESMTP id XAA22491;
	Wed, 20 Jun 2001 23:14:17 -0700 (PDT)
Message-ID: <3B3190D9.D38B903D@mindspring.com>
Date: Wed, 20 Jun 2001 23:14:49 -0700
From: Terry Lambert <tlambert2@mindspring.com>
Reply-To: tlambert2@mindspring.com
X-Mailer: Mozilla 4.7 [en]C-CCK-MCD {Sony}  (Win98; U)
X-Accept-Language: en
MIME-Version: 1.0
To: Rik van Riel <riel@conectiva.com.br>
Cc: Matt Dillon <dillon@earth.backplane.com>,
	"Ashutosh S. Rajekar" <asr@softhome.net>, freebsd-hackers@FreeBSD.ORG
Subject: Re: max kernel memory
References: <Pine.LNX.4.33.0106201904480.1376-100000@duckman.distro.conectiva>
Content-Type: text/plain; charset=us-ascii
Content-Transfer-Encoding: 7bit
Sender: owner-freebsd-hackers@FreeBSD.ORG
Precedence: bulk
List-ID: <freebsd-hackers.FreeBSD.ORG>
List-Archive: <http://docs.freebsd.org/mail/> (Web Archive)
List-Help: <mailto:majordomo?subject=help> (List Instructions)
List-Subscribe: <mailto:majordomo?subject=subscribe%20freebsd-hackers>
List-Unsubscribe: <mailto:majordomo?subject=unsubscribe%20freebsd-hackers>
X-Loop: FreeBSD.ORG

Rik van Riel wrote:
> On Wed, 20 Jun 2001, Matt Dillon wrote:
> 
> >     I don't think this represents the biggest problem
> >     you would face, though.  It is far more likely that
> >     hung or slow connections (e.g. the originator goes
> >     away without disconnecting the socket or the
> >     originator is on a slow link) will represent the
> >     biggest potential problem.  It's too bad we can't
> >     'swap out' socket buffers!
> 
> Even that wouldn't save you from running into address
> space issues with the kernel, unless you replace all
> pointers with other kinds of indices ... but that'll
> probably make things messy.

Not really, though I could see how you'd think that,
coming from a Linux background, and given the stack
rewrite to the current level, there.

The Linux VM system approach is to impose a simplified
model, in order to make it easier to be cross-platform;
FWIW, I follow Linux development very, very closely:
they tend to implement ideas mentioned on FreeBSD lists
by myself and others more quickly than FreeBSD itself
does, but tend to do so with a certain lack of academic
rigor.

The FreeBSD model grew up out of the idea of doing the most
state-of-the-art implementation possible with hardware
assistance (John Dyson's work on the unified VM and
buffer cache predated all such non-academic work in
all commercial UNIX implementations by almost two years,
and included cache coloring, which was a brand new
concept, at the time).  FreeBSD has grown across Alpha
and other platforms by emulating this sophistication in
software, on systems where there was not immediately
available hardware support.  It has a number of locore.s
and machdep.c and pmap.c warts that need trimming, but
all in all, it is very sophisticated, at the lowest
levels.

This is _NOT_ intended as a Linux put-down: you have
two approaches to a growing kid when it comes to new
shoes: buy one size larger, and hope the child will
grow into them, letting them walk around with big
floppy shoes for as long as it takes (early [or
premature] implementation), or wait until the child
out-grows the shoes it currently has, and starts to
have problems with in-grown toenails ([hopefully]
just in time implementation [sometimes it means bare
feet for the summer]).

Back to swapping socket structures...

You could swap them if you wanted to give up some KVA
space to be able to do it.  The ipcb and tcpcb allocations
happen at the point they do precisely to permit swapping,
which leaves sockets and templates as the major bugaboos.

I personally do not think that that is worth it: the
architecture you are suggesting is a strawman, and it
represents a poor design for scaling, unless you are
going to bite the bullet and use a 16M segmented AMD
processor to give yourself more KVA space.
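
As a back-of-the-envelope sketch of why per-connection
state runs into KVA limits: every figure below (1 GB of
usable KVA, 1 KB of fixed per-connection state, 32 KB of
buffer space) is an illustrative assumption, not a
measured FreeBSD value, and max_connections is a
hypothetical helper, not a kernel interface.

```python
def max_connections(kva_bytes, fixed_bytes, sockbuf_bytes):
    """Upper bound on simultaneous connections if every connection
    keeps its fixed state and its buffers resident in KVA."""
    return kva_bytes // (fixed_bytes + sockbuf_bytes)


KVA = 1 << 30          # assumed: 1 GB of kernel virtual address space
FIXED = 1024           # assumed: socket + control blocks per connection
SOCKBUF = 32 * 1024    # assumed: buffer space reserved per connection

# Every connection, hung or not, pins its buffers.
pinned = max_connections(KVA, FIXED, SOCKBUF)     # 31775

# Idle connections hold only their fixed state.
deferred = max_connections(KVA, FIXED, 0)         # 1048576
```

Under these assumed numbers the buffers, not the control
blocks, set the ceiling, which is why delaying or
avoiding their allocation matters more than swapping.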


For slow connections, you can delay instantiation of
the actual socket; Ashutosh suggested this a short time
ago.  In fact, the OpenBSD, NetBSD, and BSDI code all
support this today, in the form of a "SYN cache"; in
Ashutosh's suggestion, he wanted to be somewhat more
aggressive; instead of caching the SYN until the first
ACK, he suggested caching the SYN, ACK, and SYNACK
until the first data.

Note that a "SYN" cache was intended to aid in doing
load-shedding and increasing the resistance to the
SYN-flood attacks: the existing implementations were
intended for those reasons.  The more aggressive
method proposed allows load scaling.  Either approach
increases latency, but at those load levels, you
probably care more about scaling.
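
A minimal sketch of the two schemes, in Python rather
than kernel C for brevity.  This is not the OpenBSD,
NetBSD, or BSDI code: SynCache, ConnKey, CacheEntry, and
the defer_until_data flag are hypothetical names, and the
"socket" is a plain dict standing in for the full socket
and control blocks.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ConnKey:
    src: str
    sport: int
    dst: str
    dport: int


@dataclass
class CacheEntry:
    irs: int                   # peer's initial sequence number
    iss: int                   # our initial sequence number
    established: bool = False  # handshake ACK seen, socket not yet built


class SynCache:
    def __init__(self, limit, defer_until_data=False):
        self.limit = limit
        self.defer_until_data = defer_until_data
        self.entries = {}      # small per-handshake records, oldest first
        self.sockets = {}      # stand-in for fully instantiated sockets

    def on_syn(self, key, irs, iss):
        if key in self.entries:
            return             # retransmitted SYN: keep the existing entry
        if len(self.entries) >= self.limit:
            # Load shedding: drop the oldest pending handshake.
            del self.entries[next(iter(self.entries))]
        self.entries[key] = CacheEntry(irs, iss)

    def on_ack(self, key, ack, payload=b""):
        """Return True if this segment caused a socket to be created."""
        entry = self.entries.get(key)
        if entry is None:
            return False
        if not entry.established:
            if ack != (entry.iss + 1) & 0xFFFFFFFF:
                return False   # does not acknowledge our SYN-ACK
            entry.established = True
        if self.defer_until_data and not payload:
            return False       # aggressive scheme: wait for first data
        del self.entries[key]
        self.sockets[key] = {"irs": entry.irs, "iss": entry.iss,
                             "rx": payload}
        return True
```

With defer_until_data left False this behaves like a
plain SYN cache (socket created on the first valid ACK);
set to True it holds only the small entry until the peer
actually sends data.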

There are also other, more modern techniques.  Ashutosh
implied that the NetScaler box does layer 2 forwarding
(this is not the correct technical name for it); from
their description of their "Patent Pending IMP technology"
(for which I think it's possible to demonstrate prior art
back as far as 1996: the technical reports are available
on the web), they really need to do connection aggregation,
which can't be done without locally terminating the TCP
endpoints.  I think their "millions of connections" equals
a number of boxes ganged together to get to that level,
or they have purpose-built hardware to do the work;
perhaps that's why they are supposedly "hurting", though
I've seen no evidence of that (or against it), unless
their job listings are meaningful.

There's code to do much of this already (much of it from
commercial work that's already completed), and a lot of it
is going to find its way back into FreeBSD, if FreeBSD
wants it.  There are one or two experimental Linux
versions that do it as well, for which I've only seen the
technical reports, and for which the authors are being
really circumspect on releasing the code (if you were an
academic who wanted to make money after demonstrating a
good idea, and thought implementation would be a barrier
to competition, you'd publish to get venture funding, but
port the code to some place you didn't have to give source
out, too).


The really fundamental problems with FreeBSD at this
point devolve down to some moderately easily repaired
historical artifacts in its VM architecture and allocation
techniques and policies, as well as administrative
limits for "general purpose" use being the defaults,
with no way to "autotune" based on workload.  Most of
the fixes have been known in the literature since the
early and mid 1990s (though some are more recent).

Even if you "autotuned", you would run into the default
administrative limits that most people would be unhappy
changing, since it would make the system very poor
interactively under a heterogeneous workload.

Most of the tuning that people seem to want is for
homogeneous workloads, where there are a small number
of programs, but maybe a large number of instances.
Things in this category include benchmarks, Apache or
mail servers, etc. -- role based dedicated boxes, for
which the administrative limits make considerably less
sense.

Right now, to get those, you have to know what you are
doing and what works and why, and tune.

-- Terry
