Date:      Wed, 04 Jul 2001 13:55:39 -0700
From:      Terry Lambert <tlambert2@mindspring.com>
To:        Len Conrad <LConrad@Go2France.com>
Cc:        freebsd-hackers@FreeBSD.ORG, freebsd-stable@FreeBSD.ORG
Subject:   Re: helping Wietse help postfix on FreeBSD
Message-ID:  <3B4382CB.164B607A@mindspring.com>
References:  <5.1.0.14.0.20010703230504.02f8fe50@mail.Go2France.com>

Len Conrad wrote:
>
> I'm trying to gather tuning information for Wietse Venema who says:
>
> "I'm writing a document that describes how to crank up FreeBSD so
> that it can run lots of processes, and so that it can handle lots
> of connections.
>
> Right now, these guidelines vary from sysctl, loader.conf, to
> recompiling a kernel. This is confusing."

This posting will probably piss a lot of people off,
even though it's only a fraction of what you should
actually be looking at, and I'm intentionally omitting
many things that let me get the numbers I do, until I
can push them even higher out of reach.  Such is war.
8-).

--

MOST IMPORTANT POINT: You are going on "Mr. Toad's Wild
Kernel Recompilation Ride"; get over it: sit back, and
you might even enjoy it.


You are going to find you are constrained by memory;
it does not help that FreeBSD has vastly bloated many
structures in support of kqueue and similar things,
instead of instituting unions and reusing fields, and
using muxes for things, instead of individual callbacks.

I don't suggest you rewrite your allocator unless you
know exactly what you are doing; you can still get high
numbers, but nowhere in the ballpark of the numbers
I've been able to get (e.g. 1 million). I do _NOT_
recommend that you try to beat my numbers, unless you
have 16 years kernel experience and about a month to
sift through and understand the code, and another couple
of months to rewrite everything that doesn't scale.

You can increase the KVA space, but the documentation
in the handbook that talks about how to do this is actually
woefully inadequate, since it misses several "magic"
numbers, and fails to give derivation for the others;
I would prefer that the code be fixed, so I'm not going
to document the process here.

Another thing you can do is crank up the maximum number
of open files.  For networking, this _MUST_ be done
before the tcpcb's, struct sockets, and inpcb's have had
space allocated, which means at boot time, if you want
it to actually apply to network connections.  If you tune
this at run time, your connections will remain limited to
the value at boot time.

Because you cannot tune this value in loader.conf without
using the patch I posted for /sys/conf/param.c, and since
no one has committed it (people complained about how it was
done, but didn't provide their own code to do it any better),
you are basically stuck with rebuilding your kernel with a
high "MAXFILES" in your config.  Your kernel will use this
value to set somaxconn (and the allocation of sockets),
which in turn will determine the number of inpcb's and
tcpcb's, which will then limit the number of network
connections you can have simultaneously.
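
In concrete terms, that means something along these lines in
your kernel config (the value is only illustrative; where it
feeds the socket and pcb limits is /sys/conf/param.c, so check
that on your branch):

    options         MAXFILES=32000  # global open-file limit, fixed at build time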

If you are using FreeBSD 4.3-RELEASE, you _MUST NOT_ set
your maximum files above ~32000.  This gives you some
headroom from the maximum value of an unsigned short
reference count in the ucred structure, since ucred's
are inherited, and there is one reference per file
descriptor, and one per socket (personally, I just don't
get the reasoning behind the socket reference, since the
only way you can access the thing is through a file; the
places where it's tested seem pretty stupid to me).  I
posted a patch for this, too, to -arch and -current; I
think it was effectively brought in as an "MFC" to the
-STABLE branch, but don't quote me: look in cred.h instead.

You can hack your cred structure to use an unsigned long,
instead, but you will have to rebuild *everything*, and
say "hello ports" and "goodbye, packages".

Don't expect to be able to make more than 64k of outbound
connections, unless you are willing to rewrite the port
allocation hashing code, since everything is allocated
out of the same collision domain, unperturbed by having
multiple IP addresses.  Inbound connections are not a
problem, since they don't use up local ports.  Software
can work around this problem (I routinely load test from
one of my desktop boxes at ~180,000 client connections
from 3 virtual IP addresses), but to do so, you have to
understand _exactly_ how the allocation works, and dance
yourself through all the limiting "if" tests from user
space -- not for the faint of heart (ever play one of
those "milk bottle fishing" games at a county fair?).

MBUFS and NMBCLUSTERS almost go without saying...
You will use one mbuf per connection for keepalives,
and may spend more, if you set non-default options.
Mike posted a patch for this, which gets rid of the
TCPTMPL structure.  This patch helps scaling, but it
turns out that it's an incredible performance loss
in testing.  Much better to use something like my
chain allocator, and recover 196MB of memory (at a
million connections), and leave the tcptmpl there; it
turns out that there are at least 4 "interesting"
uses for the thing, aside from keepalive.  Oh, yeah:
his patch also kills TCP_COMPAT_42: also an incredibly
bad thing.
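
If you size the cluster pool at build time, it looks like this
(the figure is illustrative; clusters are 2K apiece, so this one
reserves on the order of 128MB of map space, and one mbuf per
idle connection is the floor, more with non-default socket
options):

    options         NMBCLUSTERS=65536   # mbuf cluster pool, sized at boot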

I will say: DO NOT USE A BIG "MAXUSERS"!  This cranks up
many things your server couldn't care less about, and
costs you horribly.  Look at what "MAXUSERS" does to
various values, instead, and what options you will need
to use to override them.
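
What "MAXUSERS" actually drives is a handful of formulas in
/sys/conf/param.c, roughly like these (read your own copy rather
than trusting my memory of them), which is why overriding the
individual options is the better move:

    /* /sys/conf/param.c, 4.x-era, abbreviated: */
    #define NPROC           (20 + 16 * MAXUSERS)    /* -> maxproc          */
    #define MAXFILES        (NPROC * 2)             /* -> maxfiles default */
    /* nmbclusters, socket buffer limits, the callout wheel, and more
       all key off the same knob. */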

You will want to enable TCP_COMPAT_42.  If you don't,
you will find all your sockets pack up eventually
because of the sequence number going backward.  The use
of random ISS was an amazingly stupid idea, and when it
"goes backward", everything goes to hell in the TCP
finite state machine, particularly under load.  No heavy
load server can afford random ISS.  Even an "increase
only" random number significantly reduces the cycle time;
without going to 64 bit sequence numbers, on a 100mbit
link, you are talking a sequence space wrap time of less
than the TIME_WAIT interval.
I wish "security" people understood TCP/IP better.

I don't think diddling "MAXPROC" will do you much good,
but go ahead, if you plan on fork'ing all over the place
as part of your scaling strategy (you would be much better
off with a finite state automaton, with a small state
per connection, and kqueue, to interleave all the I/O;
save on context switches that way, too).
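
For what it's worth, the shape of that alternative is a single
event loop instead of a process per connection; a minimal sketch
(mine, error handling omitted, with accept_conn() and
advance_conn() standing in for whatever your protocol handling
actually is):

    #include <sys/types.h>
    #include <sys/event.h>
    #include <sys/time.h>

    struct conn {
            int     fd;             /* the connection's descriptor       */
            int     phase;          /* where it is in the protocol (FSA) */
    };

    struct conn     *accept_conn(int listen_fd);    /* hypothetical */
    void             advance_conn(struct conn *c);  /* hypothetical */

    static void
    event_loop(int listen_fd)
    {
            struct kevent   ev, ready[64];
            int             kq, i, n;

            kq = kqueue();

            /* watch the listening socket for new connections */
            EV_SET(&ev, listen_fd, EVFILT_READ, EV_ADD, 0, 0, NULL);
            kevent(kq, &ev, 1, NULL, 0, NULL);

            for (;;) {
                    n = kevent(kq, NULL, 0, ready, 64, NULL);
                    for (i = 0; i < n; i++) {
                            if ((int)ready[i].ident == listen_fd) {
                                    /* new connection: small state record,
                                       registered with the same kqueue */
                                    struct conn *c = accept_conn(listen_fd);
                                    EV_SET(&ev, c->fd, EVFILT_READ,
                                        EV_ADD, 0, 0, c);
                                    kevent(kq, &ev, 1, NULL, 0, NULL);
                            } else {
                                    /* step that connection's state machine;
                                       no fork, no context switch */
                                    advance_conn((struct conn *)ready[i].udata);
                            }
                    }
            }
    }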

BTW: FreeBSD should separate code and data segment
switching, for when someone fork-bombs -- er, scales --
this way.  It will save much TLB shootdown and cache
flushing.

Finally, I'm aware that Postfix does "pig tricks", such
as turning off SO_KEEPALIVE, and perhaps also unsetting the
"always_keepalive" sysctl variable so that the option will
actually do what the man page says it does.  Don't do that:
you will eventually lock up the protocol state machine on
one end or both.
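
The knob in question, for reference (the point being to leave it
set, not to unset it):

    sysctl -w net.inet.tcp.always_keepalive=1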

-- Terry
