From: Terry Lambert
Reply-To: tlambert2@mindspring.com
Date: Wed, 04 Jul 2001 13:55:39 -0700
To: Len Conrad
Cc: freebsd-hackers@FreeBSD.ORG, freebsd-stable@FreeBSD.ORG
Subject: Re: helping Wietse help postfix on FreeBSD
Message-ID: <3B4382CB.164B607A@mindspring.com>
References: <5.1.0.14.0.20010703230504.02f8fe50@mail.Go2France.com>

Len Conrad wrote:
>
> I'm trying to gather tuning information for Wietse Venema who says:
>
> "I'm writing a document that describes how to crank up FreeBSD so
> that it can run lots of processes, and so that it can handle lots
> of connections.
>
> Right now, these guidelines vary from sysctl, loader.conf, to
> recompiling a kernel. This is confusing."

This posting will probably piss a lot of people off, even though it's only a fraction of what you should actually be looking at, and I'm intentionally omitting many things that let me get the numbers I do, until I can push them even higher out of reach. Telle est guerre. 8-).

--

MOST IMPORTANT POINT: You are going on "Mr.
Toad's Wild Kernel Recompilation Ride"; get over it: sit back, and you might even enjoy it.

You are going to find you are constrained by memory; it does not help that FreeBSD has vastly bloated many structures in support of kqueue and similar things, instead of instituting unions and reusing fields, and using muxes instead of individual callbacks. I don't suggest you rewrite your allocator unless you know exactly what you are doing; you can still get high numbers, but nowhere in the ballpark of the numbers I've been able to get (e.g. 1 million connections). I do _NOT_ recommend that you try to beat my numbers, unless you have 16 years of kernel experience and about a month to sift through and understand the code, and another couple of months to rewrite everything that doesn't scale.

You can increase the KVA space, but the documentation in the handbook that talks about how to do this is actually woefully inadequate, since it misses several "magic" numbers, and fails to give a derivation for the others; I would prefer that the code be fixed, so I'm not going to document the process here.

Another thing you can do is crank up the maximum number of open files. For networking, this _MUST_ be done before the tcpcb's, struct sockets, and inpcb's have had space allocated, which means at boot time, if you want it to actually apply to network connections. If you tune this at run time, your connections will remain limited to the value set at boot time. Because you cannot tune this value in loader.conf without using the patch I posted for /sys/conf/param.c, and since no one has committed it (people complained about how it was done, but didn't provide their own code to do it any better), you are basically stuck with rebuilding your kernel with a high "MAXFILES" in your config.
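
A quick way to see what descriptor ceiling a server process actually ended up with is to ask for its resource limit. This is a minimal sketch in Python, not part of the original posting; it assumes a Unix-like system where the `resource` module is available, and the helper name `descriptor_limits` is hypothetical. It reports only the per-process limit; the kernel-wide ceiling discussed above is a separate, kernel-side value that this code does not read.

```python
import resource


def descriptor_limits():
    """Return the (soft, hard) per-process limits on open file descriptors."""
    # RLIMIT_NOFILE is the per-process descriptor limit; it is not the
    # kernel-wide maximum set by MAXFILES at kernel build time.
    soft, hard = resource.getrlimit(resource.RLIMIT_NOFILE)
    return soft, hard


if __name__ == "__main__":
    soft, hard = descriptor_limits()
    print(f"per-process descriptors: soft={soft} hard={hard}")
```

If the soft limit printed here is far below the number of connections you intend to hold, the process will run out of descriptors long before any kernel-side table does.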

Your kernel will use the MAXFILES value to set somaxconn (and the allocation of sockets), which in turn will determine the number of inpcb's and tcpcb's, which will then limit the number of network connections you can have simultaneously.

If you are using FreeBSD 4.3-RELEASE, you _MUST NOT_ set your maximum files above ~32000. This gives you some headroom from the maximum value of an unsigned short reference count in the ucred structure, since ucred's are inherited, and there is one reference per file descriptor, and one per socket (personally, I just don't get the reasoning behind the socket reference, since the only way you can access the thing is through a file; the places where it's tested seem pretty stupid to me). I posted a patch for this, too, to -arch and -current; I think it was effectively brought in as an "MFC" to the -STABLE branch, but don't quote me: look in cred.h instead. You can hack your cred structure to use an unsigned long, instead, but you will have to rebuild *everything*, and say "hello, ports" and "goodbye, packages".

Don't expect to be able to make more than 64k outbound connections, unless you are willing to rewrite the port allocation hashing code, since everything is allocated out of the same collision domain, unperturbed by having multiple IP addresses. Inbound connections are not a problem, since they don't use up local ports. Software can work around this problem (I routinely load test from one of my desktop boxes at ~180,000 client connections from 3 virtual IP addresses), but to do so, you have to understand _exactly_ how the allocation works, and dance yourself through all the limiting "if" tests from user space -- not for the faint of heart (ever play one of those "milk bottle fishing" games at a county fair?).

MBUFS and NMBCLUSTERS almost go without saying... You will use one mbuf per connection for keepalives, and may spend more, if you set non-default options. Mike posted a patch for this, which gets rid of the TCPTMPL structure.
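
The descriptor and port ceilings described above come down to 16-bit arithmetic. The sketch below is illustrative Python, not from the original posting. It assumes the ucred reference count is a 16-bit unsigned short (maximum 65535), that each network connection costs two references (one for the descriptor, one for the socket, as described above), and that 65535 ports per local address is an upper bound reachable only with the user-space workaround mentioned above. All function and constant names are hypothetical.

```python
USHORT_MAX = 2**16 - 1  # largest value an unsigned 16-bit counter can hold


def ucred_refs(connections, refs_per_connection=2):
    """References taken on one shared ucred: one per descriptor, one per socket."""
    return connections * refs_per_connection


def ucred_headroom(maxfiles):
    """Counter values left before the 16-bit reference count would overflow."""
    return USHORT_MAX - ucred_refs(maxfiles)


def outbound_ceiling(local_addresses, ports_per_address=USHORT_MAX):
    """Upper bound on outbound connections if every address gets its own port space."""
    return local_addresses * ports_per_address


if __name__ == "__main__":
    print(ucred_headroom(32000))  # 1535 references of headroom
    print(outbound_ceiling(3))    # 196605, above the ~180,000 quoted
```

Under these assumptions, a limit of ~32000 leaves a little room below the overflow point, while anything much above 32767 would wrap the counter; three addresses give enough port space for the ~180,000 client connections quoted, but only if each address really does get an independent port range.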
Mike's patch helps scaling, but it turns out to be an incredible performance loss in testing. Much better to use something like my chain allocator, and recover 196MB of memory (at a million connections), and leave the tcptmpl there; it turns out that there are at least 4 "interesting" uses for the thing, aside from keepalive. Oh, yeah: his patch also kills TCP_COMPAT_42: also an incredibly bad thing.

I will say: DO NOT USE A BIG "MAXUSERS"! This cranks up many things your server couldn't care less about, and costs you horribly. Look at what "MAXUSERS" does to various values, instead, and what options you will need to use to override them.

You will want to enable TCP_COMPAT_42. If you don't, you will find all your sockets pack up eventually because of the sequence number going backward. The use of random ISS was an amazingly stupid idea, and when it "goes backward", everything goes to hell in the TCP finite state machine, particularly under load. No heavily loaded server can afford random ISS. Even an "increase only" random number significantly reduces the cycle time; without going to 64-bit sequence numbers, on a 100Mbit link, you are talking less than the TIME_WAIT interval. I wish "security" people understood TCP/IP better.

I don't think diddling "MAXPROC" will do you much good, but go ahead, if you plan on fork'ing all over the place as part of your scaling strategy (you would be much better off with a finite state automaton, with a small state per connection, and kqueue, to interleave all the I/O; you save on context switches that way, too). BTW: FreeBSD should separate code and data segment switching, for when someone fork-bombs -- er, scales -- this way. It will save much TLB shootdown and cache flushing.

Finally, I'm aware that Postfix does "pig tricks", such as turning off SO_KEEPALIVE, and perhaps also unsetting the "always_keepalive" sysctl variable so that the option will actually do what the man page says it does.
Don't do that: you will eventually lock up the protocol state machine on one end or both.

-- Terry
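
For reference, leaving keepalives on from user space looks roughly like this. It is a minimal Python sketch, not from the original posting and not Postfix's code; the helper name `make_keepalive_socket` is hypothetical. It sets only the per-socket SO_KEEPALIVE option and leaves the system-wide sysctl alone.

```python
import socket


def make_keepalive_socket():
    """Create a TCP socket with SO_KEEPALIVE switched on."""
    s = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    # Ask the stack to probe idle connections so that a dead peer is
    # eventually detected instead of holding connection state forever.
    s.setsockopt(socket.SOL_SOCKET, socket.SO_KEEPALIVE, 1)
    return s


if __name__ == "__main__":
    s = make_keepalive_socket()
    try:
        enabled = s.getsockopt(socket.SOL_SOCKET, socket.SO_KEEPALIVE) != 0
        print("keepalive enabled:", enabled)
    finally:
        s.close()
```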