From: Terry Lambert
Reply-To: tlambert2@mindspring.com
Date: Wed, 20 Jun 2001 23:14:49 -0700
To: Rik van Riel
Cc: Matt Dillon, "Ashutosh S. Rajekar", freebsd-hackers@FreeBSD.ORG
Subject: Re: max kernel memory
Message-ID: <3B3190D9.D38B903D@mindspring.com>

Rik van Riel wrote:
> On Wed, 20 Jun 2001, Matt Dillon wrote:
>
> > I don't think this represents the biggest problem
> > you would face, though.  It is far more likely that
> > hung or slow connections (e.g. the originator goes
> > away without disconnecting the socket or the
> > originator is on a slow link) will represent the
> > biggest potential problem.  It's too bad we can't
> > 'swap out' socket buffers!
>
> Even that wouldn't save you from running into address
> space issues with the kernel, unless you replace all
> pointers with other kinds of indices ... but that'll
> probably make things messy.

Not really, though I could see how you'd think that,
coming from a Linux background, and given the stack
rewrite to the current level, there.

The Linux VM system approach is to impose a simplified
model, in order to make it easier to be cross-platform;
FWIW, I follow Linux development very, very closely:
they tend to implement ideas mentioned on FreeBSD lists
by myself and others more quickly than FreeBSD itself
does, but tend to do so with a certain lack of academic
rigor.

The FreeBSD model grew up out of the idea of doing the
state of the art implementation possible with hardware
assistance (John Dyson's work on the unified VM and
buffer cache predated all such non-academic work in all
commercial UNIX implementations by almost two years, and
included cache coloring, which was a brand new concept,
at the time).  FreeBSD has grown across Alpha and other
platforms by emulating this sophistication in software,
on systems where there was not immediately available
hardware support.  It has a number of locore.s and
machdep.c and pmap.c warts that need trimming, but all
in all, it is very sophisticated, at the lowest levels.

This is _NOT_ intended as a Linux put-down: you have two
approaches to a growing kid when it comes to new shoes:
buy one size larger, and hope the child will grow into
them, letting them walk around with big floppy shoes for
as long as it takes (early [or premature] implementation),
or wait until the child out-grows the shoes it currently
has, and starts to have problems with in-grown toenails
([hopefully] just in time implementation [sometimes it
means bare feet for the summer]).

Back to swapping socket structures...

You could swap them if you wanted to give up some KVA
space to be able to do it.  The ipcb and tcpcb alloc's
are done when they are done to permit swapping, which
leaves sockets and templates as the major bugaboos.

I personally do not think that that is worth it: the
architecture you are suggesting is a strawman, and it
represents a poor design for scaling, unless you are
going to bite the bullet and use a 16M segmented AMD
processor to give yourself more KVA space.

For slow connections, you can delay instantiation of the
actual socket; Ashutosh suggested this a short time ago.
In fact, the OpenBSD, NetBSD, and BSDI code all support
this today, in the form of a "SYN cache"; in Ashutosh's
suggestion, he wanted to be somewhat more aggressive;
instead of caching the SYN until the first ACK, he
suggested caching the SYN, ACK, and SYNACK until the
first data.

Note that a "SYN" cache was intended to aid in doing
load-shedding and increasing the resistance to the
SYN-flood attacks: the existing implementations were
intended for those reasons.  The more aggressive method
proposed allows load scaling.  Either approach increases
latency, but at those load levels, you probably care more
about scaling.

There are also other, more modern techniques.

Ashutosh implied that the NetScaler box does layer 2
forwarding (this is not the correct technical name for
it); from their description of their "Patent Pending IMP
technology" (for which I think it's possible to
demonstrate prior art back as far as 1996: the technical
reports are available on the web), they really need to
do connection aggregation, which can't be done, without
locally terminating the TCP endpoints.

I think their "millions of connections" equals a number
of boxes ganged together to get to that level, or they
have purpose-built hardware to do the work; perhaps
that's why they are supposedly "hurting", though I've
seen no evidence of that (or against it), unless their
job listings are meaningful.

There's code to do much of this already (much of it from
commercial work that's already completed), and a lot of
it is going to find its way back into FreeBSD, if FreeBSD
wants it.
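
For readers who have not seen a SYN cache, the idea mentioned
above (keep only a small record per half-open connection and
allocate the real socket when the handshake ACK arrives) can be
sketched roughly as follows.  This is a userspace model in
Python, not the OpenBSD, NetBSD, or BSDI code: the names
SynCache, SynCacheEntry, and FullSocket are hypothetical, the
flat table with oldest-first eviction is an assumption, and real
implementations also deal with hash buckets, timers, and
retransmission.

```python
from dataclasses import dataclass
from typing import Dict, Optional, Tuple

# (src_ip, src_port, dst_ip, dst_port); hypothetical key layout
ConnKey = Tuple[str, int, str, int]
SEQ_MASK = 0xFFFFFFFF  # TCP sequence numbers wrap at 2**32


@dataclass
class SynCacheEntry:
    """Small record kept instead of a full socket (hypothetical)."""
    client_isn: int  # initial sequence number from the client's SYN
    server_isn: int  # initial sequence number sent in our SYN-ACK


@dataclass
class FullSocket:
    """Stand-in for the expensive socket/control-block allocation."""
    key: ConnKey
    rcv_nxt: int
    snd_nxt: int


class SynCache:
    def __init__(self, max_entries: int) -> None:
        self.max_entries = max_entries
        # dicts preserve insertion order, so the first key is the oldest
        self.entries: Dict[ConnKey, SynCacheEntry] = {}

    def on_syn(self, key: ConnKey, client_isn: int, server_isn: int) -> None:
        """Record a SYN; shed load by evicting the oldest entry when full."""
        if key not in self.entries and len(self.entries) >= self.max_entries:
            oldest = next(iter(self.entries))
            del self.entries[oldest]
        self.entries[key] = SynCacheEntry(client_isn, server_isn)

    def on_ack(self, key: ConnKey, seq: int, ack: int) -> Optional[FullSocket]:
        """Promote to a full socket only if the ACK completes the handshake."""
        entry = self.entries.get(key)
        if entry is None:
            return None
        expected_seq = (entry.client_isn + 1) & SEQ_MASK
        expected_ack = (entry.server_isn + 1) & SEQ_MASK
        if seq != expected_seq or ack != expected_ack:
            return None  # not a valid handshake ACK; keep the entry
        del self.entries[key]
        return FullSocket(key, rcv_nxt=seq, snd_nxt=ack)
```

In this model, the more aggressive variant would leave the entry
in the table after a valid ACK and build the FullSocket only
when the first data segment shows up.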

There are one or two experimental Linux versions that do
it as well, for which I've only seen the technical
reports, and for which the authors are being really
circumspect on releasing the code (if you were an academic
who wanted to make money after demonstrating a good idea,
and thought implementation would be a barrier to
competition, you'd publish to get venture funding, but
port the code to some place you didn't have to give source
out, too).

The really fundamental problems with FreeBSD at this point
devolve down to some moderately easily repaired historical
artifacts in its VM architecture and allocation techniques
and policies, as well as administrative limits for "general
purpose" use being the defaults, with no way to "autotune"
based on workload.  Most of the fixes have been known in
the literature since the early and mid 1990's (though some
are more recent).

Even if you "autotuned", you would run into the default
administrative limits that most people would be unhappy
changing, since it would make the system very poor
interactively under a heterogeneous workload.

Most of the tuning that people seem to want is for
homogeneous workloads, where there are a small number of
programs, but maybe a large number of instances.  Things
in this category include benchmarks, Apache or mail
servers, etc. -- role based dedicated boxes, for which the
administrative limits make considerably less sense.

Right now, to get those, you have to know what you are
doing and what works and why, and tune.

-- Terry