From: Terry Lambert <tlambert2@mindspring.com>
Date: Fri, 13 Jul 2001 13:03:59 -0700
To: Leo Bicknell
Cc: freebsd-hackers@FreeBSD.ORG
Subject: Re: Network performance roadmap.

Leo Bicknell wrote:
>
> > > B) When the system runs out of MBUF's, really bad things happen.  It
> > >    would be nice to make the system handle MBUF exhaustion in a nicer
> > >    way, or avoid it.
> >
> > The easiest way to do this is to know ahead of time how many
> > you _really_ have.  Then bad things don't happen.
>
> Clearly not true.  The system knows how many it has today, at compile
> time in fact, and takes no steps to keep them from being exhausted.
> You'll notice I proposed a mechanism to keep them from being exhausted,
> a mechanism that degrades performance in a very gentle manner when the
> limit is reached.

I run a system where I not only allocate the page mappings, I also
allocate the mbufs, at boot time.  When I run out of mbufs, I do not
have "bad things happen".

The "bad things" are an artifact of memory overcommit; if you remove
the overcommit, so that the backing pages always exist, you never have
problems.  It is only when you have a page mapping, with no pages to
back it, that you have a problem.

In other words, the problem does not exist because of the number of
mbufs; it exists because you have a mapping for the page for the
allocation, and you do not ensure that there is a page backing it
before you give out an allocation from the map.

You could fix this in FreeBSD by making the zone allocator verify that
a backing page is mapped _before_ it gives out the allocation to the
caller, for interrupt zones, where the allocation is not permitted to
sleep by virtue of having been called in an interrupt handler.  It
could then fail the allocation, instead of returning an allocation in
a mapping for which there was not yet a backing page and, under
conditions of memory exhaustion (e.g. when your kernel config claims
more mbufs than you will have backing pages available to satisfy),
leaving it to the caller to touch the page and panic its little brains
out when it gets an unsatisfiable fault from memory which it believes
was successfully allocated.
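A rough sketch of the kind of check I mean, in the interrupt-time
("cannot sleep") allocation path.  The names and structure here are
made up for illustration; this is not the actual zone allocator code:

    #include <stddef.h>
    #include <stdbool.h>

    /*
     * Toy model only: never hand out an item unless a physical page
     * is already behind it; fail cleanly instead, since we may not
     * sleep to get a page at interrupt time.
     */
    struct zone_item {
            bool    resident;       /* backing page already wired? */
            bool    in_use;
            char    storage[256];
    };

    #define ZONE_NITEMS     128
    static struct zone_item zone[ZONE_NITEMS];

    void *
    zalloc_intr(void)
    {
            int i;

            for (i = 0; i < ZONE_NITEMS; i++) {
                    if (zone[i].in_use || !zone[i].resident)
                            continue;       /* skip unbacked items */
                    zone[i].in_use = true;
                    return (zone[i].storage);
            }
            /*
             * Zone exhausted, or no free item has a backing page:
             * return failure rather than memory the caller will
             * fault on later.
             */
            return (NULL);
    }

The caller then sees an ordinary allocation failure it can check for,
instead of a fault it cannot recover from in interrupt context.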
Alternately, you could configure your kernel with a small enough
number of mbufs that that situation never arises: don't lie to it
about how much RAM it really has by picking a number of mbufs so large
that, under the maximum dirty page load user processes can create, you
are going to be unable to satisfy the request.

The only other alternative is to force swap of dirty pages; to do
this, you would have to suspend the network interrupt, and not
re-enable incoming network interrupts (which will all require mbuf
allocations in the driver to refill the receive ring) until you've
recovered some pages.  This screws up (per my previous post) when you
are swapping over the network.  It also screws up when you have no
more local swap (e.g. both swap and memory have been overcommitted,
not just swap).

> > Socket buffers are set at boot time.  Read the code.  Same for
> > maximum number of connections: you can hop around until you
> > are blue in the face from typing "sysctl", but it will not
> > change the number of tcpcb's and inpcb's, etc..  This is an
> > artifact of the allocator.
>
> Right, and as I said before, these are not a limiting resource.
> The problem is not even a lack of MBUF's (i.e., we don't really need
> more); we just need to be more intelligent about how we use them
> per connection.  I'm curious where you got the impression that
> other things need to be changed.  None of the papers, including
> the ones you mention, suggest that other items need to be changed
> to support high bandwidth data connections.

By changing them.  I have servers that can support 1,000,000
concurrent connections.  They are based on FreeBSD running on 4GB
memory systems with two 1Gbit NICs.

This is why all the hand-waving and the suggestions for substantial
(and, from empirical practice, unnecessary) changes in the FreeBSD
stack are making me so leery.  This is also why I'm suggesting that it
be done in a research setting first, rather than applying the changes
to the mainline FreeBSD source tree and just assuming that they'll
work.

> > Having larger transmit windows is really dependent on the
> > type of traffic you expect to serve; in the HTTP case, the
> > studies indicate that the majority of objects served are
> > less than 8k in size.  Most browsers (except Opera) do
> > not support PIPELINING.
>
> So we should optimize for HTTP, and tell the people running
> FTP servers, or news servers, or home desktops sharing files
> with friends that "tough, we like big web servers"?

No.  I'm saying that you can't get away from tuning for expected load
without a hell of a lot of work, work that is not even being addressed
in the context of this discussion.

> Let's find a solution that works for all of the above.

That would be nice; first of all, you will need to get over your
aversion to working on kernel memory allocators (;-)), since the only
way to set things up for variable loads is to take away the fixed
nature of the allocations which are needed to tune for those loads.

You can't apply hysteresis when your allocations are type-stable, and
they "freeze" your memory in a given state for all time.  That's like
making a bunch of clay pots, throwing them, firing them, and then
deciding that what you really wanted was coffee mugs or a statue: once
the clay is fired, you are stuck with the pots.
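To make the contrast concrete, here is a rough sketch of what
hysteresis on a pool looks like; the names, watermarks, and numbers
are invented for illustration, not taken from the tree.  A type-stable
zone simply has no shrink path: once a page goes into the zone, it
holds that object type forever.

    /*
     * Illustration only: a pool that grows when free items fall below
     * a low watermark and gives pages back when free items climb above
     * a high watermark.  The "give back" branch is the step a
     * type-stable allocator can never take.
     */
    #define POOL_LOW_WATER   64     /* grow below this many free items */
    #define POOL_HIGH_WATER 512     /* shrink above this many free items */
    #define POOL_CHUNK       64     /* items added/removed per adjustment */

    struct pool {
            int     nfree;          /* items currently free */
            int     nitems;         /* items currently backed by pages */
    };

    static void
    pool_adjust(struct pool *p)
    {
            if (p->nfree < POOL_LOW_WATER) {
                    /* Demand is up: commit more pages to this pool. */
                    p->nitems += POOL_CHUNK;
                    p->nfree += POOL_CHUNK;
            } else if (p->nfree > POOL_HIGH_WATER) {
                    /*
                     * Demand fell off: release pages so a different
                     * load (a different object type) can use them.
                     */
                    p->nitems -= POOL_CHUNK;
                    p->nfree -= POOL_CHUNK;
            }
    }

With something like that in place, memory can move between object
types as the load changes, instead of staying frozen in whatever shape
it was tuned for at boot.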
> > Only after you have proven that some significant fraction
> > of traffic actually ends up hitting the window size limits,
> > should you make this change to FreeBSD proper.
>
> "Significant fraction" will change with the server you monitor.
> I'll bet, for instance, most all hub news servers hit the per
> window limit on every connection, as they are sending large
> streaming amounts of bulk data.  I bet FTP sites hit the problem
> for well more than 10% of their clients, as the people likely
> to download the 100 Meg demo of XYZ Shoot-Em-Up are unlikely
> to be on a modem.

Well, I never said to run it on one server type.  You are getting to
the point of needing empirical data on tuning parameters.  This is no
good.  You need the empirical data, but it should not be applied to
tuning parameters globally; it should be applied to them on a
case-by-case basis, per server installation.

The only way around this is to bite the bullet, and do the right
thing.  Failure to do that means that you are subject to denial of
service attacks based on your tuning parameters, so while you may run
OK in the case of needing a lot of HTTP connections with small
windows, someone can panic your system by advertising very large
windows and then giving you many 2MB HTTP requests.  Normal HTTP
requests are not that large, but your approach means that I can push
the window size up beyond what is normal, if I wish to beat up your
server to get it to run out of mbufs and crash.

> Again, there's a solution here that works for everyone.

If everyone on the internet plays nice, I will agree.

> > One good way to prevent this is to not unreasonably set
> > your window size... 8-p.
>
> Ah, I see, so to prevent MBUF exhaustion I should not let
> my socket buffers get large.  Sort of like to prevent serious
> injury in a car crash I should drive at 10MPH on the freeway.

Or 55MPH.  Or 65MPH.  Whatever your local limit is, it is also
administrative, and quite arbitrary.  Many cars are safe at much, much
faster speeds, as long as someone doesn't decide to drive at 50MPH in
the fast lane, so that your rate of closure is 70MPH+.

> Performance limits to save a system from crashing should be
> a last resort.

It should be the last resort.  But you will need to change things so
that it is _physically impossible_ for someone to drive 50MPH in the
fast lane, or physically impossible for a 70MPH collision with a
stationary object to cause damage.

-- Terry