Date: Thu, 21 Nov 2002 17:09:30 -0800
From: Terry Lambert
To: Nate Lawson
Cc: hackers@freebsd.org
Subject: Re: Changing socket buffer timeout to a u_long?
Message-ID: <3DDD83CA.4A910E59@mindspring.com>

Nate Lawson wrote:
> On Thu, 21 Nov 2002, Terry Lambert wrote:
> > FWIW: upping the roll-over rate is not a good reason to increase
> > the size of fields, unless you want to increase the TCP sequence
> > number field to 64 bits? ...it has exactly the same issues at
> > high data rates.
>
> That's what the timestamp option does and I think it was a good idea,
> given the range of systems TCP needs to work well on.

Setting your HZ to 100,000 instead of 100, and then complaining
because a timer field with a resolution specified in ticks instead
of an interval length can't handle a value which is way too large
for a fast transport seems a bit silly to me.
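To put rough numbers on the overflow being argued about, here is a small arithmetic sketch. The field widths (16-bit and 32-bit signed) and the helper name `max_timeout_seconds` are assumptions made for illustration; they are not taken from any particular FreeBSD source tree.

```python
# Illustrative arithmetic only. The field widths used below are
# assumptions for the example, not a statement about kernel structures.

def max_timeout_seconds(bits: int, hz: int) -> float:
    """Longest timeout a signed field of `bits` bits can hold
    when the stored value is a tick count (seconds * HZ)."""
    max_ticks = (1 << (bits - 1)) - 1
    return max_ticks / hz

if __name__ == "__main__":
    for hz in (100, 1_000, 100_000):
        for bits in (16, 32):
            limit = max_timeout_seconds(bits, hz)
            print(f"HZ={hz:>7}  {bits}-bit signed field: {limit:,.3f} s")
```

Under these assumptions, a 16-bit signed tick count covers about 327 seconds at HZ=100 but only about a third of a second at HZ=100,000, which is the kind of overflow that raising HZ provokes.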
Call me crazy, but the timer field should not be in ticks; in fact,
the timer field really should not exist, per se. It should be on a
fixed interval timer queue, instead of linked into a callout wheel,
and then if it fires, it fires, along with every other timer of that
interval.

At the very least, if you are going to crank the HZ so that things
you multiply by HZ overflow their fields, maybe it's time to scale
those fields by some factor in addition to HZ, rather than bloating
everything?

The thing is already an int in -current; making it larger makes no
sense at all to me, unless you are being paid to screw over FreeBSD
by decreasing the high end load it can scale to, for no good reason.
Unless you have a good reason these fields should not be scaled in
terms of MSL instead of HZ ticks, for example?

When I was originally chasing 1,000,000 simultaneous TCP connections
on a single 4G RAM FreeBSD box, one of the biggest and most obvious
bottlenecks that I never dealt with was that when FreeBSD moved from
the historical fixed interval timer list code to the callout wheel,
it really screwed over the TCP timer code big time: the overhead went
way, way up, and "just increase the size of the callout wheel" only
works up to the point where "entries * 2 * HZ > MSL".

Eventually, I got to the point I could support 1.6M simultaneous TCP
connections on a single FreeBSD box with 4G of RAM -- 800,000 load
balanced clients against a back end server farm, if the data was
simply switched through at L4 -- but most of the excess time that
was in the code was in the timer code, traversing obvious misses
through the wheel lists on each timer firing. This was because the
lists were not -- *could not be* -- ordered, such that you could
stop traversing on the first "later than now" entry, because the
lists were not fixed interval (as they were in older releases of
BSD).
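The ordering argument can be shown with a toy model. This is a sketch only, not the FreeBSD callout code: the class names `FixedIntervalQueue` and `CalloutWheel`, the tick values, and the "entries examined" counter are all hypothetical and exist just to make the traversal cost visible.

```python
from collections import deque

class FixedIntervalQueue:
    """Every timer on this queue has the same interval, so the order in
    which timers are armed is also the order in which they expire."""

    def __init__(self, interval):
        self.interval = interval
        self.queue = deque()  # (expiry_tick, name), sorted by construction

    def arm(self, now, name):
        self.queue.append((now + self.interval, name))

    def fire(self, now):
        """Return (names fired, number of entries examined)."""
        fired, examined = [], 0
        while self.queue:
            examined += 1
            expiry, name = self.queue[0]
            if expiry > now:
                break  # first "later than now" entry: stop traversing
            self.queue.popleft()
            fired.append(name)
        return fired, examined

class CalloutWheel:
    """Toy hashed wheel: a timer lands in bucket (expiry % size), so one
    bucket mixes near and far expiries in no particular order."""

    def __init__(self, size):
        self.size = size
        self.buckets = [[] for _ in range(size)]

    def arm(self, now, delay, name):
        expiry = now + delay
        self.buckets[expiry % self.size].append((expiry, name))

    def tick(self, now):
        """Return (names fired, number of entries examined)."""
        bucket = self.buckets[now % self.size]
        fired, keep = [], []
        for expiry, name in bucket:  # every entry is visited, hit or miss
            if expiry <= now:
                fired.append(name)
            else:
                keep.append((expiry, name))
        self.buckets[now % self.size] = keep
        return fired, len(bucket)

if __name__ == "__main__":
    q = FixedIntervalQueue(interval=4)
    for t in range(4):
        q.arm(now=t, name=f"conn{t}")
    print("fixed interval:", q.fire(now=4))

    w = CalloutWheel(size=4)
    for i, delay in enumerate((4, 8, 12, 16)):
        w.arm(now=0, delay=delay, name=f"conn{i}")
    print("wheel bucket:  ", w.tick(now=4))
```

In this model the fixed interval queue stops at the first entry that is not yet due, while the wheel has to walk every entry hashed into the current bucket, including the ones that belong to later revolutions.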
The crap doesn't scale, and piling more crap on top of it, at the
added expense of making it scale *even worse*, is not the way to fix
the problem.

PS: Adding *any* TCP options is bad karma for networking equipment;
the cost in terms of in-transit overhead is immense, if you are
trying to use the code later to build a switch or a load balancer.
Doing that sort of thing is fine -- as long as you know beforehand
that what you are doing is making the code less general purpose, and
everyone buys into that idea.

-- Terry