Date: Fri, 01 Feb 2002 03:52:29 -0800
From: Terry Lambert
To: Luigi Rizzo
Cc: Mike Silbersack, Storms of Perfection, thierry@herbelot.com, replicator@ngs.ru, hackers@FreeBSD.org
Subject: Re: Clock Granularity (kernel option HZ)

Luigi Rizzo wrote:
> On Thu, Jan 31, 2002 at 04:59:31PM -0800, Terry Lambert wrote:
> > You will get a factor of 6 (approximately) improvement in
> > throughput vs. overhead if you process packets to completion
> > at interrupt, and process writes to completion at write time
> > from the process.
>
> this does not match my numbers. e.g. using "fastforwarding"
> (which bypasses netisrs's) improves peak throughput
> by a factor between 1.2 and 2 on our test boxes.

This isn't the same thing; you are measuring something that is
affected, and something that isn't.  I'm measuring pool retention
time in the HW intr to NETISR queue transfer.  I'm talking about
the latency in generating the SYN and the ACK on one side, and the
SYN-ACK on the other, when going all the way to a user space
application.

Basically, most of the latency in a TCP connection is the latency
of waiting for the NETISR to process the packets from the receive
queue through the stack, and then the context switch to the user
space process.  The improvement is in throughput vs. overhead: the
amount of time you wait for the NETISR to run is, on average, half
the time between runs, which is HZ dependent.

The "between 1.2 and 2" is what you'd expect for the packet
processing alone.  But for an application like a web server with
1K of static content -- where there is a connection, an accept, the
request (client write), the server read, the server write, the
client read, and then the FIN/FIN-ACK/ACK -- you'd expect 1.5 x 2
for both ends = 3, and, if you could do the write path as well, you
could expect 6 (you can't really do the write path, because it's
process driven).

I was thinking about this with an FTP or SMTP server, where you
could piggyback the request data on the ACK for the SYN-ACK from
the client to the server, but it's not incredibly practical.

Like I said, this isn't a useful improvement in any case, unless
you are running yourself out of memory, and you are much more
likely to be doing that in the socket buffers.  It's not going to
increase your overall throughput in anything but the single-client
case, or the connect-and-drop connections-per-second
microbenchmark.
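To put rough numbers on that HZ dependence, here is a minimal
user-space sketch (not kernel code; the six-packet count per
transaction is an illustrative assumption, not a measured value) of
the average delay a packet sits in the queue waiting for the NETISR
softint to run:

/*
 * Back-of-the-envelope sketch, not kernel code.  A packet queued for
 * the NETISR softint waits, on average, half a clock tick before the
 * softint runs.  The traversal count below is an assumed figure for a
 * short HTTP transaction (SYN, ACK, request, ACK of reply, FIN, ACK),
 * not a measured value.
 */
#include <stdio.h>

int
main(void)
{
	int hz_values[] = { 100, 1000 };
	int netisr_traversals = 6;	/* assumed inbound packets per transaction */
	int i;

	for (i = 0; i < 2; i++) {
		double tick_ms = 1000.0 / hz_values[i];
		double avg_wait_ms = tick_ms / 2.0;	/* mean residual tick */

		printf("HZ=%d: %.2f ms average wait per packet, "
		    "~%.1f ms queueing per transaction\n",
		    hz_values[i], avg_wait_ms,
		    avg_wait_ms * netisr_traversals);
	}
	return (0);
}

With those assumed packet counts, HZ=100 gives about 5 ms per
traversal (roughly 30 ms of pure queueing per short transaction),
and HZ=1000 drops that to about 3 ms total -- which is where the
apparent win from upping HZ comes from.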
I haven't set up equipment to test the connections-per-second rate
on gigabit using the SYN cache.  I know that by processing the
incoming SYN to completion (all the way through the stack, without
a cache) at interrupt, it goes from ~7,000 per second on a Tigon III
to ~22,000 per second (and 28,000 on a Tigon III).  I rather expect
the SYN cache to eat up any measurable gains you could have gotten
by upping the HZ -- again, unless you are running out of memory.

If you wanted to get around 400,000 connections per second, I think
I could get you there with some additional hack-foolery, but of
course it's not really a useful metric, IMO.  Total number of
simultaneous connections is much more useful, in the long run, since
that's what arbitrates your real load limits.

If you wanted to hack that number higher, that's pretty easy, too.
One way would be the one suggested on the -arch list a while back,
being even more aggressive: turn the SYN cache into a connection
cache, and don't fully instantiate it, even after the ACK, until you
get first data.  Another way is that there are a lot of elements in
the socket structure that are never used simultaneously, and could
be reduced via a union.  Yet another way would be to reduce the
kqueue overhead by putting the per-object queues into the same
bucket, instead of having so many TAILQ structures floating around.
A final way would be to change the zone allocator to allocate on a
sizeof(long) boundary, which for 1,000,000 connections saves a good
128M of memory at one shot.  There's a lot of low-hanging fruit.

Frankly, all the interesting applications have CPU overhead
involved, so the trade-off on CPU overhead from upping the HZ value
is probably a bad trade anyway (I hinted at that earlier).

-- Terry
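As a rough illustration of where a figure like 128M can come from,
here is a sketch of the rounding arithmetic; the object size and the
coarser rounding granularity are hypothetical values chosen only to
show the scaling, not the actual zone allocator behaviour:

/*
 * Illustrative arithmetic only: the object size and the coarser
 * rounding granularity are made-up values, not the actual FreeBSD
 * zone allocator behaviour.  The point is how per-object rounding
 * waste scales across a million connections.
 */
#include <stdio.h>

/* round sz up to a multiple of align */
static size_t
roundup_to(size_t sz, size_t align)
{
	return ((sz + align - 1) / align * align);
}

int
main(void)
{
	size_t nconn = 1000000;		/* one million connections */
	size_t objsize = 384;		/* hypothetical per-connection object */
	size_t coarse = 512;		/* hypothetical power-of-2 rounding */
	size_t fine = sizeof(long);	/* proposed sizeof(long) boundary */
	size_t waste;

	waste = roundup_to(objsize, coarse) - roundup_to(objsize, fine);
	printf("%zu bytes wasted per object, ~%zu MB across %zu connections\n",
	    waste, waste * nconn / 1000000, nconn);
	return (0);
}

With these made-up sizes, the coarser rounding wastes 128 bytes per
object, which works out to roughly 128 MB across a million
connections.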