Message-ID: <35040C37.4D578C02@challenge.isvara.net>
Date: Mon, 09 Mar 1998 15:35:20 +0000
From: freebsd@isvara.net
To: FreeBSD Hackers
Subject: Re: kernel wishlist for web server performance
References: <199803090521.WAA04154@pencil-box.village.org>

Warner Losh wrote:

> Then don't you lose performance setting up two DMA transfers for the
> packet? Or do most cards have scatter/gather operations for
> transmitting packets?

Most (if not all) newer NICs that support DMA transfer also support
scatter/gather lists. Another nice feature is protocol checksumming in
hardware.

It's a shame that all these tweaks for the internet stack (such as
copy avoidance and h/w checksum support, plus the list below) aren't
being worked on.

Cheers,
Dan

chuckbo@garnet.vnd.tek.com (Chuck Bolz) writes:
>I'm getting ready to "tune" a TCP/IP stack based on 4.3BSD with
>numerous 4.4BSD enhancements. I've been testing an echo server
>at 100 Mbps, and preliminary profiling indicates the following
>breakdown: 50% of CPU time in socket code, 40% in TCP/IP code,
>and the remainder in the driver/interrupt stack. This is a lot
>of code to analyze!

This note gave me an excuse to sit down and write up the known
improvements to TCP/UDP/IP performance that have not yet worked their
way into the standard 4.3/4.4 BSD sources. It takes the form of a
list. Comments on other known improvements are appreciated -- this
list is off the top of my head and could use enhancement.

Some of these improvements exist freely (for instance, Steve Pink and
I have the sosend() and soreceive() and combined copy/cksum stuff for
x386 processor and ought to get them to the FreeBSD and NetBSD folks).

Craig

Improvement: Replace sosend()
Performance Benefit: 5% (see Pink&Partridge 1994) + enables other
improvements

Sosend() is this horrendously complex bit of code that tries to figure
out how the lower layer wants its data laid out and then tries to put
the data being sent in that form. In almost every case, the lower
layer protocol could do the job faster and more simply (faster because
it knows its requirements, more simply because it doesn't have to test
for a whole bunch of cases, and thus the code is more compact and has
fewer branches).

Done wrong, this change requires rewriting the send code for all
protocols. Done simply, you just add a pr_sosend entry in the protosw
structure and set it to sosend() unless there's a protocol-specific
routine.

NOTE: This change is a prerequisite for some other performance
improvements (such as combined checksum/copy) because sosend() is
where data is copied from user space into the kernel.

Improvement: Replace soreceive()
Performance Benefit: Minor (< 1%) but enables benefits below

You can simplify soreceive() very slightly by making it protocol
specific, like sosend(). More important, you enable a bunch of
improvements in memory handling.

Improvement: Reduce data copies
Performance Benefit: Large (10%-25% -- results vary; see
Partridge&Pink 94)

Currently TCP touches its data 3 times, UDP 2 times, on transmission,
and similar numbers on receipt.
In both cases, the count should be 1 (or 0, with hardware assist).

There are two necessary steps here, both easy. The first is to create
a kernel copy routine (typically a version of uiomove() and
copyin()/copyout()) that computes the Internet checksum of the data
being copied while doing the copy. Then use this routine in the
protocol-specific sosend() and soreceive() to move data in and out of
the kernel. This change reduces UDP to one copy and TCP to two copies.

To reduce TCP to one copy, you need to make sure the device driver
doesn't delete the TCP data when a segment is transmitted, so you can
point to the same data when retransmitting.

To get to zero copies, you need hardware checksumming (done when
DMAing to the interface).

NOTE: Many of these benefits can also be achieved using Copy-On-Write
-- you mark application buffers COW and then don't have to copy them.
You still, however, need to checksum them, so unless there's hardware
checksum support, you still scan the data once.

Improvement: Delete IP header checksum call to in_cksum()
Performance Benefit: 2% to 8% (depends on packet size and processor -
P&P 94)

The IP output code calls in_cksum() to compute the IP header checksum.
Since the header checksum requires only 14 instructions (without any
conditionals) to compute, this is silly (you'll burn several times 14
instructions calling in_cksum(), plus harm code locality). Better to
do the checksum in ip_output(). Ditto on input in ipintr().

Improvement: Delete IP interrupt
Performance Benefit: never measured, estimated to be 20%+

On the inbound side, the networking code goes through two software
interrupts, one for IP processing and one for socket processing. Given
the high cost of doing the interrupt, the IP processing interrupt
should go away -- IP and partial TCP processing should just be done at
board interrupt level, then a single interrupt to the socket layer
should be made to complete TCP processing.
Van Jacobson has done preliminary work here but never got it to the
point of distribution.

Improvement: Get A Better Compiler
Performance Benefit: 10% plus

There's evidence that compilers that can relocate code segments and
adjust branches based on actual profiles (so-called Profiler Based
Optimization) can easily give 10% performance improvements. Various
folks have also done fancier reworking of binary layouts by hand and
gotten even better results. (Work at Arizona and UC, I believe.)

Improvement: Fix PCB lookup
Performance Benefit: 5% or more

Two issues here. First, the PCB caches don't work well, especially for
UDP. Second, in_pcblookup() is a linear search -- it should be a hash
table (see McKenney's paper in SIGCOMM '90).

I'd love to see some of these tuneups appear in the internet stack, as
they would be of great benefit to pretty much all network-related
servers.

Dan
_____________________________________
Daniel J Blueman
BSc Computation, UMIST, Manchester
Email: blue@challenge.isvara.net
Web: http://www.challenge.isvara.net/
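
For readers who want to see what the combined copy/checksum routine
from the "Reduce data copies" item might look like, here is a minimal
user-space sketch in C. It is not the Pink/Partridge code and not the
BSD kernel's uiomove() or copyin(); the name copy_and_cksum() and its
signature are made up for illustration. It copies a buffer and
accumulates the 16-bit one's-complement Internet checksum in the same
pass, so the data is touched once rather than twice. Bytes are summed
as big-endian 16-bit words and the result is returned in host order.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/*
 * Hypothetical helper (not a real kernel API): copy len bytes from src
 * to dst and return the Internet checksum of the copied data, computed
 * in the same loop as the copy.
 */
uint16_t copy_and_cksum(void *dst, const void *src, size_t len)
{
    const uint8_t *s = src;
    uint8_t *d = dst;
    uint64_t sum = 0;   /* wide accumulator, folded to 16 bits at the end */

    assert(len == 0 || (dst != NULL && src != NULL));

    /* Copy two bytes at a time and add them as one big-endian word. */
    while (len >= 2) {
        d[0] = s[0];
        d[1] = s[1];
        sum += ((uint32_t)s[0] << 8) | s[1];
        s += 2;
        d += 2;
        len -= 2;
    }

    /* A trailing odd byte is treated as if padded with a zero byte. */
    if (len == 1) {
        d[0] = s[0];
        sum += (uint32_t)s[0] << 8;
    }

    /* Fold the carries back in (end-around carry). */
    while (sum >> 16)
        sum = (sum & 0xffff) + (sum >> 16);

    /* The checksum is the one's complement of the folded sum. */
    return (uint16_t)~sum;
}
```

For the eight bytes 00 01 f2 03 f4 f5 f6 f7 this returns 0x220d. A
real kernel version would work on mbufs and user-space addresses and
would normally be hand-tuned per processor; the sketch only shows why
one pass over the data is enough.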