Date: Thu, 26 Feb 1998 23:26:20 +0100
From: sthaug@nethelp.no
To: mike@smith.net.au
Cc: hackers@FreeBSD.ORG
Subject: Re: "Best" Fast Ethernet Card
Message-ID: <27484.888531980@verdi.nethelp.no>
In-Reply-To: Your message of "Wed, 25 Feb 1998 18:44:42 -0800"
References: <199802260244.SAA21962@dingo.cdrom.com>

> > One *great* bonus is it will do IP, TCP and UDP checksums automagically
> > in hardware!
>
> Oh great. This card was designed *explicitly* for Windows systems,
> where they think it's funny for the network adapter driver to know
> enough about the protocol layer to manage junk like this.

Probably not. More likely it was simply meant to give lower CPU usage, given the right modifications to the TCP/IP stack. If you check the new Gigabit Ethernet cards that are becoming available, you'll find *most* of them will do IP checksum on-chip.

I've included below a recent Usenet article by Craig Partridge which explains some of the things that can be done to speed up BSD TCP/IP. You'll note that he explicitly mentions hardware checksums.

Steinar Haug, Nethelp consulting, sthaug@nethelp.no
----------------------------------------------------------------------
From: craigp@world.std.com (Craig Partridge)
Subject: Re: BSD TCP/IP stack code; performance improvement
Message-ID: <ELLvF4.2F9@world.std.com>
Date: Mon, 22 Dec 1997 19:28:15 GMT

chuckbo@garnet.vnd.tek.com (Chuck Bolz) writes:

>I'm getting ready to "tune" a TCP/IP stack based on 4.3BSD with
>numerous 4.4BSD enhancements. I've been testing an echo server
>at 100 Mbps, and preliminary profiling indicates the following
>breakdown: 50% of CPU time in socket code, 40% in TCP/IP code,
>and the remainder in the driver/interrupt stack. This is a lot
>of code to analyze!

This note gave me an excuse to sit down and write up a little note about known improvements to TCP/UDP/IP performance that have not yet worked their way into the standard 4.3/4.4 BSD sources.

This note takes the form of a list of known improvements. Comments on other known improvements are appreciated -- this list is off the top of my head and could use enhancement.

Some of these improvements exist freely (for instance, Steve Pink and I have got the sosend() and soreceive() and combined copy/cksum stuff for x386 processor and ought to get them to the FreeBSD and NetBSD folks).

Craig

Improvement: Replace sosend()
Performance Benefit: 5% (see Pink&Partridge 1994) + enables other improvements

Sosend() is this horrendously complex bit of code that tries to figure out how the lower layer wants its data laid out and then tries to put the data being sent in that form. In almost every case, the lower layer protocol could do the job faster and more simply (faster because it knows its requirements, more simply because it doesn't have to test for a whole bunch of cases, and thus the code is more compact and has fewer branches).

Done wrong, this change requires rewriting the send code for all protocols. Done simply, you just add a pr_sosend entry in the protosw structure and set it to sosend() unless there's a protocol-specific routine.

NOTE: This change is a prerequisite for some other performance improvements (such as combined checksum/copy) because sosend() is where data is copied from user space into the kernel.

Improvement: Replace soreceive()
Performance Benefit: Minor (< 1%) but enables benefits below

You can simplify soreceive() very slightly by making it protocol specific like sosend(). More important, you enable a bunch of improvements in memory handling.

Improvement: Reduce data copies
Performance Benefit: Large (10%-25% -- results vary; see Partridge&Pink 94)

Currently TCP touches its data 3 times, UDP 2 times, on transmission, and similar numbers on receipt. In both cases, the count should be 1 (or 0, with hardware assist).

There are two necessary steps here, both easy. The first is to create a kernel copy routine (typically a version of uiomove() and copyin()/copyout()) that computes the Internet checksum of the data being copied, while doing the copy.
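
As an illustration of the combined copy/checksum idea, here is a minimal user-space sketch in C. It is not the Pink/Partridge code and not a kernel routine: copy_and_sum() and cksum_fold() are hypothetical names, and a real version would be built around uiomove() or copyin()/copyout() with fault handling and a word-at-a-time loop. The point is only that the ones'-complement sum is accumulated in the same pass that moves the bytes.

```c
#include <stddef.h>
#include <stdint.h>

/*
 * Hypothetical helper: copy len bytes from src to dst and, in the same
 * pass, add the data to a running ones'-complement sum. The data is
 * treated as big-endian 16-bit words; an odd trailing byte is padded
 * with a zero low byte. A 64-bit accumulator avoids overflow for any
 * realistic length.
 */
static uint64_t copy_and_sum(uint8_t *dst, const uint8_t *src,
                             size_t len, uint64_t sum)
{
    size_t i;

    for (i = 0; i + 1 < len; i += 2) {
        dst[i] = src[i];
        dst[i + 1] = src[i + 1];
        sum += ((uint32_t)src[i] << 8) | src[i + 1];
    }
    if (i < len) {                      /* odd trailing byte */
        dst[i] = src[i];
        sum += (uint32_t)src[i] << 8;
    }
    return sum;
}

/* Fold the accumulated sum to 16 bits and complement it. */
static uint16_t cksum_fold(uint64_t sum)
{
    while (sum >> 16)
        sum = (sum & 0xffff) + (sum >> 16);
    return (uint16_t)~sum;
}
```

A caller would pass 0 as the initial sum (or the pseudo-header sum for TCP/UDP), call copy_and_sum() once per buffer, and fold at the end.
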
Then use this routine in the protocol-specific sosend() and soreceive() to move data in and out of the kernel. This change reduces UDP to one copy and TCP to two copies.

To reduce TCP to one copy, you need to make sure the device driver doesn't delete the TCP data when a segment is transmitted, so you can point to the same data when retransmitting.

To get to zero copies, you need hardware checksumming (done when DMAing to the interface).

NOTE: Many of these benefits can also be achieved using Copy-On-Write -- you mark application buffers COW and then don't have to copy them. You still, however, need to checksum them, so unless there's hardware checksum support, you still scan the data once.

Improvement: Delete IP header checksum call to in_cksum()
Performance Benefit: 2% to 8% (depends on packet size and processor - P&P 94)

The IP output code calls in_cksum() to compute the IP header checksum. Since the header checksum requires only 14 instructions (without any conditionals) to compute, this is silly (you'll burn several times 14 instructions calling in_cksum(), plus harm code locality). Better to do the checksum in ip_output(). Ditto on input in ipintr().

Improvement: Delete IP interrupt
Performance Benefit: never measured, estimated to be 20%+

On the inbound side, the networking code goes through two software interrupts, one for IP processing and one for socket processing. Given the high cost of doing the interrupt, the IP processing interrupt should go away -- IP and partial TCP processing should just be done at board interrupt level, then a single interrupt to the socket layer should be made to complete TCP processing. Van Jacobson has done preliminary work here but never gotten it to the point of distribution.
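
For the IP header checksum item above, a rough sketch of what an inlined checksum might look like in C, assuming a 20-byte header with no options (IHL = 5). The names ip_hdr_cksum(), ip_hdr_valid() and sample_hdr are hypothetical and not taken from the BSD sources, and the sketch makes no claim to hit the 14-instruction figure; it only shows that the sum over ten fixed 16-bit words needs no loop and no branches.

```c
#include <assert.h>
#include <stdint.h>

/* Big-endian 16-bit word i of the header. */
#define HDR_W(h, i) (((uint32_t)(h)[2 * (i)] << 8) | (h)[2 * (i) + 1])

/*
 * Hypothetical helper: checksum of an option-less IPv4 header.
 * Word 5 (the checksum field itself) is skipped, i.e. treated as zero.
 * The result is in host order, to be stored big-endian in bytes 10-11.
 */
static uint16_t ip_hdr_cksum(const uint8_t hdr[20])
{
    uint32_t sum;

    assert((hdr[0] & 0x0f) == 5);       /* assumption: no IP options */

    sum = HDR_W(hdr, 0) + HDR_W(hdr, 1) + HDR_W(hdr, 2) + HDR_W(hdr, 3) +
          HDR_W(hdr, 4) + HDR_W(hdr, 6) + HDR_W(hdr, 7) + HDR_W(hdr, 8) +
          HDR_W(hdr, 9);
    sum = (sum & 0xffff) + (sum >> 16); /* two folds always suffice */
    sum = (sum & 0xffff) + (sum >> 16);
    return (uint16_t)~sum;
}

/* Input side: a header with a correct checksum sums to 0xffff. */
static int ip_hdr_valid(const uint8_t hdr[20])
{
    uint32_t sum = HDR_W(hdr, 5) + (uint16_t)~ip_hdr_cksum(hdr);

    sum = (sum & 0xffff) + (sum >> 16);
    return sum == 0xffff;
}

/* Illustrative header only: 192.168.0.1 -> 192.168.0.199, UDP, TTL 64. */
static const uint8_t sample_hdr[20] = {
    0x45, 0x00, 0x00, 0x73, 0x00, 0x00, 0x40, 0x00, 0x40, 0x11,
    0xb8, 0x61, 0xc0, 0xa8, 0x00, 0x01, 0xc0, 0xa8, 0x00, 0xc7
};
```

For the sample header, ip_hdr_cksum(sample_hdr) yields 0xb861, which matches the value stored in bytes 10-11, so ip_hdr_valid(sample_hdr) is true.
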

Improvement: Get A Better Compiler
Performance Benefit: 10% plus

There's evidence that compilers that can relocate code segments and adjust branches based on actual profiles (so called Profiler Based Optimization) can easily give 10% performance improvements. Various folks have also done fancier reworking of binary layouts by hand and gotten even better results. (Work at Arizona and UC I believe)

Improvement: Fix PCB lookup
Performance Benefit: 5% or more

Two issues here. First, the PCB caches don't work well, especially for UDP. Second, in_pcblookup() is a linear search -- it should be a hash table (see McKenney's paper in SIGCOMM '90).

To Unsubscribe: send mail to majordomo@FreeBSD.org
with "unsubscribe freebsd-hackers" in the body of the message