From owner-freebsd-hackers  Mon Mar  9 07:37:38 1998
Return-Path: <owner-freebsd-hackers@FreeBSD.ORG>
Received: (from majordom@localhost)
          by hub.freebsd.org (8.8.8/8.8.8) id HAA00209
          for freebsd-hackers-outgoing; Mon, 9 Mar 1998 07:37:38 -0800 (PST)
          (envelope-from owner-freebsd-hackers@FreeBSD.ORG)
Received: from isvara.net (root@[130.88.148.77])
          by hub.freebsd.org (8.8.8/8.8.8) with ESMTP id HAA00196
          for <hackers@freebsd.org>; Mon, 9 Mar 1998 07:37:31 -0800 (PST)
          (envelope-from freebsd@challenge.isvara.net)
Received: from challenge.isvara.net ([130.88.66.5])
	by isvara.net (8.8.7/8.8.7) with ESMTP id PAA02838
	for <hackers@freebsd.org>; Mon, 9 Mar 1998 15:35:56 GMT
Message-ID: <35040C37.4D578C02@challenge.isvara.net>
Date: Mon, 09 Mar 1998 15:35:20 +0000
From: freebsd@isvara.net
X-Mailer: Mozilla 4.04 [en] (Win95; I)
MIME-Version: 1.0
To: FreeBSD Hackers <hackers@FreeBSD.ORG>
Subject: Re: kernel wishlist for web server performance
References: <Pine.BSF.3.95.980308083137.2799W-100000@alive.znep.com> <199803090521.WAA04154@pencil-box.village.org>
Content-Type: text/plain; charset=us-ascii
Content-Transfer-Encoding: 7bit
Sender: owner-freebsd-hackers@FreeBSD.ORG
Precedence: bulk
X-Loop: FreeBSD.ORG

Warner Losh wrote:

> Then don't you lose performance setting up two DMA transfers for the
> packet?  Or do most cards have scatter/gather operations for
> transmitting packets?

Most (if not all) newer NICs that support DMA transfers also support scatter/gather lists.
Another nice feature is protocol checksumming in hardware. It's a shame that these tweaks
to the internet stack (such as copy avoidance and h/w checksum support) aren't being worked
on, along with those in the list below:

Cheers,
    Dan

<nice_tweak_list>
chuckbo@garnet.vnd.tek.com (Chuck Bolz) writes:

>I'm getting ready to "tune" a TCP/IP stack based on 4.3BSD with
>numerous 4.4BSD enhancements.  I've been testing an echo server
>at 100 Mbps, and preliminary profiling indicates the following
>breakdown: 50% of CPU time in socket code, 40% in TCP/IP code,
>and the remainder in the driver/interrupt stack.  This is a lot
>of code to analyze!

This note gave me an excuse to sit down and write up a little note about
known improvements to TCP/UDP/IP performance that have not yet worked their
way into the standard 4.3/4.4 BSD sources.  This note takes the form of
a list of known improvements.

Comments on other known improvements are appreciated -- this list is off
the top of my head and could use enhancement.

Some of these improvements exist freely (for instance, Steve Pink and
I have the sosend(), soreceive(), and combined copy/cksum changes
for x86 processors and ought to get them to the FreeBSD and NetBSD folks).

Craig

Improvement: Replace sosend()
Performance Benefit: 5% (see Pink&Partridge 1994) + enables other improvements

    Sosend() is this horrendously complex bit of code that tries to figure
    out how the lower layer wants its data laid out and then tries to put
    the data being sent in that form.

    In almost every case, the lower layer protocol could do the job faster
    and more simply (faster because it knows its requirements, more simply
    because it doesn't have to test for a whole bunch of cases, and thus
    the code is more compact and has fewer branches).

    Done wrong, this change requires rewriting the send code for all
    protocols.  Done simply, you just add a pr_sosend entry in the
    protosw structure and set it to sosend() unless there's a
    protocol-specific routine.

    NOTE: This change is a prerequisite for some other performance
    improvements (such as combined checksum/copy) because sosend() is
    where data is copied from user space into the kernel.
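
    A minimal user-space sketch of the "done simply" approach described
    above: each protocol carries a send hook that is set to the generic
    routine unless the protocol supplies its own.  Every name ending in
    _sketch is a hypothetical stand-in, not a BSD kernel definition; only
    the idea of a pr_sosend entry in protosw comes from the note.

```c
#include <assert.h>
#include <stddef.h>

struct socket_sketch;

/* Signature assumed for illustration; the real sosend() takes more arguments. */
typedef int (*sosend_fn)(struct socket_sketch *so, const void *data, size_t len);

/* Stand-in for struct protosw with the added per-protocol send hook. */
struct protosw_sketch {
    const char *pr_name;
    sosend_fn   pr_sosend;
};

struct socket_sketch {
    const struct protosw_sketch *so_proto;
    unsigned so_generic_calls;   /* counters only exist to make the sketch observable */
    unsigned so_fast_calls;
};

/* Stand-in for the generic sosend(), which must handle every layout case. */
int generic_sosend_sketch(struct socket_sketch *so, const void *data, size_t len)
{
    (void)data;
    (void)len;
    so->so_generic_calls++;
    return 0;
}

/* Stand-in for a protocol-specific routine that knows its own requirements. */
int udp_sosend_sketch(struct socket_sketch *so, const void *data, size_t len)
{
    (void)data;
    (void)len;
    so->so_fast_calls++;
    return 0;
}

/* A protocol with its own routine, and one that keeps the generic default. */
const struct protosw_sketch udp_proto_sketch = { "udp", udp_sosend_sketch };
const struct protosw_sketch raw_proto_sketch = { "raw", generic_sosend_sketch };

/* The socket layer no longer calls sosend() directly; it goes through the hook. */
int socket_send_sketch(struct socket_sketch *so, const void *data, size_t len)
{
    assert(so != NULL && so->so_proto != NULL && so->so_proto->pr_sosend != NULL);
    return so->so_proto->pr_sosend(so, data, len);
}
```

    Because unconverted protocols keep pointing at the generic routine,
    they need no changes, and protocols can be converted one at a time.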

Improvement: Replace soreceive()
Performance Benefit: Minor (< 1%) but enables benefits below

    You can simplify soreceive() very slightly by making it protocol
    specific like sosend().  More importantly, you enable a bunch of
    improvements in memory handling.

Improvement: Reduce data copies
Performance Benefit: Large (10%-25% -- results vary; see Partridge&Pink 94)

    Currently TCP touches its data 3 times, UDP 2 times, on transmission,
    and similar numbers on receipt.  In both cases, the count should be 1
    (or 0, with hardware assist).

    There are two necessary steps here, both easy.

    The first is to create a kernel copy routine (typically a version
    of uiomove() and copyin()/copyout()) that computes the Internet
    checksum of the data being copied, while doing the copy.
    Then use this routine in the protocol specific sosend() and soreceive()
    to move data in and out of the kernel.  This change reduces UDP to one
    copy and TCP to two copies.
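
    A rough user-space sketch of such a combined routine, assuming plain
    memory-to-memory copying (a real kernel version would be built on
    uiomove() or copyin()/copyout() and would normally be hand-tuned per
    processor).  The name copy_and_cksum_sketch is hypothetical.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/*
 * Copy len bytes from src to dst and, in the same pass, compute the
 * Internet checksum (16-bit one's-complement sum, complemented) of the
 * data.  Bytes are paired as big-endian 16-bit words; an odd trailing
 * byte is padded with a zero byte.
 */
uint16_t copy_and_cksum_sketch(void *dst, const void *src, size_t len)
{
    const uint8_t *s = src;
    uint8_t *d = dst;
    uint64_t sum = 0;   /* wide accumulator so long buffers cannot overflow */

    assert(dst != NULL && src != NULL);

    while (len >= 2) {
        d[0] = s[0];
        d[1] = s[1];
        sum += ((uint32_t)s[0] << 8) | s[1];
        s += 2;
        d += 2;
        len -= 2;
    }
    if (len == 1) {
        d[0] = s[0];
        sum += (uint32_t)s[0] << 8;
    }

    /* Fold the carries back into the low 16 bits. */
    while (sum >> 16)
        sum = (sum & 0xffff) + (sum >> 16);

    return (uint16_t)~sum;
}
```

    The data is read once and written once; a separate in_cksum() pass
    over the buffer afterwards is no longer needed.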

    To reduce TCP to one copy, you need to make sure the device driver doesn't
    delete the TCP data when a segment is transmitted, so you can point to
    the same data when retransmitting.

    To get to zero copies, you need hardware checksumming (done when DMAing
    to the interface).

    NOTE: Many of these benefits can also be achieved using Copy-On-Write --
    you mark application buffers COW and then don't have to copy them.  You
    still, however, need to checksum them, so unless there's hardware checksum
    support, you still scan the data once.

Improvement: Delete IP header checksum call to in_cksum()
Performance Benefit: 2% to 8% (depends on packet size and processor - P&P 94)

    The IP output code calls in_cksum() to compute the IP header
    checksum.  Since the header checksum requires only 14 instructions
    (without any conditionals) to compute, this is silly (you'll burn
    several times 14 instructions calling in_cksum(), plus harm code
    locality).  Better to do the checksum inline in ip_output().  Ditto on
    input in ipintr().

Improvement: Delete IP interrupt
Performance Benefit: never measured, estimated to be 20%+

    On the inbound side, the networking code goes through two software
    interrupts, one for IP processing and one for socket processing.

    Given the high cost of doing the interrupt, the IP processing interrupt
    should go away -- IP and partial TCP processing should just be done
    at board interrupt level, then a single interrupt to the socket layer
    should be made to complete TCP processing.  Van Jacobson has done
    preliminary work here but never gotten it to the point of distribution.

Improvement: Get A Better Compiler
Performance Benefit: 10% plus

    There's evidence that compilers that can relocate code segments and
    adjust branches based on actual profiles (so-called Profile-Based
    Optimization) can easily give 10% performance improvements.

    Various folks have also done fancier reworking of binary layouts by
    hand and gotten even better results.  (Work at Arizona and UC I believe)

Improvement: Fix PCB lookup
Performance Benefit: 5% or more

    Two issues here.  First, the PCB caches don't work well, especially for
    UDP.

    Second, in_pcblookup() is a linear search -- it should be a hash table
    (see McKenney's paper in SIGCOMM '90).
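
    A small sketch of a hashed lookup in place of the linear search,
    assuming chained buckets keyed on the full address/port 4-tuple.  The
    structures, the hash function, and the table size are all illustrative
    assumptions, not the scheme from McKenney's paper, and wildcard
    (listening) PCBs are not handled here.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define PCB_HASH_SIZE_SKETCH 256   /* power of two; arbitrary for this sketch */

struct pcb_sketch {
    uint32_t laddr, faddr;         /* local and foreign addresses */
    uint16_t lport, fport;         /* local and foreign ports */
    struct pcb_sketch *hash_next;  /* next PCB in the same bucket */
};

struct pcb_table_sketch {
    struct pcb_sketch *bucket[PCB_HASH_SIZE_SKETCH];
};

/* Mix the 4-tuple down to a bucket index. */
unsigned pcb_hash_sketch(uint32_t faddr, uint16_t fport,
                         uint32_t laddr, uint16_t lport)
{
    uint32_t h = faddr ^ laddr ^ ((uint32_t)fport << 16) ^ lport;

    h ^= h >> 16;
    h ^= h >> 8;
    return h & (PCB_HASH_SIZE_SKETCH - 1);
}

/* Link a PCB at the head of its bucket's chain. */
void pcb_insert_sketch(struct pcb_table_sketch *t, struct pcb_sketch *p)
{
    unsigned i;

    assert(t != NULL && p != NULL);
    i = pcb_hash_sketch(p->faddr, p->fport, p->laddr, p->lport);
    p->hash_next = t->bucket[i];
    t->bucket[i] = p;
}

/* Search only the one bucket the 4-tuple hashes to; NULL if no match. */
struct pcb_sketch *pcb_lookup_sketch(const struct pcb_table_sketch *t,
                                     uint32_t faddr, uint16_t fport,
                                     uint32_t laddr, uint16_t lport)
{
    struct pcb_sketch *p;

    assert(t != NULL);
    for (p = t->bucket[pcb_hash_sketch(faddr, fport, laddr, lport)];
         p != NULL; p = p->hash_next) {
        if (p->faddr == faddr && p->fport == fport &&
            p->laddr == laddr && p->lport == lport)
            return p;
    }
    return NULL;
}
```

    The cost of a lookup then depends on the length of one chain rather
    than on the total number of PCBs.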

</nice_tweak_list>

I'd love to see some of these tuneups appear in the internet stack, as they would
be of such a great benefit to pretty much all network-related servers.

Dan

_____________________________________
Daniel J Blueman
BSc Computation, UMIST, Manchester
Email: blue@challenge.isvara.net
Web: http://www.challenge.isvara.net/

