Date:      Sat, 7 Mar 1998 22:13:31 -0700 (MST)
From:      Marc Slemko <marcs@znep.com>
To:        hackers@FreeBSD.ORG
Subject:   kernel wishlist for web server performance
Message-ID:  <Pine.BSF.3.95.980307214952.2799A-100000@alive.znep.com>

Just a few miscellaneous comments on what I would want in an OS used for a
high-performance web server, for anyone who may be considering
implementing any of it in the future.  I am looking at this from the
perspective of designing the process and IO models for Apache 2.0.  The
plan is for Apache 2.0 to make use of all of these features where they are
supported.


Decent kernel threads.

Async IO (AIO) support that works for sockets.

A sendfile() (e.g. HP-UX 11.x) or TransmitFile() (e.g. Windows NT) system
call.  The key features are:

	- It can transmit an arbitrary length starting from an arbitrary
	  position.  Sending from the current file position is OK, I guess,
	  but it requires a mutex so that multiple threads can start a
	  transfer on the same descriptor at the same time, and it adds the
	  overhead of a seek.
	- An AIO version of this system call would be very useful;
	  NT can do this with its completion ports API.  This is required
	  to avoid having to dedicate a thread to each connection.
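Until such a call exists, the positional semantics asked for in the first point can be roughly approximated in userland. This is only an illustrative sketch: send_range() is a hypothetical helper, not any system's sendfile() API. It reads with pread(), which takes an explicit offset and does not move the descriptor's file position, so threads sharing one descriptor need neither a mutex nor a seek. Unlike a real sendfile(), it still copies every byte through a user buffer, which is exactly the overhead a kernel call would avoid.

```c
#define _POSIX_C_SOURCE 200809L
#include <assert.h>
#include <errno.h>
#include <stdlib.h>
#include <string.h>
#include <sys/types.h>
#include <unistd.h>

/* Hypothetical helper: copy `len` bytes starting at `offset` in `in_fd`
 * to `out_fd`.  pread() leaves the file offset of in_fd untouched, so
 * concurrent callers on the same in_fd do not interfere with each other.
 * Returns the number of bytes sent (less than len if EOF comes first),
 * or -1 on error. */
ssize_t send_range(int out_fd, int in_fd, off_t offset, size_t len)
{
    char buf[8192];
    size_t done = 0;

    while (done < len) {
        size_t want = len - done;
        if (want > sizeof buf)
            want = sizeof buf;

        ssize_t n = pread(in_fd, buf, want, offset + (off_t)done);
        if (n < 0) {
            if (errno == EINTR)
                continue;
            return -1;
        }
        if (n == 0)
            break;                      /* hit end of file early */

        /* write() may be partial on a socket: loop until this chunk is out. */
        ssize_t off = 0;
        while (off < n) {
            ssize_t w = write(out_fd, buf + off, (size_t)(n - off));
            if (w < 0) {
                if (errno == EINTR)
                    continue;
                return -1;
            }
            off += w;
        }
        done += (size_t)n;
    }
    return (ssize_t)done;
}
```

For example, send_range(sock, fd, 7, 5) on a file containing "hello, world" sends "world" and leaves the file offset wherever it was before the call.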

An efficient poll().

Has the MCLBYTES chunking/segment size problem been fixed yet?  That is,
in the 4.4BSD code, data is copied in chunks of MCLBYTES and put on the
send queue, but tcp_output() doesn't know there is more to come, so it
generates lots of small segments.  You end up with a lot of 832- or
608-byte segments on the network for no reason.
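The effect can be illustrated with a deliberately simplified model (an assumption, not the actual 4.4BSD tcp_output() logic): each MCLBYTES chunk is handed to the output routine as soon as it is queued, and the routine cuts it into MSS-sized segments and sends the short tail instead of waiting for the next chunk. model_segments() is a hypothetical name, and the values in the usage note (2048-byte chunks, a 1440-byte MSS) are illustrative only.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical model: `total` bytes are queued in chunks of at most
 * `chunk` bytes, and each chunk is sent immediately as full `mss`-sized
 * segments plus a short tail.  Stores up to `max_segs` segment sizes in
 * `segs` and returns the total number of segments generated. */
size_t model_segments(size_t total, size_t chunk, size_t mss,
                      size_t *segs, size_t max_segs)
{
    size_t count = 0;

    assert(chunk > 0 && mss > 0);   /* zero sizes would never make progress */
    while (total > 0) {
        /* The socket layer queues one chunk ... */
        size_t queued = total < chunk ? total : chunk;
        total -= queued;

        /* ... and the output routine, unaware that more is coming,
         * sends all of it now, leaving a short tail segment. */
        while (queued > 0) {
            size_t seg = queued < mss ? queued : mss;
            if (count < max_segs)
                segs[count] = seg;
            count++;
            queued -= seg;
        }
    }
    return count;
}
```

With those values, 8192 bytes queued in 2048-byte chunks come out as four 1440-byte and four 608-byte segments (eight in total), whereas the same 8192 bytes seen as a single chunk need only six segments.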

Oh, and regarding the problem with slow start, delayed ACKs, and writes
of between 100 and 208 bytes putting two segments on the network,
David Borman had this to say:

>I know what problem you are referring to, but it is not as you describe
>it.  For a non-atomic protocol (like TCP) sosend() will allocate a
>cluster if the data won't fit in the mbuf, even if it is over by only
>one byte.  This puts a small amount of data into a cluster.  It
>doesn't take very many of these small writes until sb->sb_mbcnt bumps
>into sb->sb_mbmax, long before sb->sb_cc hits sb->sb_hiwat.  So, you
>get a socket send buffer without much data to send, which can't accept
>any more data from the user.  You wind up waiting for the delayed ACKs
>from the remote side to clear out buffer space, but it is never enough
>to allow you to get 2 full packets out to get past the delayed ACKs!
>The problem is that sbcompress() does not compress cluster mbufs, to
>avoid excessive data copies.  I've modified sbcompress() (in the next
>release of BSD/OS) to allow cluster mbufs to be compressed, provided
>that all the data can be copied and there is no more than 1/4 of a
>cluster to copy.  This change allows enough data to be copied down
>from the user to get out 2 full packets and thus get past the delayed
>ACKs.  The benchmark will still run slow because of the tiny writes,
>and the extra data copies, but at least it no longer runs an order or
>two of magnitude slower than really tiny writes (<= the size of an mbuf).
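Borman's accounting argument can be checked with a toy model of the two socket-buffer limits (sb_cc against sb_hiwat, sb_mbcnt against sb_mbmax). This is a sketch under stated assumptions, not kernel code: MSIZE of 128 and MCLBYTES of 2048, every write large enough to be given a cluster, and sb_mbmax set to twice sb_hiwat in the usage note. struct sb_model, queue_write() and fill_until_blocked() are hypothetical names.

```c
#include <assert.h>

/* Assumed 4.4BSD-style sizes; real values depend on platform and release. */
#define MSIZE    128L   /* storage charged per mbuf */
#define MCLBYTES 2048L  /* storage charged per cluster */

/* Simplified socket send buffer with its two independent limits. */
struct sb_model {
    long cc, hiwat;      /* queued data bytes and their limit */
    long mbcnt, mbmax;   /* mbuf storage charged and its limit */
    long tail_room;      /* unused bytes in the last cluster */
};

/* Queue one w-byte write.  With compress == 0, every write gets its own
 * mbuf plus cluster, as when sbcompress() skips cluster mbufs.  With
 * compress != 0, the data is copied into the last cluster when it fits
 * and is no more than a quarter of a cluster (the rule Borman describes).
 * Returns 1 if the write fits under both limits, 0 if it would block. */
static int queue_write(struct sb_model *sb, long w, int compress)
{
    if (sb->cc + w > sb->hiwat)
        return 0;
    if (compress && w <= MCLBYTES / 4 && w <= sb->tail_room) {
        sb->tail_room -= w;              /* copied down: no new storage */
    } else {
        if (sb->mbcnt + MSIZE + MCLBYTES > sb->mbmax)
            return 0;
        sb->mbcnt += MSIZE + MCLBYTES;   /* new mbuf + cluster charged */
        sb->tail_room = MCLBYTES - w;
    }
    sb->cc += w;
    return 1;
}

/* Bytes queued by repeated w-byte writes before one would block. */
long fill_until_blocked(long hiwat, long mbmax, long w, int compress)
{
    struct sb_model sb = { 0, hiwat, 0, mbmax, 0 };

    assert(w > 0 && w <= MCLBYTES);
    while (queue_write(&sb, w, compress))
        ;
    return sb.cc;
}
```

With an 8192-byte high-water mark, a 16384-byte storage limit and 150-byte writes, the uncompressed case blocks after seven writes with only 1050 bytes queued, not even one 1460-byte segment's worth.  With the quarter-cluster compression rule, 54 writes (8100 bytes) fit, and it is the data limit rather than the storage limit that stops them.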





