Date: Fri, 13 Jul 2001 10:11:07 -0400
From: Leo Bicknell <bicknell@ufp.org>
To: freebsd-hackers@freebsd.org
Subject: Network performance roadmap.
Message-ID: <20010713101107.B9559@ussenterprise.ufp.org>
After talking with a number of people and reading some more papers,
I think I can put together a better road map for what should be
done to increase network performance.  The good news is that there
are some immediate band-aids.
In particular, I'd like those who are working on committing network
changes to -current to pay attention here, as I can't commit. :-)
Let's go through some new details:
1) FreeBSD's TCP windows cannot grow large enough to allow for
optimum performance.  The primary obstacle to raising them is
that if you do so, the system can run out of MBUF's. Schemes
need to be put in place to limit MBUF usage, and better allocate
buffers per connection.
2) Windows are currently 16k.  A good number of people think 32k
would not cause major problems, and it is in fact in use by many
other OS's at this time.
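For reference, these defaults live in the net.inet.tcp.sendspace
and net.inet.tcp.recvspace sysctls, with kern.ipc.maxsockbuf as the
hard cap.  Here's a minimal sketch that reads the running values
from userland (I'm assuming the tunables are exported long-sized;
adjust the type if your kernel exports ints):

    #include <sys/types.h>
    #include <sys/sysctl.h>
    #include <stdio.h>

    int
    main(void)
    {
        const char *mibs[] = {
            "net.inet.tcp.sendspace",   /* default send buffer */
            "net.inet.tcp.recvspace",   /* default receive buffer */
            "kern.ipc.maxsockbuf",      /* hard per-socket cap */
        };
        int i;

        for (i = 0; i < 3; i++) {
            unsigned long val = 0;
            size_t len = sizeof(val);

            if (sysctlbyname(mibs[i], &val, &len, NULL, 0) == -1) {
                perror(mibs[i]);
                continue;
            }
            printf("%s = %lu\n", mibs[i], val);
        }
        return (0);
    }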
There are a few other observations that have been made that are
important.
A) The receive buffers are hardly used. In fact, data generally
only sits in a receive buffer for one of two reasons. First,
the data has not yet been passed to the application. This amount of
data is generally very small.  Second, out-of-order data will sit
in the buffer waiting for a retransmit of the missing segment.  It is of
course possible that the buffers could be completely full from either
case, but several research papers indicate that receive buffers
rarely use much space at all.
B) When the system runs out of MBUF's, really bad things happen. It
would be nice to make the system handle MBUF exhaustion more
gracefully, or to avoid it entirely.
C) Many people think TCP_EXTENSIONS="YES" gives them windows > 64k.
It does, in the sense that it allows the window scale option, but
it doesn't in that socket buffers aren't changed.
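Worth noting: an application that knows better can already get a
scaled window on its own sockets by asking for a large buffer
before connecting, since the window scale factor is only negotiated
on the SYN.  A minimal sketch (the 256k figure is just an example;
the request still gets clipped to kern.ipc.maxsockbuf):

    #include <sys/types.h>
    #include <sys/socket.h>
    #include <stdio.h>

    int
    open_bigwindow_socket(void)
    {
        int s, bufsize = 256 * 1024;    /* example value */

        s = socket(AF_INET, SOCK_STREAM, 0);
        if (s == -1)
            return (-1);
        /*
         * Must happen before connect()/listen(): the window scale
         * option is only negotiated on the SYN, and only matters
         * if the receive buffer exceeds 64k.
         */
        if (setsockopt(s, SOL_SOCKET, SO_RCVBUF, &bufsize,
            sizeof(bufsize)) == -1)
            perror("SO_RCVBUF");
        return (s);
    }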
From all of this, I propose the following short term road map:
a - Commit higher socket buffer sizes:
    -current:  64k receive  (based on observation A)
               32k send     (based on detail 2)
    -stable:   32k receive  (based on detail 2)
               32k send     (based on detail 2)
I think this can be done more or less immediately.
b - Allow larger receive windows in some cases. In -current
only, if TCP_EXTENSIONS="YES" is configured (turn on RFC1323
extensions) change the settings to:
    1M   kernel limit  (based on observation C)
    256k receive       (based on observations A, C)
    64k  send          (based on observation C)
Note, 64k send is most likely aggressive given the current MBUF
problems. Some later points will address that. For now, the
basic assumption is that people configuring TCP_EXTENSIONS are
clueful people with larger memory machines who also tune things like
MAXUSERS up, so they will probably be ok.
c - Prevent MBUF exhaustion. Today, when you run out of MBUF's, bad
things start to happen. It would be nice to prevent that from
happening, and also to provide sysadmins some warning when it is
about to happen.
This change sounds easy, but I don't know where in the code to start
looking. Basically, there is a bit of code somewhere that decides
if a sending TCP process should block or not. Today this code only
looks to see if that socket's TCP send buffer is full. What I
propose is that it should also check if less than 10% of the MBUF's
are free, and if so also block the sender.
Blocking senders keeps some MBUF's free for receivers (the 10%),
most likely keeping the system from running out. What will happen
is receivers will either read data from the receive buffers, or
data will "drain" from the send buffers until enough MBUF's are
free to unblock the senders.
I believe a message should be logged when this happens (a la a
full file system) so admins know they are running low on MBUF's.
I would think this would only be a couple of line patch to the
function that decides if a particular socket could block. Could
someone more familiar with the code comment?
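To make the shape of the change concrete, here is a rough sketch.
Every name in it is a placeholder: sosend_should_block(),
mbufs_in_use and mbufs_total stand in for whatever the real
function and mbuf counters are actually called:

    #include <sys/param.h>
    #include <sys/systm.h>
    #include <sys/socket.h>
    #include <sys/socketvar.h>
    #include <sys/syslog.h>

    #define MBUF_RESERVE_PCT  10   /* keep 10% free for receivers */

    static int
    sosend_should_block(struct socket *so)
    {
        extern u_long mbufs_in_use, mbufs_total;    /* placeholders */
        u_long mbufs_free = mbufs_total - mbufs_in_use;

        /* The existing check: is this socket's send buffer full? */
        if (sbspace(&so->so_snd) <= 0)
            return (1);

        /* The proposed check: hold back senders near exhaustion. */
        if (mbufs_free * 100 < mbufs_total * MBUF_RESERVE_PCT) {
            /* a la a full file system; ought to be rate-limited */
            log(LOG_WARNING,
                "mbufs nearly exhausted, blocking senders\n");
            return (1);
        }
        return (0);
    }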
d - Prevent gross overbuffering on a sender. In a TCP stream there are
several interesting variables:
---------------------------------------------------------------
   A        B        C        D        e        F        G
A = lowest acknowledged segment
B = highest transmitted segment
C = A + cwin
D = A + win
e = my new variable
F = A + buffer_in_use
G = A + max_buffer_size
Note that the following must always be true:
A <= B <= C <= D, F <= G, G - A <= sendspace
Note, all the capital letter values are readily available, either
directly tracked in variables, or easily computable from variables
that are tracked.
Now, in today's world (for senders) if F < G then we unblock the
sending process, and allow it to put data into the buffer. This
means that in general F = G: we always have a full send buffer.
This is the crux of why we run out of MBUF's when slow clients
are connected.
So, I propose a new value e. This is the desired buffer length.
The first observation is that if the receiver gives us a smaller
window, there's no reason to buffer much more than that window on
our side. That is, e should only be "a little" bigger than D.
So we need a new constant, SPB - socket process buffer, the amount
of data we want to buffer in the kernel from a process for a socket.
I'll propose 8k as a good starting point.
This gives us an upper bound for e, D + SPB.
We also always want to buffer some data. Even if the window (or
other factors I'll talk about next) are 0, we want to buffer
something.  SPB is a good value here.  We also have to be careful
not to exceed the hard limits.  Treating A as zero, so that offsets
and lengths coincide, this gives us:
    SPB <= e <= min(D + SPB, G)
Now, going back to the code.  When we check if a process should
unblock, rather than checking if F < G, it should check if F < e,
and allow e - F bytes to be read in.  This way, if a receiver gives
us a window of 16k (which older FreeBSD boxes will be doing for
quite a while), we buffer at most 16k + SPB bytes.  Not a bad
tradeoff for almost no code!
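In code, with everything measured as byte counts relative to A (so
win = D - A, max = G - A, inuse = F - A), the whole thing might
look like this; illustrative names only:

    #include <sys/types.h>

    #define SPB     (8 * 1024)      /* proposed starting point */

    /* Desired buffer length: SPB <= e <= min(win + SPB, max). */
    static u_long
    desired_buffer(u_long win, u_long max)
    {
        u_long e = win + SPB;   /* never below SPB, since win >= 0 */

        if (e > max)
            e = max;            /* the hard limit always wins */
        return (e);
    }

    /*
     * Unblock the sender only while inuse < e; it may then write
     * at most e - inuse further bytes into the buffer.
     */
    static u_long
    writable_space(u_long inuse, u_long win, u_long max)
    {
        u_long e = desired_buffer(win, max);

        return (inuse < e ? e - inuse : 0);
    }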
Of course, the drawback here is obvious.  Let's say the receiver
advertises a large window, say G sized, but is on a slow/congested
link and can't use it.  We could be overbuffering again.  So we
need a second criterion: we now have a range for e, but how do we
pick a value within it?
I'll borrow from PSC.edu's research here.  It should be in the
range 2 * cwin to 4 * cwin.  So, every time cwin is updated, we
look at e.  If it's less than 2 * cwin, we increase it to 2 * cwin
(or its maximum value), and every time it's greater than 4 * cwin,
we decrease it to 2 * cwin (or its minimum value).
This is in fact their "autotuning", but without the "fair share"
component, which so far everyone seems to think is too complicated
and there should be a better solution. The good news is this part I
think is very little code. We need to track e, so one more
variable. I'd venture the code in the block/unblock section is
probably 4-5 lines, at most, and the code in the cwin update section
is another 4-5 lines, at most. If this all became 20 lines of code
I'd be surprised.
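Roughly, with the same illustrative naming as above (e_min and
e_max being the static bounds from the previous sketch), the
cwin-side update would be:

    #include <sys/types.h>

    /*
     * Run whenever cwin changes: keep e within [2*cwin, 4*cwin],
     * clamped to its static bounds.
     */
    static u_long
    update_desired_buffer(u_long e, u_long cwin,
        u_long e_min, u_long e_max)
    {
        if (e < 2 * cwin) {
            e = 2 * cwin;           /* grow toward the window */
            if (e > e_max)
                e = e_max;
        } else if (e > 4 * cwin) {
            e = 2 * cwin;           /* shrink back down */
            if (e < e_min)
                e = e_min;
        }
        return (e);
    }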
e - Once we have this better management in place, we can go back to new
values.  Assuming d is in -current, I'd then like to see:
    kernel max  1M
    sendspace   256k
    recvspace   256k
f - At this point we can look at a "fair share" replacement. Since we
have the MBUF warning code in c we can get some idea of the cases
where it's needed. The basic premise is you don't have enough
MBUF's, and connections need more buffer space, so how do you fairly
allocate the space that you have?  That's an interesting question,
but it can wait for some other day.  :-)
I would think we could have a, b, and c done by the end of next week,
and d within two weeks, assuming some people familiar with the network
code can help with some pointers.
--
Leo Bicknell - bicknell@ufp.org
Systems Engineer - Internetworking Engineer - CCIE 3440
Read TMBG List - tmbg-list-request@tmbg.org, www.tmbg.org
