Date: Fri, 13 Jul 2001 10:11:07 -0400
From: Leo Bicknell <bicknell@ufp.org>
To: freebsd-hackers@freebsd.org
Subject: Network performance roadmap.
Message-ID: <20010713101107.B9559@ussenterprise.ufp.org>
After talking with a number of people and reading some more papers, I
think I can put together a better road map for what should be done to
increase network performance.  The good news is there are some
immediate bandaids.  In particular, I'd like those who are working on
committing network changes to -current to pay attention here, as I
can't commit. :-)

Let's go through some new details:

1) FreeBSD's TCP windows cannot grow large enough to allow for optimum
   performance.  The primary obstacle to raising them is that if you
   do so, the system can run out of MBUF's.  Schemes need to be put in
   place to limit MBUF usage and to better allocate buffers per
   connection.

2) Windows are currently 16k.  A wide number of people think 32k would
   not cause major issues, and it is in fact in use by many other OS's
   at this time.

There are a few other observations that have been made that are
important:

A) The receive buffers are hardly used.  Data generally sits in a
   receive buffer for one of two reasons.  First, the data has not yet
   been passed to the application; this amount is generally very
   small.  Second, data received after a lost segment sits in the
   buffer waiting for a retransmit to fill the hole.  It is of course
   possible for the buffer to be completely full in either case, but
   several research papers indicate that receive buffers rarely use
   much space at all.

B) When the system runs out of MBUF's, really bad things happen.  It
   would be nice to make the system handle MBUF exhaustion more
   gracefully, or to avoid it altogether.

C) Many people think TCP_EXTENSIONS="YES" gives them windows > 64k.
   It does in the sense that it allows the window scale option, but it
   doesn't in that the socket buffer sizes aren't changed.

From all of this, I propose the following short term road map:

a - Commit higher socket buffer sizes:

      -current:  64k receive (based on observation A)
                 32k send    (based on detail 2)
      -stable:   32k receive (based on detail 2)
                 32k send    (based on detail 2)

    I think this can be done more or less immediately.

b - Allow larger receive windows in some cases.  In -current only, if
    TCP_EXTENSIONS="YES" is configured (turning on the RFC 1323
    extensions), change the settings to:

      1M   kernel limit (based on observation C)
      256k receive      (based on observations A and C)
      64k  send         (based on observation C)

    Note, 64k send is most likely aggressive given the current MBUF
    problems; some later points will address that.  For now, the basic
    assumption is that people configuring TCP_EXTENSIONS are clueful
    people with larger-memory machines who also tune things like
    MAXUSERS up, so they will probably be ok.
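For anyone who wants to experiment before any of this is committed,
items a and b can be approximated today from userland.  A sketch for
/etc/sysctl.conf; the knob names are the stock sysctl's, and the
values are simply the ones proposed above:

      # item a equivalents (the -current proposal)
      net.inet.tcp.recvspace=65536
      net.inet.tcp.sendspace=32768

      # item b equivalents, for RFC 1323 machines only
      net.inet.tcp.rfc1323=1
      kern.ipc.maxsockbuf=1048576
      #net.inet.tcp.recvspace=262144
      #net.inet.tcp.sendspace=65536

The difference, of course, is that a sysctl only changes the defaults
on one box; the road map is about changing the defaults we ship.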
c - Prevent MBUF exhaustion.

    Today, when you run out of MBUF's, bad things start to happen.  It
    would be nice to prevent that from happening, and also to give
    sysadmins some warning when it is about to happen.

    This change sounds easy, but I don't know where in the code to
    start looking.  Basically, there is a bit of code somewhere that
    decides if a sending TCP process should block or not.  Today this
    code only looks to see if that socket's TCP send buffer is full.
    What I propose is that it should also check if less than 10% of
    the MBUF's are free, and if so block the sender as well.  Blocking
    senders keeps some MBUF's (the 10%) free for receivers, most
    likely keeping the system from running out.  Receivers will then
    either read data from the receive buffers, or data will "drain"
    from the send buffers, until enough MBUF's are free to unblock the
    senders.

    I believe a message should be logged when this happens (a la a
    full file system) so admins know they are running low on MBUF's.

    I would think this would only be a couple-of-line patch to the
    function that decides if a particular socket should block.  Could
    someone more familiar with the code comment?
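To make the idea concrete, here is a minimal sketch of the test,
written as a standalone C predicate since I don't know yet which
kernel function it belongs in.  The function and parameter names are
mine, not the kernel's; in a real patch the counts would come from the
mbuf allocator's statistics:

      #define MBUF_RESERVE_PCT 10     /* hold back 10% of MBUF's */

      /*
       * Should a sender block?  Block if its send buffer is full
       * (the only check made today), or if less than
       * MBUF_RESERVE_PCT percent of all MBUF's are free (the new
       * check).  When the second test fires, the caller would also
       * log a warning, a la a full file system.
       */
      static int
      sender_should_block(long sb_space_left,
          unsigned long mbufs_free, unsigned long mbufs_total)
      {
              if (sb_space_left <= 0)
                      return (1);
              return (mbufs_free * 100 < mbufs_total * MBUF_RESERVE_PCT);
      }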
d - Prevent gross overbuffering on a sender.

    In a TCP stream there are several interesting values:

      A        B        C        D      e      F                    G
      ----------------------------------------------------------------

      A = lowest acknowledged segment
      B = highest transmitted segment
      C = A + cwin
      D = A + win
      e = my new variable
      F = A + buffer_in_use
      G = A + max_buffer_size

    Note that the following must always be true:

      A <= B <= C <= D,  F <= G,  G - A <= sendspace

    All the capital-letter values are readily available, either
    directly tracked in variables or easily computable from variables
    that are tracked.

    Now, in today's world (for senders), if F < G we unblock the
    sending process and allow it to put data into the buffer.  This
    means that in general F = G: we always have a full send buffer.
    This is the crux of running out of MBUF's when you have slow
    clients connected.

    So I propose a new value, e, the desired buffer length.  The
    first observation is that if the receiver gives us a small window,
    there's no reason to buffer much more than that window on our
    side; that is, e should only be "a little" bigger than D.  So we
    need a new constant, SPB (socket process buffer), the amount of
    data we want to buffer in the kernel from a process for a socket.
    I'll propose 8k as a good starting point.  This gives us an upper
    bound for e of D + SPB.

    We also always want to buffer some data.  Even if the window (or
    the other factors I'll talk about next) is 0, we want to buffer
    something; SPB is a good value here.  We also have to be careful
    not to exceed the hard limits.  This gives us:

      SPB <= e <= min(D + SPB, G)

    Now, going back to the code.  When we check whether a process
    should unblock, rather than checking F < G, we check F < e (where,
    so far, e = D + SPB), and allow e - F bytes to be read in.  This
    way, if a receiver gives us a window of 16k (which older FreeBSD
    boxes will be doing for quite a while), we buffer at most
    16k + SPB bytes.  Not a bad tradeoff for almost no code!

    Of course, the drawback here is obvious.  Say the receiver
    advertises a large window, G sized, but is on a slow/congested
    link and can't use it.  We could be overbuffering again.  So we
    need a second criterion.  We now have a range for e; how do we
    pick a value within it?  I'll borrow from PSC.edu's research here:
    e should be in the range 2 * cwin to 4 * cwin.  So every time cwin
    is updated, we look at e.  If it's less than 2 * cwin, we increase
    it to 2 * cwin (or its maximum value), and every time it's greater
    than 4 * cwin, we decrease it to 2 * cwin (or its minimum value).
    This is in fact their "autotuning", but without the "fair share"
    component, which so far everyone seems to think is too complicated
    and should have a better solution.

    The good news is I think this part is very little code.  We need
    to track e, so that's one more variable.  I'd venture the code in
    the block/unblock section is 4-5 lines at most, and the code in
    the cwin update section another 4-5 lines at most.  If this all
    became 20 lines of code I'd be surprised.
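Here is a rough sketch of d, again as standalone C so the logic can be
eyeballed.  Everything is expressed in byte offsets from A, so no
sequence-number arithmetic is needed; the struct and function names
are mine for illustration, and in the kernel e would live alongside
the other per-connection send variables:

      #define SPB 8192                /* proposed starting point */

      struct send_state {
              unsigned long win;      /* D - A: advertised window */
              unsigned long cwin;     /* C - A: congestion window */
              unsigned long buffered; /* F - A: bytes in send buffer */
              unsigned long space;    /* G - A: hard buffer limit */
              unsigned long e;        /* desired buffering (new) */
      };

      static unsigned long
      ulmin(unsigned long a, unsigned long b) { return (a < b ? a : b); }

      static unsigned long
      ulmax(unsigned long a, unsigned long b) { return (a > b ? a : b); }

      /* Keep e within SPB <= e <= min(win + SPB, space). */
      static unsigned long
      clamp_e(const struct send_state *ss, unsigned long e)
      {
              return (ulmax(SPB,
                  ulmin(e, ulmin(ss->win + SPB, ss->space))));
      }

      /* Run when cwin changes: steer e toward [2*cwin, 4*cwin]. */
      static void
      cwin_updated(struct send_state *ss)
      {
              if (ss->e < 2 * ss->cwin || ss->e > 4 * ss->cwin)
                      ss->e = clamp_e(ss, 2 * ss->cwin);
      }

      /* The block/unblock test: bytes the process may still queue. */
      static unsigned long
      send_budget(const struct send_state *ss)
      {
              return (ss->buffered < ss->e ? ss->e - ss->buffered : 0);
      }

As advertised, it's tiny: one new field and two short tests, and the
budget naturally shrinks for receivers with small windows or for
congested paths.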
e - Once we have this better management in place, we can go back and
    raise the values.  Assuming d is in -current, I'd then like to
    see:

      kernel max  1M
      sendspace   256k
      recvspace   256k

f - At this point we can look at a "fair share" replacement.  Since we
    have the MBUF warning code from c, we can get some idea of the
    cases where it's needed.  The basic premise is: you don't have
    enough MBUF's, and connections need more buffer space, so how do
    you fairly allocate the space that you have?  That's an
    interesting question, but it can wait for some other day. :-)

I would think we could have a, b, and c done by the end of next week,
and d within two weeks, assuming some people familiar with the network
code can help with some pointers.

-- 
Leo Bicknell - bicknell@ufp.org
Systems Engineer - Internetworking Engineer - CCIE 3440
Read TMBG List - tmbg-list-request@tmbg.org, www.tmbg.org