Date: Wed, 11 Jul 2001 19:50:21 -0400
From: Leo Bicknell <bicknell@ufp.org>
To: freebsd-hackers@freebsd.org
Subject: Network performance tuning.
Message-ID: <20010711195021.A89324@ussenterprise.ufp.org>

I'm going to bring up a topic that is sure to spark a great debate (read: flamefest), but I think it's an important issue. I've put my nomex on, let's see where this goes.

I work for an international ISP. One of the customer complaints that has been on the rise is poor transfer rates across our network. When these come up, I'll often get called in to investigate. Over the past 2-3 years there has been an alarming increase in these complaints, and what disturbs me more is that there is a simple solution 99% of the time: increase the TCP window size.

Admittedly, my environment is a bit rare. This generally comes from colo customers who have two 100Mbps-connected beefy servers on opposite coasts and can't understand why around 100k/sec is the best transfer rate they can get. If only we all had uncongested 100Mbps connections! Anyway, after having them up the window size on their machines, we can, if necessary, get them up to full 100Mbps across the country (I have logs of 9.98MB/sec FTP's coast to coast, if anyone wants them).

So, I decided it was time to pick on FreeBSD. There are a number of reasons; chief among them is that virtually all other OS's now have larger default window sizes (and thus offer better performance) than FreeBSD out of the box. A secondary reason is that there are, for the first time, real end users, in the form of cable modem subscribers, being hit by this same issue.

Let's cut to the nitty gritty. This is all limited by the bandwidth * delay product: you can ship one window per rtt, and all that. If you don't understand this already go read about TCP then come back to this message.
:-)

FreeBSD's current default is 16384 bytes for the window, giving us the following limits on performance:

    Lan                    1ms rtt = 15 MB/sec
    Coast to Coast        65ms rtt = 246 KB/sec
    Coast to Coast        85ms rtt = 188 KB/sec
    East Coast to Japan  155ms rtt = 103 KB/sec
    London to Japan      225ms rtt = 71 KB/sec
    T1 Satellite Link    500ms rtt = 32 KB/sec

So, inside the US, the current window, 16k, lets a single connection just fill a T1, more or less. Note that these numbers assume optimal conditions; you may see a degradation of up to 50% from those numbers when bandwidth is available but there is high jitter, or packets are reordered. I wonder how many people are discontinuing DirectPC service because they can't get over 32 KB/sec downloads from their "T1 speed" satellite service.

One of the first responses I often get to this issue is "so what, system administrators can increase the values". This is true; however, I think it's time to address the defaults. There are a number of reasons for this:

* The window must be increased on BOTH ends of a TCP connection. All the server admins in the world can do this, but if end users don't it is useless. Conversely, end users who do this now won't see a speed up unless all the server admins change the settings.
* FreeBSD is at the middle-bottom of the pack when it comes to defaults. http://www.psc.edu/networking/perf_tune.html
* Users are slowly getting faster connections (T1 DSL, T1 Satellite, 10 Mbps cable modems) that need larger values.
* The way to get around this limit from a user's point of view is to write custom apps that up the values using the socket calls. Hard coding window sizes into apps is a poor solution.

Unfortunately this is where things get really interesting. If you want to, say, support a 100Mbps transfer over a single TCP connection you need a buffer around 1 Meg. That's a lot of buffer. That said, most large servers, and even end user workstations, could devote 1 Meg to the network if it meant 100Mbps performance.
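
For anyone who wants to check the arithmetic, here is a rough Python sketch of the window / rtt limit and of the bandwidth * delay product. The helper names (max_throughput, window_needed) are made up for illustration; the 16384-byte window and the rtt figures are the ones quoted above, and KB is taken as 1024 bytes, which appears to be how the figures above were computed.

```python
# Sketch of the window / rtt arithmetic; helper names are illustrative only.

def max_throughput(window_bytes, rtt_seconds):
    """Upper bound for one connection: one window per round trip, in bytes/sec."""
    return window_bytes / rtt_seconds

def window_needed(link_bits_per_sec, rtt_seconds):
    """Bandwidth * delay product: bytes that must be in flight to fill the link."""
    return link_bits_per_sec / 8 * rtt_seconds

WINDOW = 16384  # the default window quoted above, in bytes

for name, rtt_ms in [("Coast to Coast", 65), ("Coast to Coast", 85),
                     ("East Coast to Japan", 155), ("London to Japan", 225),
                     ("T1 Satellite Link", 500)]:
    kb_per_sec = max_throughput(WINDOW, rtt_ms / 1000) / 1024
    print(f"{name:20s} {rtt_ms:4d}ms rtt = {kb_per_sec:4.0f} KB/sec")

# 100Mbps at an 85ms rtt needs roughly 1 Meg in flight.
print(window_needed(100e6, 0.085))
```

The same window_needed call with a different rtt shows how quickly the required buffer grows with distance, which is why a single fixed default fits so few paths.
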
Sadly, this has unintended consequences. If you dig down in the TCP stack, you find a problem. When a socket is created in FreeBSD (and I presume many other BSD's as well) its buffer limits are set (soreserve). The behavior today is to set them to the system default values at socket creation time.

So, what happens is a dial-up user connects to a web server to download an MP3 file. The socket sets aside a 1 Meg buffer, the web server dumps 1 Meg into it, and then the kernel has to keep that 1 Meg around in MBUF's until it can dribble out to the end user. No surprise, you run out of MBUF's in a hurry.

There are a number of issues that come out of this:

* MBUF's are currently allocated based on NMBCLUSTERS, which is based on MAXUSERS (unless overridden). NMBCLUSTERS is found using the formula 512 + MAXUSERS * 16. This formula has been in use for a long time, and it may be time to consider allocating a few more clusters per user. The number of MBUF's is 4 * NMBCLUSTERS, which is a fine number, but testing shows it gives you too many MBUF's in many cases. (Or, put another way, most every system I've seen shows a trend of running out of clusters way before MBUF's.)
* The socket layer needs to be more intelligent about its buffering. Simply always allocating the largest buffer is easy to code, but wastes considerable resources, particularly on machines with lots of connections.

So, I'd like to propose some fixes to get people thinking. I have listed them in the order I think they should be done:

1) The per-socket defaults should be raised to 32k in the next release, giving 2x today's performance in general, and putting FreeBSD on par at least with most Linux distro's. I think the memory consequences here are quite minor, and provide a good place to study the effects on real world people.

2) The socket layer needs to be modified to not use the maximum buffer as the default. Imagine if disk drivers allocated 4 Meg for every process writing to disk, just because the disk has a 4 Meg cache.
The buffer clearly needs to hold all unacknowledged data, and should therefore grow as the window size grows, plus some overhead so that some unsent data can be buffered in the kernel (to avoid context switches and the like). This way connections to slow hosts (e.g. dial-up users) would not buffer much more than the window size, using only a small amount of memory. This would allow admins to set the sizes much larger without wasting memory on connections that will never use it.

Note, from looking at soreserve and related code it appears it just sets maximums, and that raising it midstream would have no ill effects. (Reducing would.) So a good first stab might be to have a new "initial socket buffer" size passed to soreserve when a new socket is created. If the TCP window could be increased past that value at any point, soreserve could be called again (or a new resize function created) to raise the limit to 2 * maxwin, or 1.1 * maxwin, or maxwin + buffer, or whatever is appropriate, up to the hard limit set by the system administrator.

3) The number of MBUF's needs to be increased. Ideally this should be dynamically changeable, which it is not today. As the net gets faster, users need more network resources per user, hence more MBUF's. Also, I wonder if it should be determined from MAXUSERS at all. It is in fact related to the maximum number of simultaneous network connections, and it might make more sense to base it off that, with a default based on MAXUSERS (but larger).

Point #2 is very critical. Right now it means someone who runs a web server must leave the values fairly low (probably ok for serving dial-up and DSL users) to not run out of MBUF's, and so, short of a lot of hackery, can't get high speed transfers on the nightly backup run, or content distribution run across the network. Buffers need to be more dynamically scaled to individual connections.
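
To make proposal 2 a little more concrete, here is a toy Python model of the sizing policy. It is not FreeBSD kernel code and does not call soreserve; the function names, the 16k initial size, the 1 Meg hard limit and the 2048-byte cluster size are all assumptions chosen for illustration. Only the 512 + MAXUSERS * 16 formula and the "2 * maxwin, up to the hard limit" idea come from the text above.

```python
# Toy model of proposal 2; not FreeBSD kernel code. All names are hypothetical.

def nmbclusters(maxusers):
    """Default cluster count from the formula quoted earlier: 512 + MAXUSERS * 16."""
    return 512 + maxusers * 16

def buffer_limit(initial, max_window, hard_limit, factor=2):
    """Per-socket buffer limit under the proposed policy.

    max_window is the largest TCP window seen so far on the connection, so
    the result only ever grows: it starts at the initial size, follows
    factor * max_window, and is capped at the administrator's hard limit.
    """
    return min(hard_limit, max(initial, factor * max_window))

INITIAL = 16 * 1024        # assumed "initial socket buffer" size
HARD_LIMIT = 1024 * 1024   # assumed administrator-set maximum (1 Meg)
CLUSTER_BYTES = 2048       # assumed size of one mbuf cluster

# A dial-up connection whose window stays at 4k keeps the small initial limit.
slow = buffer_limit(INITIAL, 4 * 1024, HARD_LIMIT)
# A fast coast-to-coast connection grows until it hits the hard limit.
fast = buffer_limit(INITIAL, 600 * 1024, HARD_LIMIT)
print(slow, fast)

# How many completely full 1 Meg buffers fit in the default cluster pool
# when MAXUSERS is 32, versus how many full initial-size buffers.
pool = nmbclusters(32)
print(pool // (HARD_LIMIT // CLUSTER_BYTES), pool // (INITIAL // CLUSTER_BYTES))
```

The point of the model is only the shape of the policy: slow connections never grow past the small initial limit, so the large hard limit costs memory only on connections that can actually use it.
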
So, bottom line, in the end I would like a FreeBSD host that out of the box can get 2-4 MBytes/sec across country (or better), but that manages it in such a way that your standard web server running on a FreeBSD box doesn't fall over. Is it just a pipe dream, or can we make that happen with a little effort?

-- 
Leo Bicknell - bicknell@ufp.org
Systems Engineer - Internetworking Engineer - CCIE 3440
Read TMBG List - tmbg-list-request@tmbg.org, www.tmbg.org

To Unsubscribe: send mail to majordomo@FreeBSD.org with "unsubscribe freebsd-hackers" in the body of the message