Date: Tue, 01 Jan 2002 05:22:59 -0800
From: Terry Lambert <tlambert2@mindspring.com>
To: "Louis A. Mamakos" <louie@TransSys.COM>
Cc: Matthew Dillon <dillon@apollo.backplane.com>,
    Julian Elischer <julian@elischer.org>,
    Mike Silbersack <silby@silby.com>,
    Josef Karthauser <joe@tao.org.uk>,
    Tomas Svensson <tsn@gbdev.net>,
    freebsd-hackers@FreeBSD.ORG
Subject: Re: FreeBSD performing worse than Linux?
Message-ID: <3C31B833.FF98C07@mindspring.com>
References: <Pine.BSF.4.21.0112311225150.94344-100000@InterJet.elischer.org>
    <200112312327.fBVNRt719835@whizzo.transsys.com>
    <200201010043.g010h0i36281@apollo.backplane.com>
    <3C311AC9.99B5FC9C@mindspring.com>
    <200201010246.g012ko721041@whizzo.transsys.com>
"Louis A. Mamakos" wrote: > > Disabling Nagle's algorithm for no good reason has very poor > scaling behavior. This is what happens when TCP_NODELAY is > enabled on a socket. Disabling Nagle's algorithm for a good reason would still result in the observed failure, however. > If you look at the work function for most network elements, the part > that runs out of gas first is per-packet forwarding performance. Sure, > you need to have adequate bus bandwidth to move stuff through a box, > but it's performing per-packet forwarding operations and policy which > is the resource that's most difficult to make more of. I think this is > true for toy routers based on PC platform as well as high-end boxes like > the Cisco 12000 series. Juniper managed adequate forwarding performance > using specialized ASIC implementions in the forwarding path. Of this > statement, I'm sure; in my day job at UUNET, I talk to all the major > backbone router vendors, and forwarding performance (and also > reasonable routing protocol implementions) is a show-stopper > requirement they labor mightily over. PCI is sufficient to keep a Gigabit interface saturated, even without going to jumbograms. I have personally saturated such an interface. PCI-X will scale to 8 Gigabits. > So here was have a mechanism with wonderful properties - it's a > trivial yet clever implementation of a self tuning mechanism to > prevent tinygrams from being generated by a TCP without all manner > of complicated timers. It give great performance on LAN and other > high-speed interconnects where remote echo type applications are > demanding, yet over long delay paths where remote echo is gonna suck > no matter what you do, it automatically aggregates packets. As a bandwidth provider, UUNET is more concerned with aggregate throughput; this means it cares more about the moving average of packets getting through than it does about *my* packets getting through. When it comes to a conflict of interest, you will understand my preferences are for my own interests... 8-). > Nagle's algorithm and Van Jacobson's slow-start algorithm allowed the > Internet to survive over congested paths. And they did so with > a bunch of self-tuning behavior independent of the bandwidth*delay > product of the path the connection was running over. It was and is > amazing stuff. Yes, I'm well aware of bandwidth delay product calculations; it's the primary mechanism behind the rate halving algorithm I keep pointing to (Hoe, Jacobson). 8-). It's also the primary limitation on connection speed (remember that it was my FreeBSD machine that was able to get to 1.6M connections, with standard sockets). > Likewise, the original problem in this thread is likely caused by some > part of the USB Ethernet implementation having inadequate per-packet > resources. It's probably not about the number of bytes, but the number of > transactions. You see here a modern reimplementation of essentially the same > problem that the 3COM 3C501 ISA ethernet card had 15 years ago - back to > back packets were consistantly dropped because of the poor per-packet > buffering implementation. It was absolutely repeatable. Clearly. I think that's well established, since no one has squawked about the FreeBSD USB driver or the PC USB hardware being slower than the dongle USB hardware... 
> Sure, it's "legal" to generate streams of tinygrams and not use Nagle's
> algorithm to aggregate the sender's traffic, but it's just plain rude
> and on low bandwidth links, it sucks because of all the extra 40 byte
> headers you're carrying around.

I understand this.  But Nagle is not the only mechanism which would
fix the problem.  Given that it's *intentionally* possible *and
permitted* to turn Nagle off (via TCP_NODELAY), it makes sense to look
at another mechanism that is not susceptible to being turned off.

> I'm sure TCP_NODELAY got added because it sounds REALLY C00L to make
> the interactive thing go better.  But clearly people don't understand
> the impact of turning on the cleverly named option and how it probably
> doesn't really improve things.

I'm pretty sure it got added to address interactive response on
intrinsically small packets; telnet was probably the number one
reason, but small interactive requests over TCP -- non-pipelined HTTP
requests, SMTP server responses, FTP control channel traffic, and the
like -- also benefit significantly from turning Nagle off.  Nagle
almost suggests it, in the original paper.

Since OpenSSH bloats the payload so much, turning Nagle off is much
less necessary for performance there than it would be for Telnet,
though you will likely see least common multiples of the MTU cause
occasional burstiness.

Probably in this case, it would make sense to increase the decay rate
of the timer based on the amount of data that is pending, relative to
the MTU: defeat Nagle when the pipe is full, and partially defeat it
down to, say, the pipe being half full.  Thus large streams not
ending on MTU boundaries would not suffer under Nagle.  (A sketch of
what I mean is in the P.S. below.)

Remember also that Matt recently shot Reno for performance reasons,
when compared to Linux, when he should probably have simply cranked
the initial window size to 3 (Jacobson) and added piggy-back ACKs
(Mogul).  While I'm sure the shooting is actually temporary, and
eventually it will end up back on once the performance issue is
addressed correctly, realize that there is heavy pressure in the form
of benchmarks to deal with, where speed is more important.

In fact, until Windows 2000 (which has a BSDI ported stack that cost
them ~$3M in fees), the Microsoft stacks routinely violated the RFCs
on connection closing, using RSTs to avoid the FIN_WAIT_2 issue,
while at the same time potentially leaving the peer stuck forever
(since RSTs are not retransmitted); you still have to set a registry
entry to get correct behaviour, even today.

I think with much faster links becoming common (802.11e is 5GHz over
the air [Apple trademark "GigaWire"]), we will see the MSL drop
significantly; we already have problems with sequence number
recycling with the random jump forward for "security" purposes to
avoid session take-over -- even though we all know that end-to-end
doctrine means that security is not correctly handled, if implemented
at that layer.  2MSL is already incredibly huge, compared to the
cycle time for 32-bit sequence numbers on a 1Gbit link (see the
arithmetic in the P.P.S. below).

I really like the self-clocking in the rate halving algorithm, but I
guess that's pretty obvious by now. 8-).

-- Terry
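P.S.: To make the timer decay idea above concrete, a hypothetical
sketch -- not real FreeBSD code; the function, its parameter names,
and the linear scaling are all made up for illustration.  The point
is just that the hold-off shrinks as the pending data approaches an
MTU's worth:

    /*
     * Hypothetical sketch: scale the Nagle hold-off by how full the
     * pipe is.  "pending" is bytes queued for transmission, "mtu" is
     * the interface MTU, and "base_timeout" is however long we would
     * otherwise be willing to hold a tinygram (in ticks).  None of
     * these names exist in the real stack.
     */
    int
    nagle_holdoff(long pending, long mtu, int base_timeout)
    {
            if (pending >= mtu)             /* pipe full: send now */
                    return (0);
            if (pending >= mtu / 2)         /* half to full: decay */
                    return ((int)(base_timeout *
                        (mtu - pending) * 2 / mtu));
            return (base_timeout);          /* mostly empty: full Nagle */
    }

At pending == mtu/2 this degenerates to plain Nagle, and it decays
linearly to zero hold-off as the pending data reaches a full MTU.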
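P.P.S.: The sequence space arithmetic, for the curious.  A standalone
back-of-the-envelope calculation (nothing FreeBSD-specific in it)
showing that at 1 Gbit/s the 32-bit sequence space wraps well inside
2MSL, even with FreeBSD's 30-second MSL (RFC 793 calls for a
two-minute MSL):

    #include <stdio.h>

    int
    main(void)
    {
            double seqspace = 4294967296.0; /* 2^32 sequence numbers */
            double rate = 1.0e9 / 8.0;      /* 1 Gbit/s in bytes/sec */
            double msl2 = 2.0 * 30.0;       /* 2MSL with 30-second MSL */

            printf("sequence space wraps in %.1f seconds\n",
                seqspace / rate);           /* ~34.4 seconds */
            printf("2MSL is %.1f seconds\n", msl2);
            return (0);
    }

So a segment can legally outlive a full trip around the sequence
space, which is exactly the problem.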