From: Mohit Aron
Message-Id: <199911102211.QAA12891@cs.rice.edu>
Subject: FreeBSD networking problems
To: freebsd-net@freebsd.org, wollman@freebsd.org, jlemon@freebsd.org,
    julian@freebsd.org, ee@freebsd.org, bright@wintelcom.net
Date: Wed, 10 Nov 1999 16:11:49 -0600 (CST)

Hi,

I've noticed several problems with FreeBSD networking performance under
WAN conditions in the course of my experiments. I mailed them to Alfred
Perlstein, who suggested that I post them to this list. I'm listing them
below.

Problems with WAN emulation in lab environments:

1) FreeBSD tries to determine the maximum size of socket buffers from
   cached routing information. It does this even when an application
   has asked for a large socket buffer. The result is that you usually
   end up with the socket buffer size left over from an earlier TCP
   connection (say telnet), which is usually very small. The code in
   question is under 'ifdef RTV_SPIPE' and 'ifdef RTV_RPIPE' in
   sys/netinet/tcp_input.c. For my experiments, I usually undefine
   RTV_SPIPE and RTV_RPIPE in tcp_input.c. A more complete discussion
   is given in a PR that I filed a while back, which can be viewed at:
   http://www.freebsd.org/cgi/query-pr.cgi?pr=11966

2) TCP bug - the FreeBSD implementation does not scale the advertised
   window immediately when it discovers that window scaling is being
   used.
   The result is that, irrespective of the advertised window, a FreeBSD
   TCP sender cannot send more data in the first round trip after
   connection establishment than the unscaled value of the advertised
   window. The fix is the following patch to tcp_input.c (taken from
   FreeBSD-3.3-RELEASE):

--- /sys/netinet/tcp_input.c	Sun Aug 29 11:29:54 1999
+++ tcp_input.c	Wed Nov 10 15:39:49 1999
@@ -857,6 +857,9 @@
 				(TF_RCVD_SCALE|TF_REQ_SCALE)) {
 				tp->snd_scale = tp->requested_s_scale;
 				tp->rcv_scale = tp->request_r_scale;
+
+				tp->snd_wnd <<= tp->snd_scale;
+				tiwin = tp->snd_wnd;
 			}
 			/* Segment is acceptable, update cache if undefined. */
 			if (taop->tao_ccsent == 0)

   One can argue that this is not important given that TCP does
   slow-start in the first round trip. Well, people are looking at
   rate-based pacing, where you don't have to do slow-start. The above
   is also important in LANs, where FreeBSD doesn't use slow-start. In
   my case, I'm emulating a WAN to see the benefits of rate-based
   pacing, so this is extremely important.

3) sbappend is unscalable. I posted this earlier on freebsd-net; the
   message can be obtained from the archive at:
   http://docs.freebsd.org/cgi/getmsg.cgi?fetch=58270+0+archive/1999/freebsd-net/19991010.freebsd-net
   Here's a suggested fix. Maintain one additional pointer to the last
   packet in the send socket buffer (Alfred already has a patch for
   this, available from
   http://www.freebsd.org/~alfred/sockbuf-3.3-release.diff). However,
   this is not sufficient because, for TCP, the data is maintained in a
   single chain of mbufs headed by a single packet header mbuf. Thus an
   additional pointer in each packet header mbuf is needed. Perhaps the
   m_pkthdr.rcvif field can be used for this purpose - this field is
   not used for outbound packets. One can modify the mbuf data
   structure to replace this field with a union whose other element has
   a name like m_pkttail.
   Otherwise, if increasing the size of the data structure is not a
   concern, a completely new field can be added, which could allow
   mbufs to be maintained in a TAILQ.

4) FreeBSD-3.x introduced a limit on the maximum number of sockets (it
   can be viewed with 'sysctl kern.ipc.maxsockets' - it's typically
   less than 5000 and depends on MAXUSERS). The reason for this limit
   is the new zone allocator scheme introduced in FreeBSD-3.x. I've
   shown in a prior paper
   (http://www.cs.rice.edu/~aron/papers/rice-TR99-335.ps.gz) that a
   busy webserver can have up to 50000 open connections, so a limit of
   just 5000 sockets is going to give dismal server performance. The
   big number is due to connections in the TCP TIME_WAIT state. The
   paper also proposes an alternative fix where the TIME_WAIT state
   operates with a minimal amount of state.

5) The interface queues need to be increased from the default of 50
   packets (defined as IFQ_MAXLEN in sys/net/if.h). I normally increase
   this value to 1000. A busy webserver can easily overflow the default
   of 50. A larger queue is also important for my lab tests with WAN
   conditions (although that alone is not a case for increasing it in
   the general FreeBSD distribution). Consider a 100Mbps link with a
   round-trip delay of 100ms. It can hold up to 833 packets. In a lab
   environment, these can be queued up in the driver, hence the need
   for a longer interface queue. Additionally, FreeBSD-3.x introduced a
   change to the fxp driver (in sys/pci/if_fxp.c) where it ignores the
   IFQ_MAXLEN setting for the output driver queue and instead sets it
   to the number of its own transmit buffers (127 by default). I think
   this feature should be removed - the older FreeBSD-2.2.x handed
   further packets (beyond 127) to the driver only once there was room,
   and all others were queued up in the interface queue, whose length
   was determined by IFQ_MAXLEN.

6) The value of SB_MAX (defined in sys/sys/socketvar.h) needs to be
   increased from the default of 256K.
   In my WAN experiments, the bandwidth-delay product was 1250K - I
   think SB_MAX should be increased to at least this value because high
   bandwidths in WANs are just around the corner. Moreover, raising
   SB_MAX to this value doesn't mean that this much memory is going to
   be reserved for each socket - only that applications that need such
   memory can use it.

I earlier posted some additional tuning parameters for running
webservers on FreeBSD. These are available from:
http://docs.freebsd.org/cgi/getmsg.cgi?fetch=131178+0+archive/1999/freebsd-net/19990725.freebsd-net

- Mohit
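
As a rough check on the numbers in points 5 and 6, the sketch below
works out the bandwidth-delay product of a 100Mbps link with a 100ms
round trip, the number of packets that amount of data corresponds to,
and whether it fits under the 256K SB_MAX default. The function and
macro names are made up for illustration and are not from the FreeBSD
source. The 1500-byte packet size is an assumption (the message does not
state one), as is reading "256K" as 256 * 1024 bytes and treating the
1250K figure in point 6 as coming from the same 100Mbps / 100ms link.

```c
#include <stdint.h>

/* Assumed Ethernet-sized packet; the message does not give a size. */
#define PKT_BYTES 1500ULL

/* The "256K" default quoted for SB_MAX, read here as 256 * 1024 bytes
 * (an assumption; hypothetical name, not the kernel macro itself). */
#define SB_MAX_DEFAULT_BYTES (256ULL * 1024ULL)

/* Bandwidth-delay product in bytes for a link of bits_per_sec with a
 * round-trip time of rtt_ms. Integer division rounds down. */
static uint64_t bdp_bytes(uint64_t bits_per_sec, uint64_t rtt_ms)
{
    return bits_per_sec / 8 * rtt_ms / 1000;
}

/* Whole packets of PKT_BYTES that fit in the given number of bytes. */
static uint64_t bdp_packets(uint64_t bytes)
{
    return bytes / PKT_BYTES;
}

/* Nonzero if a socket buffer capped at sb_max cannot hold bytes. */
static int exceeds_sb_max(uint64_t bytes, uint64_t sb_max)
{
    return bytes > sb_max;
}
```

With these assumptions, bdp_bytes(100000000, 100) gives 1250000 bytes,
bdp_packets(1250000) gives 833, and exceeds_sb_max(1250000,
SB_MAX_DEFAULT_BYTES) is true, which is consistent with the 833 packets
and 1250K quoted above.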