From: Matthew Macy <mmacy@nextbsd.org>
To: "freebsd-net@freebsd.org"
Date: Thu, 26 Nov 2015 17:57:35 -0800
Subject: TCP notes and incast recommendations

In an effort to be somewhat current on the state of TCP I've collected a small bibliography. I've tried to summarize the RFCs and papers that I believe to be important and to provide some general background for others who do not have a deeper familiarity with TCP or congestion control - in particular as it impacts DCTCP. The recommendations reference Phabricator changes.

Table Of Contents:

I)    - A Roadmap for Transmission Control Protocol (TCP) Specification Documents (RFC 7414)
II)   - Metrics for the Evaluation of Congestion Control Mechanisms (RFC 5166)
III)  - TCP Congestion Control (RFC 5681)
IV)   - Computing TCP's Retransmission Timer (RFC 6298)
V)    - Increasing TCP's Initial Window (RFC 6928)
VI)   - TCP Extensions for High Performance [RTO updates and changes to RFC 1323] (RFC 7323)
VII)  - Updating TCP to Support Rate-Limited Traffic [Congestion Window Validation] (RFC 7661)
VIII) - Active Queue Management (AQM)
IX)   - Explicit Congestion Notification (ECN)
X)    - Accurate ECN (AccECN)
XI)   - Incast Causes and Solutions
XII)  - Data Center Transmission Control Protocol (DCTCP)
XIII) - Incast TCP (ICTCP)
XIV)  - Quantized Congestion Notification (QCN)
XV)   - Recommendations


A Roadmap for Transmission Control Protocol (TCP) Specification Documents [important]:
https://tools.ietf.org/html/rfc7414

A correct and efficient implementation of the Transmission Control Protocol (TCP) is a critical part of the software of most Internet hosts. As TCP has evolved over the years, many distinct documents have become part of the accepted standard for TCP. At the same time, a large number of experimental modifications to TCP have also been published in the RFC series, along with informational notes, case studies, and other advice. As an introduction to newcomers and an attempt to organize the plethora of information for old hands, this document contains a roadmap to the TCP-related RFCs. It provides a brief summary of the RFC documents that define TCP. This should provide guidance to implementers on the relevance and significance of the standards-track extensions, informational notes, and best current practices that relate to TCP.
This roadmap includes a brief description of the contents of each TCP-related RFC [N.B. I only include an excerpt of the summary for those that I consider interesting or important]. In some cases, we simply supply the abstract or a key summary sentence from the text as a terse description.

In addition, a letter code after an RFC number indicates its category in the RFC series (see BCP 9 [RFC2026] for explanation of these categories):

    S - Standards Track (Proposed Standard, Draft Standard, or Internet Standard)
    E - Experimental
    I - Informational
    H - Historic
    B - Best Current Practice
    U - Unknown (not formally defined)

[2.] Core Functionality

A small number of documents compose the core specification of TCP. These define the required core functionalities of TCP's header parsing, state machine, congestion control, and retransmission timeout computation. These base specifications must be correctly followed for interoperability.

RFC 793 S: "Transmission Control Protocol", STD 7 (September 1981) (Errata)

This is the fundamental TCP specification document [RFC793]. Written by Jon Postel as part of the Internet protocol suite's core, it describes the TCP packet format, the TCP state machine and event processing, and TCP's semantics for data transmission, reliability, flow control, multiplexing, and acknowledgment.

RFC 1122 S: "Requirements for Internet Hosts - Communication Layers" (October 1989)

This document [RFC1122] updates and clarifies RFC 793 (see above in Section 2), fixing some specification bugs and oversights. It also explains some features such as keep-alives and Karn's and Jacobson's RTO estimation algorithms [KP87][Jac88][JK92]. ICMP interactions are mentioned, and some tips are given for efficient implementation. RFC 1122 is an Applicability Statement, listing the various features that MUST, SHOULD, MAY, SHOULD NOT, and MUST NOT be present in standards-conforming TCP implementations. Unlike a purely informational roadmap, this Applicability Statement is a standards document and gives formal rules for implementation.

RFC 2460 S: "Internet Protocol, Version 6 (IPv6) Specification" (December 1998) (Errata)

This document [RFC2460] is of relevance to TCP because it defines how the pseudo-header for TCP's checksum computation is derived when 128-bit IPv6 addresses are used instead of 32-bit IPv4 addresses. Additionally, RFC 2675 (see Section 3.1 of this document) describes TCP changes required to support IPv6 jumbograms.

RFC 2873 S: "TCP Processing of the IPv4 Precedence Field" (June 2000) (Errata)

This document [RFC2873] removes from the TCP specification all processing of the precedence bits of the TOS byte of the IP header. This resolves a conflict over the use of these bits between RFC 793 (see above in Section 2) and Differentiated Services [RFC2474].

RFC 5681 S: "TCP Congestion Control" (August 2009)

Although RFC 793 (see above in Section 2) did not contain any congestion control mechanisms, today congestion control is a required component of TCP implementations. This document [RFC5681] defines the congestion avoidance and control mechanisms for TCP, based on Van Jacobson's 1988 SIGCOMM paper [Jac88].

A number of behaviors that together constitute what the community refers to as "Reno TCP" are described in RFC 5681. The name "Reno" comes from the Net/2 release of the 4.3 BSD operating system. This is generally regarded as the least common denominator among TCP flavors currently found running on Internet hosts.
Reno TCP includes the congestion control features of slow start, congestion avoidance, fast retransmit, and fast recovery. RFC 5681 details the currently accepted congestion control mechanism, while RFC 1122 (see above in Section 2) mandates that such a congestion control mechanism must be implemented. RFC 5681 differs slightly from the other documents listed in this section, as it does not affect the ability of two TCP endpoints to communicate; RFCs 2001 and 2581 are the conceptual precursors of RFC 5681. The most important changes relative to RFC 2581 are:

(a) The initial window requirements were changed to allow larger Initial Windows as standardized in [RFC3390] (see Section 3.2 of this document).

(b) During slow start and congestion avoidance, the usage of Appropriate Byte Counting [RFC3465] (see Section 3.2 of this document) is explicitly recommended.

(c) The use of Limited Transmit [RFC3042] (see Section 3.3 of this document) is now recommended.

RFC 6093 S: "On the Implementation of the TCP Urgent Mechanism" (January 2011)

This document [RFC6093] analyzes how current TCP stacks process TCP urgent indications, ... and recommends against the use of the urgent mechanism.

RFC 6298 S: "Computing TCP's Retransmission Timer" (June 2011)

Abstract of RFC 6298 [RFC6298]: "This document defines the standard algorithm that Transmission Control Protocol (TCP) senders are required to use to compute and manage their retransmission timer. It expands on the discussion in Section 4.2.3.1 of RFC 1122 and upgrades the requirement of supporting the algorithm from a SHOULD to a MUST." RFC 6298 updates RFC 2988 by _changing_ the initial RTO from _3s_ to _1s_ [emphasis mine].

RFC 6691 I: "TCP Options and Maximum Segment Size (MSS)" (July 2012)

This document [RFC6691] clarifies what value to use with the TCP Maximum Segment Size (MSS) option when IP and TCP options are in use.

[3.] Strongly Encouraged Enhancements

This section describes recommended TCP modifications that improve performance and security. Section 3.1 represents fundamental changes to the protocol. Sections 3.2 and 3.3 list improvements over the congestion control and loss recovery mechanisms as specified in RFC 5681 (see Section 2). Section 3.4 describes algorithms that allow a TCP sender to detect whether it has entered loss recovery spuriously. Section 3.5 comprises Path MTU Discovery mechanisms. Schemes for TCP/IP header compression are listed in Section 3.6. Finally, Section 3.7 deals with the problem of preventing acceptance of forged segments and flooding attacks.

[3.1.] Fundamental Changes

RFCs 2675 and 7323 represent fundamental changes to TCP by redefining how parts of the basic TCP header and options are interpreted. RFC 7323 defines the Window Scale option, which reinterprets the advertised receive window. RFC 2675 specifies that MSS option and urgent pointer fields with a value of 65,535 are to be treated specially.

RFC 2675 S: "IPv6 Jumbograms" (August 1999) (Errata)

RFC 7323 S: "TCP Extensions for High Performance" (September 2014)

This document [RFC7323] defines TCP extensions for window scaling, timestamps, and protection against wrapped sequence numbers, for efficient and safe operation over paths with large bandwidth-delay products. These extensions are commonly found in currently used systems. The predecessor of this document, RFC 1323, was published in 1992, and is deployed in most TCP implementations. This document includes fixes and clarifications based on the gained deployment experience.
One specific issue addressed in this specification is a recommendation on how to modify the algorithm for estimating the mean RTT when timestamps are used. RFCs 1072, 1185, and 1323 are the conceptual precursors of RFC 7323.

[3.2.] Congestion Control Extensions

Two of the most important aspects of TCP are its congestion control and loss recovery features. TCP treats lost packets as indicating congestion-related loss and cannot distinguish between congestion-related loss and loss due to transmission errors. Even when ECN is in use, there is a rather intimate coupling between congestion control and loss recovery mechanisms. There are several extensions to both features, and more often than not, a particular extension applies to both. In these two subsections, we group enhancements to TCP's congestion control, while the next subsection focuses on TCP's loss recovery.

RFC 3168 S: "The Addition of Explicit Congestion Notification (ECN) to IP" (September 2001)

This document [RFC3168] defines a means for end hosts to detect congestion before congested routers are forced to discard packets. Although congestion notification takes place at the IP level, ECN requires support at the transport level (e.g., in TCP) to echo the bits and adapt the sending rate. This document updates RFC 793 (see Section 2 of this document) to define two previously unused flag bits in the TCP header for ECN support.

RFC 3390 S: "Increasing TCP's Initial Window" (October 2002)

This document [RFC3390] specifies an increase in the permitted initial window for TCP from one segment to three or four segments during the slow start phase, depending on the segment size.

RFC 3465 E: "TCP Congestion Control with Appropriate Byte Counting (ABC)" (February 2003)

This document [RFC3465] suggests that congestion control use the number of bytes acknowledged instead of the number of acknowledgments received. This change improves the performance of TCP in situations where there is no one-to-one relationship between data segments and acknowledgments (e.g., delayed ACKs or ACK loss). ABC is recommended by RFC 5681 (see Section 2).

RFC 6633 S: "Deprecation of ICMP Source Quench Messages" (May 2012)

This document [RFC6633] formally deprecates the use of ICMP Source Quench messages by transport protocols and recommends against the implementation of [RFC1016].

[3.3.] Loss Recovery Extensions

For the typical implementation of the TCP fast recovery algorithm described in RFC 5681 (see Section 2 of this document), a TCP sender only retransmits a segment after a retransmit timeout has occurred, or after three duplicate ACKs have arrived triggering the fast retransmit. A single RTO might result in the retransmission of several segments, while the fast retransmit algorithm in RFC 5681 leads only to a single retransmission. Hence, multiple losses from a single window of data can lead to a performance degradation. Documents listed in this section aim to improve the overall performance of TCP's standard loss recovery algorithms. In particular, some of them allow TCP senders to recover more effectively when multiple segments are lost from a single flight of data.

RFC 2018 S: "TCP Selective Acknowledgment Options" (October 1996) (Errata)

When more than one packet is lost during one RTT, TCP may experience poor performance since a TCP sender can only learn about a single lost packet per RTT from cumulative acknowledgments. This document [RFC2018] defines the basic selective acknowledgment (SACK) mechanism for TCP, which can help to overcome these limitations.
The receiving TCP returns SACK blocks to inform the sender which data has been received. The sender can then retransmit only the missing data segments.

RFC 3042 S: "Enhancing TCP's Loss Recovery Using Limited Transmit" (January 2001)

Abstract of RFC 3042 [RFC3042]: "This document proposes a new Transmission Control Protocol (TCP) mechanism that can be used to more effectively recover lost segments when a connection's congestion window is small, or when a large number of segments are lost in a single transmission window." The algorithm described in RFC 3042 is called "Limited Transmit". Limited Transmit is recommended by RFC 5681 (see Section 2 of this document).

RFC 6582 S: "The NewReno Modification to TCP's Fast Recovery Algorithm" (April 2012)

This document [RFC6582] specifies a modification to the standard Reno fast recovery algorithm, whereby a TCP sender can use partial acknowledgments to make inferences determining the next segment to send in situations where SACK would be helpful but isn't available. Although it is only a slight modification, the NewReno behavior can make a significant difference in performance when multiple segments are lost from a single window of data.

RFC 6675 S: "A Conservative Loss Recovery Algorithm Based on Selective Acknowledgment (SACK) for TCP" (August 2012)

This document [RFC6675] describes a conservative loss recovery algorithm for TCP that is based on the use of the selective acknowledgment (SACK) TCP option [RFC2018] (see above in Section 3.3). The algorithm conforms to the spirit of the congestion control specification in RFC 5681 (see Section 2 of this document), but allows TCP senders to recover more effectively when multiple segments are lost from a single flight of data. RFC 6675 is a revision of RFC 3517 to address several situations that are not handled explicitly before. In particular,

(a) it improves the loss detection in the event that the sender has outstanding segments that are smaller than Sender Maximum Segment Size (SMSS).

(b) it modifies the definition of a "duplicate acknowledgment" to utilize the SACK information in detecting loss.

(c) it maintains the ACK clock under certain circumstances involving loss at the end of the window.

3.4. Detection and Prevention of Spurious Retransmissions

Spurious retransmission timeouts are harmful to TCP performance and multiple algorithms have been defined for detecting when spurious retransmissions have occurred, but they respond differently with regard to their manners of recovering performance. The IETF defined multiple algorithms because there are trade-offs in whether or not certain TCP options need to be implemented and concerns about IPR status. The Standards Track RFCs in this section are closely related to the Experimental RFCs in Section 4.5 also addressing this topic.

RFC 2883 S: "An Extension to the Selective Acknowledgement (SACK) Option for TCP" (July 2000)

This document [RFC2883] extends RFC 2018 (see Section 3.3 of this document). It enables use of the SACK option to acknowledge duplicate packets. With this extension, called DSACK, the sender is able to infer the order of packets received at the receiver and, therefore, to infer when it has unnecessarily retransmitted a packet. A TCP sender could then use this information to detect spurious retransmissions (see [RFC3708]).
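[To make the DSACK inference above concrete, here is a minimal, hypothetical sketch of what a sender can conclude from a DSACK block; the types and helper name are invented for illustration and are not taken from the FreeBSD SACK code. The idea: a SACK block lying entirely at or below the cumulative ACK is a DSACK, and if it overlaps a range the sender recently retransmitted, the original copy evidently arrived, so the retransmission was spurious.]

#include <stdbool.h>
#include <stdint.h>

/* Modular sequence-number comparisons, as used throughout TCP. */
#define SEQ_LT(a, b)    ((int32_t)((a) - (b)) < 0)
#define SEQ_LEQ(a, b)   ((int32_t)((a) - (b)) <= 0)

struct sack_block {
    uint32_t start;     /* first sequence number covered by the block */
    uint32_t end;       /* sequence number after the last one covered */
};

/*
 * Return true if 'sb' is a DSACK block reporting a duplicate of data in
 * [rexmit_start, rexmit_end), i.e. our retransmission was likely spurious.
 */
bool
dsack_indicates_spurious_rexmit(const struct sack_block *sb, uint32_t cum_ack,
    uint32_t rexmit_start, uint32_t rexmit_end)
{
    bool is_dsack = SEQ_LEQ(sb->end, cum_ack);  /* below the cumulative ACK */
    bool overlaps_rexmit = SEQ_LT(sb->start, rexmit_end) &&
        SEQ_LT(rexmit_start, sb->end);

    return (is_dsack && overlaps_rexmit);
}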
RFC 4015 S: "The Eifel Response Algorithm for TCP" (February 2005) Abstract of RFC 4015 [RFC4015]: "Based on an appropriate detection algorithm, the Eifel response algorithm provides a way for a TCP sender to respond to a detected spurious timeout. RFC 5682 S: "Forward RTO-Recovery (F-RTO): An Algorithm for Detecting Spurious Retransmission Timeouts with TCP" (September 2009) The F-RTO detection algorithm [RFC5682], originally described in RFC 4138, provides an option for inferring spurious retransmission timeouts. Unlike some similar detection methods (e.g., RFCs 3522 and 3708, both listed in Section 4.5 of this document), F-RTO does not rely on the use of any TCP options. The basic idea is to send previously unsent data after the first retransmission after a RTO. If the ACKs advance the window, the RTO may be declared spurious. [3.5.] Path MTU Discovery RFC 1191 S: "Path MTU Discovery" (November 1990) RFC 1981 S: "Path MTU Discovery for IP version 6" (August 1996) RFC 4821 S: "Packetization Layer Path MTU Discovery" (March 2007) Abstract of RFC 4821 [RFC4821]: "This document describes a robust method for Path MTU Discovery (PMTUD) that relies on TCP or some other Packetization Layer to probe an Internet path with progressively larger packets. [3.6.] Header Compression Especially in streaming applications, the overhead of TCP/IP headers could correspond to more than 50% of the total amount of data sent. Such large overheads may be tolerable in wired LANs where capacity is often not an issue, but are excessive for WANs and wireless systems where bandwidth is scarce. Header compression schemes for TCP/IP like RObust Header Compression (ROHC) can significantly compress this overhead. It performs well over links with significant error rates and long round-trip times. RFC 1144 S: "Compressing TCP/IP Headers for Low-Speed Serial Links" (February 1990) RFC 6846 S: "RObust Header Compression (ROHC): A Profile for TCP/IP (ROHC-TCP)" (January 2013) 3.7. Defending Spoofing and Flooding Attacks By default, TCP lacks any cryptographic structures to differentiate legitimate segments from those spoofed from malicious hosts. Spoofing valid segments requires correctly guessing a number of fields. The documents in this subsection describe ways to make that guessing harder or to prevent it from being able to affect a connection negatively. RFC 4953 I: "Defending TCP Against Spoofing Attacks" (July 2007) RFC 4987 I: "TCP SYN Flooding Attacks and Common Mitigations" (August 2007) RFC 5925 S: "The TCP Authentication Option" (June 2010) RFC 5926 S: "Cryptographic Algorithms for the TCP Authentication Option (TCP-AO)" (June 2010) RFC 5927 I: "ICMP Attacks against TCP" (July 2010) RFC 5961 S: "Improving TCP's Robustness to Blind In-Window Attacks" (August 2010) RFC 6528 S: "Defending against Sequence Number Attacks" (February 2012) [4.] Experimental Extensions The RFCs in this section are either Experimental and may become Proposed Standards in the future or are Proposed Standards (or Informational), but can be considered experimental due to lack of wide deployment. At least part of the reason that they are still experimental is to gain more wide-scale experience with them before a standards track decision is made. [4.1.] Architectural Guidelines As multiple flows may share the same paths, sections of paths, or other resources, the TCP implementation may benefit from sharing information across TCP connections or other flows. 
Some experimental proposals have been documented and some implementations have included the concepts.

RFC 2140 I: "TCP Control Block Interdependence" (April 1997)

RFC 3124 S: "The Congestion Manager" (June 2001)

This document [RFC3124] is a related proposal to RFC 2140 (see above in Section 4.1). The idea behind the Congestion Manager, moving congestion control outside of individual TCP connections, represents a modification to the core of TCP, which supports sharing information among TCP connections. Although a Proposed Standard, some pieces of the Congestion Manager support architecture have not been specified yet, and it has not achieved use or implementation beyond experimental stacks, so it is not listed among the standard TCP enhancements in this roadmap.

[4.2.] Fundamental Changes

Like the Standards Track documents listed in Section 3.1, there also exist new Experimental RFCs that specify fundamental changes to TCP. At the time of writing, the only example so far is TCP Fast Open that deviates from the standard TCP semantics of [RFC793].

RFC 7413 E: "TCP Fast Open" (December 2014)

This document [RFC7413] describes TCP Fast Open that allows data to be carried in the SYN and SYN-ACK packets and consumed by the receiver during the initial connection handshake.

[4.3.] Congestion Control Extensions

TCP congestion control has been an extremely active research area for many years (see RFC 5783 discussed in Section 7.6 of this document), as it determines the performance of many applications that use TCP. A number of Experimental RFCs address issues with flow start up, overshoot, and steady-state behavior in the basic algorithms of RFC 5681 (see Section 2 of this document). In these subsections, enhancements to TCP's congestion control are listed.

RFC 2861 E: "TCP Congestion Window Validation" (June 2000)
RFC 3540 E: "Robust Explicit Congestion Notification (ECN) Signaling with Nonces" (June 2003)
RFC 3649 E: "HighSpeed TCP for Large Congestion Windows" (December 2003)
RFC 3742 E: "Limited Slow-Start for TCP with Large Congestion Windows" (March 2004)
RFC 4782 E: "Quick-Start for TCP and IP" (January 2007) (Errata)
RFC 5562 E: "Adding Explicit Congestion Notification (ECN) Capability to TCP's SYN/ACK Packets" (June 2009)
RFC 5690 I: "Adding Acknowledgement Congestion Control to TCP" (February 2010)

RFC 6928 E: "Increasing TCP's Initial Window" (April 2013)

This document [RFC6928] proposes to increase the TCP initial window from between 2 and 4 segments, as specified in RFC 3390 (see Section 3.2 of this document), to 10 segments with a fallback to the existing recommendation when performance issues are detected.

[4.4.] Loss Recovery Extensions

RFC 5827 E: "Early Retransmit for TCP and Stream Control Transmission Protocol (SCTP)" (April 2010)

This document [RFC5827] proposes the "Early Retransmit" mechanism for TCP (and SCTP) that can be used to recover lost segments when a connection's congestion window is small. In certain special circumstances, Early Retransmit reduces the number of duplicate acknowledgments required to trigger fast retransmit to recover segment losses without waiting for a lengthy retransmission timeout.
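[As a rough sketch of what "reduces the number of duplicate acknowledgments required" means in practice: the segment-based variant of Early Retransmit lowers the duplicate-ACK threshold when the outstanding window is too small to produce three duplicate ACKs and there is nothing new to send. The helper below is illustrative only - my paraphrase of the RFC 5827 idea, not an excerpt from it or from any stack.]

#include <stdbool.h>

/*
 * Illustrative duplicate-ACK threshold selection in the spirit of Early
 * Retransmit (RFC 5827), segment-based variant.  'oseg' is the number of
 * outstanding (unacknowledged) segments.
 */
int
dupack_threshold(int oseg, bool new_data_available)
{
    /*
     * With fewer than four segments outstanding and nothing new to send
     * (so Limited Transmit cannot generate the missing duplicate ACKs),
     * lower the threshold to oseg - 1, but never below 1.
     */
    if (oseg < 4 && !new_data_available)
        return (oseg > 2 ? oseg - 1 : 1);

    return (3);     /* classic fast retransmit threshold */
}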
RFC 6069 E: "Making TCP More Robust to Long Connectivity Disruptions (TCP-LCD)" (December 2010)

RFC 6937 E: "Proportional Rate Reduction for TCP" (May 2013)

This document [RFC6937] describes an experimental Proportional Rate Reduction (PRR) algorithm as an alternative to the widely deployed Fast Recovery algorithm, to improve the accuracy of the amount of data sent by TCP during loss recovery.

[4.5.] Detection and Prevention of Spurious Retransmissions

In addition to the Standards Track extensions to deal with spurious retransmissions in Section 3.4, Experimental proposals have also been documented.

RFC 3522 E: "The Eifel Detection Algorithm for TCP" (April 2003)
RFC 3708 E: "Using TCP Duplicate Selective Acknowledgement (DSACKs) and Stream Control Transmission Protocol (SCTP) Duplicate Transmission Sequence Numbers (TSNs) to Detect Spurious Retransmissions" (February 2004)
RFC 4653 E: "Improving the Robustness of TCP to Non-Congestion Events" (August 2006)

[4.6.] TCP Timeouts

RFC 5482 S: "TCP User Timeout Option" (March 2009)

[4.7.] Multipath TCP

MultiPath TCP (MPTCP) is an ongoing effort within the IETF that allows a TCP connection to simultaneously use multiple IP addresses / interfaces to spread its data across several subflows, while presenting a regular TCP interface to applications. Benefits of this include better resource utilization, better throughput and smoother reaction to failures. The documents listed in this section specify the Multipath TCP scheme, while the documents in Sections 7.2, 7.4, and 7.5 provide some additional background information.

RFC 6356 E: "Coupled Congestion Control for Multipath Transport Protocols" (October 2011)
RFC 6824 E: "TCP Extensions for Multipath Operation with Multiple Addresses" (January 2013) (Errata)

[5.] TCP Parameters at IANA

RFC 2780 B: "IANA Allocation Guidelines For Values In the Internet Protocol and Related Headers" (March 2000)
RFC 4727 S: "Experimental Values in IPv4, IPv6, ICMPv4, ICMPv6, UDP, and TCP Headers" (November 2006)
RFC 6335 B: "Internet Assigned Numbers Authority (IANA) Procedures for the Management of the Service Name and Transport Protocol Port Number Registry" (August 2011)
RFC 6994 S: "Shared Use of Experimental TCP Options" (August 2013)

[7.] Support Documents

This section contains several classes of documents that do not necessarily define current protocol behaviors but that are nevertheless of interest to TCP implementers. Section 7.1 describes several foundational RFCs that give modern readers a better understanding of the principles underlying TCP's behaviors and development over the years. Section 7.2 contains architectural guidelines and principles for TCP architects and designers. The documents listed in Section 7.3 provide advice on using TCP in various types of network situations that pose challenges above those of typical wired links. Guidance for developing, analyzing, and evaluating TCP is given in Section 7.4. Some implementation notes and implementation advice can be found in Section 7.5. RFCs that describe tools for testing and debugging TCP implementations or that contain high-level tutorials on the protocol are listed in Section 7.6. The TCP Management Information Bases are described in Section 7.7, and Section 7.8 lists a number of case studies that have explored TCP performance.

7.4. Guidance for Developing, Analyzing, and Evaluating TCP

Documents in this section give general guidance for developing, analyzing, and evaluating TCP.
Some of the documents discuss, for example, the properties of congestion control protocols that are "safe" for Internet deployment as well as how to measure the properties of congestion control mechanisms and transport protocols.

RFC 5033 B: "Specifying New Congestion Control Algorithms" (August 2007)

This document [RFC5033] considers the evaluation of suggested congestion control algorithms that differ from the principles outlined in RFC 2914 (see Section 7.2 of this document). It is useful for authors of such algorithms as well as for IETF members reviewing the associated documents.

RFC 5166 I: "Metrics for the Evaluation of Congestion Control Mechanisms" (March 2008)

This document [RFC5166] discusses metrics that need to be considered when evaluating new or modified congestion control mechanisms for the Internet. Among other topics, the document discusses throughput, delay, loss rates, response times, fairness, and robustness for challenging environments.

RFC 6077 I: "Open Research Issues in Internet Congestion Control" (February 2011)

This document [RFC6077] summarizes the main open problems in the domain of Internet congestion control. As a good starting point for newcomers, the document describes several new challenges that are becoming important as the network grows, as well as some issues that have been known for many years.

RFC 6181 I: "Threat Analysis for TCP Extensions for Multipath Operation with Multiple Addresses" (March 2011)

This document [RFC6181] describes a threat analysis for Multipath TCP (MPTCP) (see Section 4.7 of this document). The document discusses several types of attacks and provides recommendations for MPTCP designers on how to create an MPTCP specification that is as secure as the current (single-path) TCP.

RFC 6349 I: "Framework for TCP Throughput Testing" (August 2011)

From the Abstract of RFC 6349 [RFC6349]: "This framework describes a practical methodology for measuring end-to-end TCP Throughput in a managed IP network. The goal is to provide a better indication in regard to user experience. In this framework, TCP and IP parameters are specified to optimize TCP Throughput."

7.5. Implementation Advice

RFC 794 U: "PRE-EMPTION" (September 1981)

This document [RFC794] clarifies that operating systems need to manage their limited resources, which may include TCP connection state, and that these decisions can be made with application input, but they do not need to be part of the TCP protocol specification itself.

RFC 879 U: "The TCP Maximum Segment Size and Related Topics" (November 1983)
RFC 1071 U: "Computing the Internet Checksum" (September 1988) (Errata)
RFC 1624 I: "Computation of the Internet Checksum via Incremental Update" (May 1994)
RFC 1936 I: "Implementing the Internet Checksum in Hardware" (April 1996)
RFC 2525 I: "Known TCP Implementation Problems" (March 1999)
RFC 2923 I: "TCP Problems with Path MTU Discovery" (September 2000)
RFC 3493 I: "Basic Socket Interface Extensions for IPv6" (February 2003)
RFC 6056 B: "Recommendations for Transport-Protocol Port Randomization" (December 2010)
RFC 6191 B: "Reducing the TIME-WAIT State Using TCP Timestamps" (April 2011)
RFC 6429 I: "TCP Sender Clarification for Persist Condition" (December 2011)
RFC 6897 I: "Multipath TCP (MPTCP) Application Interface Considerations" (March 2013)

7.6. Tools and Tutorials

RFC 1180 I: "TCP/IP Tutorial" (January 1991) (Errata)

This document [RFC1180] is an extremely brief overview of the TCP/IP protocol suite as a whole.
It gives some explanation as to how and where TCP fits in. RFC 1470 I: "FYI on a Network Management Tool Catalog: Tools for Monitoring and Debugging TCP/IP Internets and Interconnected Devices" (June 1993) A few of the tools that this document [RFC1470] describes are still maintained and in use today, for example, ttcp and tcpdump. However, many of the tools described do not relate specifically to TCP and are no longer used or easily available. RFC 2398 I: "Some Testing Tools for TCP Implementors" (August 1998) This document [RFC2398] describes a number of TCP packet generation and analysis tools. Although some of these tools are no longer readily available or widely used, for the most part they are still relevant and usable. RFC 5783 I: "Congestion Control in the RFC Series" (February 2010) This document [RFC5783] provides an overview of RFCs related to congestion control that had been published at the time. The focus of the document is on end-host-based congestion control. 8. Undocumented TCP Features There are a few important implementation tactics for the TCP that have not yet been described in any RFC. Although this roadmap is primarily concerned with mapping the TCP RFCs, this section is included because an implementer needs to be aware of these important issues. Header Prediction Header prediction is a trick to speed up the processing of segments. Van Jacobson and Mike Karels developed the technique in the late 1980s. The basic idea is that some processing time can be saved when most of a segment's fields can be predicted from previous segments. A good description of this was sent to the TCP-IP mailing list by Van Jacobson on March 9, 1988 (see [Jacobson] for the full message): Quite a bit of the speedup comes from an algorithm that we ('we' refers to collaborator Mike Karels and myself) are calling "header prediction". The idea is that if you're in the middle of a bulk data transfer and have just seen a packet, you know what the next packet is going to look like: It will look just like the current packet with either the sequence number or ack number updated (depending on whether you're the sender or receiver). Combining this with the "Use hints" epigram from Butler Lampson's classic "Epigrams for System Designers", you start to think of the tcp state (rcv.nxt, snd.una, etc.) as "hints" about what the next packet should look like. If you arrange those "hints" so they match the layout of a tcp packet header, it takes a single 14-byte compare to see if your prediction is correct (3 longword compares to pick up the send & ack sequence numbers, header length, flags and window, plus a short compare on the length). If the prediction is correct, there's a single test on the length to see if you're the sender or receiver followed by the appropriate processing. E.g., if the length is non-zero (you're the receiver), checksum and append the data to the socket buffer then wake any process that's sleeping on the buffer. Update rcv.nxt by the length of this packet (this updates your "prediction" of the next packet). Check if you can handle another packet the same size as the current one. If not, set one of the unused flag bits in your header prediction to guarantee that the prediction will fail on the next packet and force you to go through full protocol processing. Otherwise, you're done with this packet. So, the *total* tcp protocol processing, exclusive of checksumming, is on the order of 6 compares and an add. 
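[To make the quoted description a bit more concrete, here is a simplified sketch of the fast-path check, loosely in the style of a BSD tcp_input(). The structure and field names are illustrative simplifications, not the actual FreeBSD code; the real implementation packs the "hints" so a single wide compare does most of this work, as Jacobson describes.]

#include <stdint.h>

#define TH_PUSH 0x08
#define TH_ACK  0x10

#define SEQ_GT(a, b)    ((int32_t)((a) - (b)) > 0)
#define SEQ_LEQ(a, b)   ((int32_t)((a) - (b)) <= 0)

struct tcb_hints {
    uint32_t rcv_nxt;   /* next sequence number expected */
    uint32_t snd_una;   /* oldest unacknowledged byte */
    uint32_t snd_nxt;   /* next byte to send */
    uint32_t snd_max;   /* highest byte ever sent */
    uint32_t snd_wnd;   /* last advertised window we saw */
};

/*
 * Returns nonzero when the incoming segment matches the prediction: only
 * ACK (and possibly PUSH) set, exactly the expected sequence number, an
 * unchanged window, no retransmission in progress, and the segment is
 * either a pure ACK for new data (we are the sender) or pure in-sequence
 * data carrying no new ACK information (we are the receiver).
 */
int
header_predicted(const struct tcb_hints *tp, uint32_t seq, uint32_t ack,
    uint8_t flags, uint32_t wnd, int len)
{
    if ((flags & ~(TH_ACK | TH_PUSH)) != 0 || !(flags & TH_ACK))
        return (0);
    if (seq != tp->rcv_nxt || wnd != tp->snd_wnd)
        return (0);
    if (tp->snd_nxt != tp->snd_max)     /* retransmission outstanding */
        return (0);

    if (len == 0)   /* pure ACK: must newly acknowledge data we sent */
        return (SEQ_GT(ack, tp->snd_una) && SEQ_LEQ(ack, tp->snd_max));

    /* pure data: the ACK field must carry nothing new */
    return (ack == tp->snd_una);
}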
Forward Acknowledgement (FACK)

FACK [MM96] includes an alternate algorithm for triggering fast retransmit [RFC5681], based on the extent of the SACK scoreboard. Its goal is to trigger fast retransmit as soon as the receiver's reassembly queue is larger than the duplicate ACK threshold, as indicated by the difference between the forward-most SACK block edge and SND.UNA. This algorithm quickly and reliably triggers fast retransmit in the presence of burst losses -- often on the first SACK following such a loss. Such a threshold-based algorithm also triggers fast retransmit immediately in the presence of any reordering with extent greater than the duplicate ACK threshold. FACK is implemented in Linux and turned on by default.
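[A sketch of the trigger condition just described, with illustrative names - snd_fack standing for the forward-most SACKed sequence number; this is not the Linux or FreeBSD code.]

#include <stdint.h>

#define DUPACK_THRESH   3

/*
 * FACK-style fast retransmit trigger: retransmit as soon as the SACK
 * scoreboard shows that more than dupthresh segments beyond SND.UNA
 * have already reached the receiver.
 */
int
fack_fast_retransmit_due(uint32_t snd_fack, uint32_t snd_una, uint32_t smss)
{
    return ((snd_fack - snd_una) > DUPACK_THRESH * smss);
}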
Congestion Control for High Rate Flows

In the last decade significant research effort has been put into experimental TCP congestion control modifications for obtaining high throughput with reduced startup and recovery times. Only a few RFCs have been published on some of these modifications, including HighSpeed TCP [RFC3649], Limited Slow-Start [RFC3742], and Quick-Start [RFC4782] (see Section 4.3 of this document for more information on each), but high-rate congestion control mechanisms are still considered an open issue in congestion control research. Some other schemes have been published as Internet-Drafts, e.g. CUBIC [CUBIC] (the standard TCP congestion control algorithm in Linux), Compound TCP [CTCP], and H-TCP [HTCP], or have been discussed a little by the IETF, but much of the work in this area has not been adopted within the IETF yet, so the majority of this work is outside the RFC series and may be discussed in other products of the IRTF Internet Congestion Control Research Group (ICCRG).


Metrics for the Evaluation of Congestion Control Mechanisms
https://tools.ietf.org/html/rfc5166

Discusses the metrics to be considered in an evaluation of new or modified congestion control mechanisms for the Internet. These include metrics for the evaluation of new transport protocols, of proposed modifications to TCP, of application-level congestion control, and of Active Queue Management (AQM) mechanisms in the router. This document is the first in a series of documents aimed at improving the models that we use in the evaluation of transport protocols.

Types Of Metrics:

- Throughput, Delay, and Loss Rates

  - Throughput: can be measured as
    - a router-based metric of aggregate link utilization
    - a flow-based metric of per-connection transfer times
    - a user-based metric of utility functions or user wait times

  - Goodput: sometimes distinguished from throughput, where throughput is the link utilization or flow rate in bytes per second; goodput is the subset of throughput (also measured in bytes/s) consisting of useful traffic [i.e. excluding duplicate packets].

  - Delay: Like throughput, delay can be measured as a router-based metric of queueing delay over time, or as a flow-based metric in terms of per-packet transfer times. Per-packet delay can also include delay at the sender waiting for the transport protocol to send the packet. For reliable transfer, the per-packet transfer time seen by the application includes the possible delay of retransmitting a lost packet.

  - Packet Loss Rates: can be measured as a network-based or as a flow-based metric. One network-related reason to avoid high steady-state packet loss rates is to avoid congestion collapse in environments containing paths with multiple congested links.

- Response Times and Minimizing Oscillations

  - Response to Changes: One of the key concerns in the design of congestion control mechanisms has been the response times to sudden congestion in the network. On the one hand, congestion control mechanisms should respond reasonably promptly to sudden congestion from routing or bandwidth changes or from a burst of competing traffic. At the same time, congestion control mechanisms should not respond too severely to transient changes, e.g., to a sudden increase in delay that will dissipate in less than the connection's round-trip time.

  - Minimizing Oscillations: One goal is that of stability, in terms of minimizing oscillations of queueing delay or of throughput. In practice, stability is frequently associated with rate fluctuations or variance. Rate variations can result in fluctuations in router queue size and therefore in queue overflows. These queue overflows can cause loss synchronizations across coexisting flows and periodic under-utilization of link capacity, both of which are considered to be general signs of network instability. Thus, measuring the rate variations of flows is often used to measure the stability of transport protocols. To measure rate variations, [JWL04], [RX05], and [FHPW00] use the coefficient of variation (CoV) of per-flow transmission rates, and [WCL05] suggests the use of standard deviations of per-flow rates. Since rate variations are a function of time scales, it makes sense to measure these rate variations over various time scales.

- Fairness and Convergence

  - Fairness between Flows: let x_i be the throughput for the i-th connection.

    - Jain's fairness index: The fairness index in [JCH84] is:

        (( sum_i x_i )^2) / (n * sum_i ( (x_i)^2 )),

      where there are n users. This fairness index ranges from 0 to 1, and it is maximum when all users receive the same allocation. This index is k/n when k users equally share the resource, and the other n-k users receive zero allocation.

    - The product measure:

        product_i x_i

      the product of the throughput of the individual connections, is also used as a measure of fairness. (In some contexts x_i is taken as the power of the i-th connection, and the product measure is referred to as network power.) The product measure is particularly sensitive to segregation; the product measure is zero if any connection receives zero throughput. [N.B. If one normalizes to actual bandwidth by taking the Nth root of the product, where N = number of connections, this is the geometric mean. The geometric mean will be less than the arithmetic mean unless all flows have equivalent throughput.]

    - Epsilon-fairness: A rate allocation is defined as epsilon-fair if

        (min_i x_i) / (max_i x_i) >= 1 - epsilon

      Epsilon-fairness measures the worst-case ratio between any two throughput rates [ZKL04]. Epsilon-fairness is related to max-min fairness.
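[To make the three single-resource fairness definitions above concrete, here are toy helpers that compute them over an array of per-flow throughputs (n > 0 assumed). These are purely illustrative and not part of any measurement tool.]

#include <math.h>
#include <stddef.h>

/* Jain's fairness index: (sum x_i)^2 / (n * sum x_i^2); 1.0 == perfectly fair. */
double
jain_index(const double *x, size_t n)
{
    double sum = 0.0, sumsq = 0.0;

    for (size_t i = 0; i < n; i++) {
        sum += x[i];
        sumsq += x[i] * x[i];
    }
    return ((sum * sum) / ((double)n * sumsq));
}

/* Nth root of the product measure, i.e. the geometric mean rate. */
double
geometric_mean_rate(const double *x, size_t n)
{
    double logsum = 0.0;

    for (size_t i = 0; i < n; i++)
        logsum += log(x[i]);    /* tends to -inf (mean 0) if any flow is starved */
    return (exp(logsum / (double)n));
}

/* Smallest epsilon for which the allocation is epsilon-fair: 1 - min/max. */
double
epsilon_unfairness(const double *x, size_t n)
{
    double min = x[0], max = x[0];

    for (size_t i = 1; i < n; i++) {
        if (x[i] < min)
            min = x[i];
        if (x[i] > max)
            max = x[i];
    }
    return (1.0 - min / max);
}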
  - Fairness between Flows with Different Resource Requirements

    - Max-min fairness: In order to satisfy the max-min fairness criteria, the smallest throughput rate must be as large as possible. Given this condition, the next-smallest throughput rate must be as large as possible, and so on. Thus, the max-min fairness gives absolute priority to the smallest flows. (Max-min fairness can be explained by the progressive filling algorithm, where all flow rates start at zero, and the rates all grow at the same pace. Each flow rate stops growing only when one or more links on the path reach link capacity.)

    - Proportional fairness: A feasible allocation, x, is defined as proportionally fair if, for any other feasible allocation x*, the aggregate of proportional changes is zero or negative:

        sum_i ( (x*_i - x_i)/x_i ) <= 0.

      "This criterion favours smaller flows, but less emphatically than max-min fairness" [K01]. (Using the language of utility functions, proportional fairness can be achieved by using logarithmic utility functions, and maximizing the sum of the per-flow utility functions; see [KMT98] for a fuller explanation.)

    - Minimum potential delay fairness: Minimum potential delay fairness has been shown to model TCP [KS03], and is a compromise between max-min fairness and proportional fairness. An allocation, x, is defined as having minimum potential delay fairness if

        sum_i (1/x_i)

      is smaller than for any other feasible allocation. That is, it would minimize the average download time if each flow was an equal-sized file.

  - Comments on Fairness

    - Trade-offs between fairness and throughput: The fairness measures in the section above generally measure both fairness and throughput, giving different weights to each. Potential trade-offs between fairness and throughput are also discussed by Tang, et al. in [TWL06], for a framework where max-min fairness is defined as the most fair. In particular, [TWL06] shows that in some topologies, throughput is proportional to fairness, while in other topologies, throughput is inversely proportional to fairness.

    - Fairness and the number of congested links: Some of these fairness metrics are discussed in more detail in [F91]. We note that there is not a clear consensus for the fairness goals, in particular for fairness between flows that traverse different numbers of congested links [F91]. Utility maximization provides one framework for describing this trade-off in fairness.

    - Fairness and round-trip times: One goal cited in a number of new transport protocols has been that of fairness between flows with different round-trip times [KHR02] [XHR04]. We note that there is not a consensus in the networking community about the desirability of this goal, or about the implications and interactions between this goal and other metrics [FJ92] (Section 3.3). One common argument against the goal of fairness between flows with different round-trip times has been that flows with long round-trip times consume more resources; this aspect is covered by the previous paragraph. Researchers have also noted the difference between the RTT-unfairness of standard TCP, and the greater RTT-unfairness of some proposed modifications to TCP [LLS05].

    - Fairness and packet size: One fairness issue is that of the relative fairness for flows with different packet sizes. Many file transfer applications will use the maximum packet size possible; in contrast, low-bandwidth VoIP flows are likely to send small packets, sending a new packet every 10 to 40 ms, to limit delay. Should a small-packet VoIP connection receive the same sending rate in *bytes* per second as a large-packet TCP connection in the same environment, or should it receive the same sending rate in *packets* per second?
      This fairness issue has been discussed in more detail in [RFC3714], with [RFC4828] also describing the ways that packet size can affect the packet drop rate experienced by a flow.

  - Convergence times: Convergence times concern the time for convergence to fairness between an existing flow and a newly starting one, and are a special concern for environments with high-bandwidth long-delay flows. Convergence times also concern the time for convergence to fairness after a sudden change such as a change in the network path, the competing cross-traffic, or the characteristics of a wireless link. As with fairness, convergence times can matter both between flows of the same protocol, and between flows using different protocols [SLFK03]. One metric used for convergence times is the delta-fair convergence time, defined as the time taken for two flows with the same round-trip time to go from shares of 100/101-th and 1/101-th of the link bandwidth, to having close to fair sharing with shares of (1+delta)/2 and (1-delta)/2 of the link bandwidth [BBFS01]. A similar metric for convergence times measures the convergence time as the number of round-trip times for two flows to reach epsilon-fairness, when starting from a maximally-unfair state [ZKL04].


TCP Congestion Control (RFC 5681):
http://www.rfc-editor.org/rfc/rfc5681.txt

Specifies four TCP congestion control algorithms: slow start, congestion avoidance, fast retransmit and fast recovery. They were devised in [Jac88] and [Jac90]. Their use with TCP is standardized in [RFC1122]. In addition, the document specifies what TCP connections should do after a relatively long idle period, as well as clarifying some of the issues pertaining to TCP ACK generation. Obsoletes [RFC2581], which in turn obsoleted [RFC2001].

The slow start and congestion avoidance algorithms MUST be used by the TCP sender to control the amount of outstanding data being injected into the network. These algorithms add three state variables:

- Congestion Window (cwnd): a sender-side limit on the amount of data the sender can transmit before receiving an ACK.
- Receiver's Advertised Window (rwnd): a receiver-side limit on the amount of outstanding data.
- Slow Start Threshold (ssthresh): used to determine whether the slow start or congestion avoidance algorithm is used to control data transmission.

Slow Start: Used to determine available link capacity at the beginning of a transfer, after repairing loss detected by the retransmission timer, or [potentially] after a long idle period. It is additionally used to start the "ACK clock".

- SMSS: Sender Maximum Segment Size
- IW: Initial Window, the initial value of cwnd, MUST be set using the following guidelines as an upper bound:

    If SMSS > 2190 bytes:
        IW = 2 * SMSS bytes and MUST NOT be more than 2 segments
    If (SMSS > 1095 bytes) and (SMSS <= 2190 bytes):
        IW = 3 * SMSS bytes and MUST NOT be more than 3 segments
    If SMSS <= 1095 bytes:
        IW = 4 * SMSS bytes and MUST NOT be more than 4 segments

- Ssthresh:
  - SHOULD be set arbitrarily high (e.g., to the size of the largest possible advertised window), but ssthresh MUST be reduced in response to congestion.
  - The slow start algorithm is used when cwnd < ssthresh, while the congestion avoidance algorithm is used when cwnd > ssthresh. When cwnd and ssthresh are equal, the sender may use either slow start or congestion avoidance.
- When a TCP sender detects segment loss using the retransmission timer and the given segment has not yet been resent once by way of the retransmission timer, the value of ssthresh MUST be set to no more than the value given in equation (4):

    ssthresh = max (FlightSize / 2, 2*SMSS)            (4)

  where FlightSize is the amount of outstanding data in the network.

- Growing cwnd: During slow start, a TCP increments cwnd by at most SMSS bytes for each ACK received that cumulatively acknowledges new data. Slow start ends when cwnd reaches or exceeds ssthresh.
  - While traditionally TCP implementations have increased cwnd by precisely SMSS bytes upon receipt of an ACK covering new data, we RECOMMEND that TCP implementations increase cwnd, per:

        cwnd += min (N, SMSS)                          (2)

    where N is the number of previously unacknowledged bytes acknowledged in the incoming ACK.

Congestion Avoidance: during congestion avoidance, cwnd is incremented by roughly 1 full-sized segment per RTT. Congestion avoidance continues until congestion is detected. The basic guidelines for incrementing cwnd are:

- MAY increment cwnd by SMSS bytes
- SHOULD increment cwnd per equation (2) once per RTT
- MUST NOT increment cwnd by more than SMSS bytes

[RFC3465] allows for cwnd increases of more than SMSS bytes for incoming acknowledgments during slow start on an experimental basis; however, such behavior is not allowed as part of the standard.

Another common formula that a TCP MAY use to update cwnd during congestion avoidance is given in equation (3):

    cwnd += SMSS*SMSS/cwnd                             (3)

This adjustment is executed on every incoming ACK that acknowledges new data. Equation (3) provides an acceptable approximation to the underlying principle of increasing cwnd by 1 full-sized segment per RTT.

Upon a timeout (as specified in [RFC2988]) cwnd MUST be set to no more than the loss window, LW, which equals 1 full-sized segment (regardless of the value of IW). Therefore, after retransmitting the dropped segment the TCP sender uses the slow start algorithm to increase the window from 1 full-sized segment to the new value of ssthresh, at which point congestion avoidance again takes over.
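[A minimal sketch of how equations (2) and (3) fit together on the ACK path. This is a simplification - no ABC limit beyond eq. (2), no SACK or recovery interaction - with illustrative names, not the FreeBSD NewReno module.]

#include <stdint.h>

/*
 * Apply the RFC 5681 window-growth rules for one incoming ACK that newly
 * acknowledges 'acked' bytes.
 */
void
cwnd_ack_received(uint32_t *cwnd, uint32_t ssthresh, uint32_t acked,
    uint32_t smss)
{
    if (*cwnd < ssthresh) {
        /* Slow start, eq. (2): grow by min(N, SMSS) per ACK. */
        *cwnd += (acked < smss) ? acked : smss;
    } else {
        /* Congestion avoidance, eq. (3): roughly one SMSS per RTT. */
        uint32_t incr = (uint32_t)(((uint64_t)smss * smss) / *cwnd);

        *cwnd += (incr > 0) ? incr : 1;
    }
}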
Fast Retransmit/Fast Recovery: A TCP receiver SHOULD send an immediate duplicate ACK when an out-of-order segment arrives. The purpose of this ACK is to inform the sender that a segment was received out-of-order and which sequence number is expected. In addition, a TCP receiver SHOULD send an immediate ACK when the incoming segment fills in all or part of a gap in the sequence space. This will generate more timely information for a sender recovering from a loss through a retransmission timeout, a fast retransmit, or an advanced loss recovery algorithm.

The TCP sender SHOULD use the "fast retransmit" algorithm to detect and repair loss, based on incoming duplicate ACKs. The fast retransmit algorithm uses the arrival of 3 duplicate ACKs as an indication that a segment has been lost. TCP then performs a retransmission of what appears to be the missing segment, without waiting for the retransmission timer to expire.

The fast retransmit and fast recovery algorithms are implemented together as follows.

1. On the first and second duplicate ACKs received at a sender, a TCP SHOULD send a segment of previously unsent data per [RFC3042] provided that the receiver's advertised window allows, the total FlightSize would remain less than or equal to cwnd plus 2*SMSS, and that new data is available for transmission. Further, the TCP sender MUST NOT change cwnd to reflect these two segments [RFC3042]. Note that a sender using SACK [RFC2018] MUST NOT send new data unless the incoming duplicate acknowledgment contains new SACK information.

2. When the third duplicate ACK is received, a TCP MUST set ssthresh to no more than the value given in equation (4). When [RFC3042] is in use, additional data sent in limited transmit MUST NOT be included in this calculation.

3. The lost segment starting at SND.UNA MUST be retransmitted and cwnd set to ssthresh plus 3*SMSS. This artificially "inflates" the congestion window by the number of segments (three) that have left the network and which the receiver has buffered.

4. For each additional duplicate ACK received (after the third), cwnd MUST be incremented by SMSS. This artificially inflates the congestion window in order to reflect the additional segment that has left the network.

   Note: [SCWA99] discusses a receiver-based attack whereby many bogus duplicate ACKs are sent to the data sender in order to artificially inflate cwnd and cause a higher than appropriate sending rate to be used. A TCP MAY therefore limit the number of times cwnd is artificially inflated during loss recovery to the number of outstanding segments (or, an approximation thereof).

   Note: When an advanced loss recovery mechanism (such as outlined in section 4.3) is not in use, this increase in FlightSize can cause equation (4) to slightly inflate cwnd and ssthresh, as some of the segments between SND.UNA and SND.NXT are assumed to have left the network but are still reflected in FlightSize.

5. When previously unsent data is available and the new value of cwnd and the receiver's advertised window allow, a TCP SHOULD send 1*SMSS bytes of previously unsent data.

6. When the next ACK arrives that acknowledges previously unacknowledged data, a TCP MUST set cwnd to ssthresh (the value set in step 2). This is termed "deflating" the window. This ACK should be the acknowledgment elicited by the retransmission from step 3, one RTT after the retransmission (though it may arrive sooner in the presence of significant out-of-order delivery of data segments at the receiver). Additionally, this ACK should acknowledge all the intermediate segments sent between the lost segment and the receipt of the third duplicate ACK, if none of these were lost.

Note: This algorithm is known to generally not recover efficiently from multiple losses in a single flight of packets.
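[The numbered steps above map onto a small amount of per-connection state. The sketch below covers steps 2-4 and 6 only; limited transmit (step 1) and SACK are omitted, and the state layout and helper name are illustrative rather than the FreeBSD implementation.]

#include <stdint.h>

struct cc_state {
    uint32_t cwnd, ssthresh, smss;
    uint32_t flight_size;   /* outstanding data, in bytes */
    uint32_t snd_una;
    int      dupacks;
};

/* Assumed helper: queue a retransmission of the segment starting at 'seq'. */
void retransmit_segment(struct cc_state *s, uint32_t seq);

void
cc_duplicate_ack(struct cc_state *s)
{
    s->dupacks++;
    if (s->dupacks == 3) {
        /* Step 2: ssthresh = max(FlightSize / 2, 2*SMSS), eq. (4). */
        uint32_t half = s->flight_size / 2;

        s->ssthresh = (half > 2 * s->smss) ? half : 2 * s->smss;

        /* Step 3: retransmit SND.UNA and inflate cwnd by 3 segments. */
        retransmit_segment(s, s->snd_una);
        s->cwnd = s->ssthresh + 3 * s->smss;
    } else if (s->dupacks > 3) {
        /* Step 4: one more segment has left the network. */
        s->cwnd += s->smss;
    }
}

void
cc_new_ack(struct cc_state *s)
{
    if (s->dupacks >= 3)
        s->cwnd = s->ssthresh;  /* Step 6: deflate the window. */
    s->dupacks = 0;
}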
RTO:
https://tools.ietf.org/html/rfc6298

Does not modify the behaviour in RFC 5681. The RTO is a function of two state variables, SRTT and RTTVAR. The following constants are used for calculations:

    G <- clock granularity in seconds
    K <- 4

[(2.1)] Until a round-trip time (RTT) measurement has been made for a segment sent between the sender and the receiver, the sender SHOULD set RTO <- 1 second [i.e. not the outdated 3s currently in FreeBSD] - the "backing off" on repeated retransmission still applies.

[(2.2)] When the first RTT measurement R is made, the host MUST set

    SRTT <- R
    RTTVAR <- R/2
    RTO <- SRTT + max (G, K*RTTVAR)

[(2.3)] When a subsequent RTT measurement R' is made, a host must set

    RTTVAR <- (1 - beta)*RTTVAR + beta * |SRTT - R'|
    SRTT <- (1 - alpha)*SRTT + alpha*R'

The value of SRTT used in updating RTTVAR is the one prior to the update in the second assignment - i.e. the updates are done RTTVAR first, then SRTT. The above calculation SHOULD be done with alpha = 1/8 and beta = 1/4 (as suggested in [JK88]). [N.B. Should these values be smaller in the data center so that the SRTT maintains a longer memory and isn't compromised by a transient microburst?]

[(2.4)] Whenever RTO is computed, if it is less than 1 second, then the RTO SHOULD be rounded up to 1 second. [See the incast section for why this is unequivocally wrong in the data center.]

Traditionally, TCP implementations use coarse grain clocks to measure the RTT and trigger the RTO, which imposes a large minimum value on the RTO. Research suggests that a large minimum RTO is needed to keep TCP conservative and avoid spurious retransmissions [AP99]. Therefore, this specification requires a large minimum RTO as a conservative approach, while at the same time acknowledging that at some future point, research may show that a smaller minimum RTO is acceptable or superior. [Vasudevan09 (incast section) clearly shows this to be the case.]

Note that a TCP implementation MAY clear SRTT and RTTVAR after backing off the timer multiple times as it is likely that the current SRTT and RTTVAR are bogus in this situation. Once SRTT and RTTVAR are cleared, they should be initialized with the next RTT sample taken per (2.2) rather than using (2.3).

[(7)] Changes from RFC 2988

This document reduces the initial RTO from the previous 3 seconds [PA00] to 1 second, unless the SYN or the ACK of the SYN is lost, in which case the default RTO is reverted to 3 seconds before data transmission begins.
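[Pulling rules (2.2)-(2.4) together, an estimator ends up looking roughly like the sketch below (floating point for clarity; a kernel would use fixed point). The 1-second floor from (2.4) is kept here even though, as noted above and in the incast discussion, it is far too coarse for data-center RTTs. Names are illustrative.]

#include <math.h>

#define RTO_K       4.0
#define RTO_ALPHA   (1.0 / 8.0)
#define RTO_BETA    (1.0 / 4.0)
#define RTO_MIN     1.0     /* seconds, per (2.4) */

struct rtt_estimator {
    double srtt;
    double rttvar;
    double rto;
    int    have_sample;
};

void
rtt_sample(struct rtt_estimator *e, double r, double granularity)
{
    if (!e->have_sample) {
        /* (2.2): the first measurement seeds the estimator. */
        e->srtt = r;
        e->rttvar = r / 2.0;
        e->have_sample = 1;
    } else {
        /* (2.3): update RTTVAR using the old SRTT, then update SRTT. */
        e->rttvar = (1.0 - RTO_BETA) * e->rttvar +
            RTO_BETA * fabs(e->srtt - r);
        e->srtt = (1.0 - RTO_ALPHA) * e->srtt + RTO_ALPHA * r;
    }
    e->rto = e->srtt + fmax(granularity, RTO_K * e->rttvar);
    if (e->rto < RTO_MIN)   /* (2.4): questionable in the data center */
        e->rto = RTO_MIN;
}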
Increasing TCP's initial window:
http://www.rfc-editor.org/rfc/rfc3390.txt
http://www.rfc-editor.org/rfc/rfc6928.txt

Proposes an experiment to increase the permitted TCP initial window (IW) from between 2 and 4 segments, as specified in RFC 3390, to 10 segments, with a fallback to the existing recommendation when performance issues are detected. It discusses the motivation behind the increase, the advantages and disadvantages of the higher initial window, and presents results from several large-scale experiments showing that the higher initial window improves the overall performance of many web services without resulting in a congestion collapse.

TCP Modification:
 - The upper bound for the initial window will be:

    min (10*MSS, max (2*MSS, 14600))

 - This change applies to the initial window of the connection in the first round-trip time (RTT) of data transmission during or following the TCP three-way handshake.
 - All the test results described in this document were based on the regular Ethernet MTU of 1500 bytes. Future study of the effect of a different MTU may be needed to fully validate (1) above.
 - [In contrast to RFC 3390 and RFC 5681] The proposed change to reduce the default retransmission timeout (RTO) to 1 second [RFC6298] increases the chance for spurious SYN or SYN/ACK retransmission, thus unnecessarily penalizing connections with RTT > 1 second if their initial window is reduced to 1 segment. For this reason, it is RECOMMENDED that implementations refrain from resetting the initial window to 1 segment unless there has been more than one SYN or SYN/ACK retransmission or true loss detection has been made.
 - TCP implementations use slow start in as many as three different ways: (1) to start a new connection (the initial window); (2) to restart transmission after a long idle period (the restart window); and (3) to restart transmission after a retransmit timeout (the loss window). The change specified in this document affects the value of the initial window. Optionally, a TCP MAY set the restart window to the minimum of the value used for the initial window and the current value of cwnd (in other words, using a larger value for the restart window should never increase the size of cwnd). These changes do NOT change the loss window, which must remain 1 segment of MSS bytes (to permit the lowest possible window size in the case of severe congestion).
 - To limit any negative effect that a larger initial window may have on links with limited bandwidth or buffer space, implementations SHOULD fall back to RFC 3390 for the restart window (RW) if any packet loss is detected during either the initial window or a restart window, and more than 4 KB of data is sent.

4. Background
 - According to the latest report from Akamai [AKAM10], global broadband (> 2 Mbps) adoption has surpassed 50%, propelling the average connection speed to 1.7 Mbps, while narrowband (< 256 Kbps) usage has dropped to 5%. In contrast, TCP's initial window has remained 4 KB for a decade [RFC2414], corresponding to a bandwidth utilization of less than 200 Kbps per connection, assuming an RTT of 200 ms.
 - A large proportion of flows on the Internet are short web transactions over TCP and complete before exiting TCP slow start.
 - Applications have responded to TCP's "slow" start. Web sites use multiple subdomains [Bel10] to circumvent the HTTP 1.1 limit of two connections per physical host [RFC2616]. As of today, major web browsers open multiple connections to the same site (up to six connections per domain [Ste08], and the number is growing). This trend is a remedy for HTTP's serialized downloads, achieving parallelism and higher performance, but it also implies that today most access links are severely under-utilized, hence having multiple TCP connections improves performance most of the time.
 - Persistent connections and pipelining are designed to address some of the above issues with HTTP [RFC2616]. Their presence does not diminish the need for a larger initial window; e.g., data from the Chrome browser shows that 35% of HTTP requests are made on new TCP connections. Our test data also shows significant latency reduction with the large initial window even in conjunction with these two HTTP features [Duk10].

5. Advantages of Larger Initial Windows
 - Reducing Latency: An increase of the initial window from 3 segments to 10 segments reduces the total transfer time for data sets greater than 4 KB by up to 4 round trips. The table below compares the number of round trips between IW=3 and IW=10 for different transfer sizes, assuming infinite bandwidth, no packet loss, and the standard delayed ACKs with a large delayed-ACK timer.
    ---------------------------------------
    | total segments |  IW=3   |  IW=10   |
    ---------------------------------------
    |        3       |    1    |    1     |
    |        6       |    2    |    1     |
    |       10       |    3    |    1     |
    |       12       |    3    |    2     |
    |       21       |    4    |    2     |
    |       25       |    5    |    2     |
    |       33       |    5    |    3     |
    |       46       |    6    |    3     |
    |       51       |    6    |    4     |
    |       78       |    7    |    4     |
    |       79       |    8    |    4     |
    |      120       |    8    |    5     |
    |      127       |    9    |    5     |
    ---------------------------------------

   For example, with the larger initial window, a transfer of 32 segments of data will require only 2 rather than 5 round trips to complete.

 - Recovering Faster from Loss on Under-Utilized or Wireless Links: A greater-than-3-segment initial window increases the chance to recover from packet loss through Fast Retransmit rather than the lengthy initial RTO [RFC5681], because the fast retransmit algorithm requires three duplicate ACKs as an indication that a segment has been lost rather than reordered. While newer loss recovery techniques such as Limited Transmit [RFC3042] and Early Retransmit [RFC5827] have been proposed to help speed up loss recovery from a smaller window, both algorithms can still benefit from the larger initial window because of a better chance to receive more ACKs.

8. Mitigation of Negative Impact
   Much of the negative impact from an increase in the initial window is likely to be felt by users behind slow links with limited buffers. The negative impact can be mitigated by hosts directly connected to a low-speed link advertising an initial receive window smaller than 10 segments. This can be achieved either through manual configuration by the users or through the host stack auto-detecting the low-bandwidth links. Additional suggestions to improve the end-to-end performance of slow links can be found in RFC 3150 [RFC3150].

RTO & High Performance:
https://tools.ietf.org/html/rfc7323

Updates the venerable RFC 1323. [Also in RFC 1323] An additional mechanism could be added to the TCP, a per-host cache of the last timestamp received from any connection. This value could then be used in the PAWS mechanism to reject old duplicate segments from earlier incarnations of the connection, if the timestamp clock can be guaranteed to have ticked at least once since the old connection was open. This would require that the TIME-WAIT delay plus the RTT together must be at least one tick of the sender's timestamp clock. Such an extension is not part of the proposal of this RFC.

Appendix G. RTO Calculation Modification
   Taking multiple RTT samples per window would shorten the history calculated by the RTO mechanism in [RFC6298], and the below algorithm aims to maintain a similar history as originally intended by [RFC6298]. It is roughly known how many samples a congestion window's worth of data will yield, not accounting for ACK compression and ACK losses. Such events will result in more history of the path being reflected in the final value for RTO, and are uncritical. This modification will ensure that a similar amount of time is taken into account for the RTO estimation, regardless of how many samples are taken per window:

    ExpectedSamples = ceiling(FlightSize / (SMSS * 2))
    alpha' = alpha / ExpectedSamples
    beta' = beta / ExpectedSamples

   Note that the factor 2 in ExpectedSamples is due to "Delayed ACKs". Instead of using alpha and beta in the algorithm of [RFC6298], use alpha' and beta' instead:

    RTTVAR <- (1 - beta') * RTTVAR + beta' * |SRTT - R'|
    SRTT <- (1 - alpha') * SRTT + alpha' * R'
    (for each sample R')
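A minimal sketch of the Appendix G gain scaling, assuming per-ACK RTT sampling as timestamps allow; the function and parameter names are invented for the example and this is not the FreeBSD implementation:

    #include <math.h>
    #include <stdint.h>

    /* RFC 7323 Appendix G: scale the RFC 6298 gains by the number of RTT
     * samples expected per window, so that sampling on (almost) every ACK
     * does not shorten the estimator's memory.  Illustrative only. */
    static void
    rtt_sample_scaled(double *srtt, double *rttvar, double r,
        uint32_t flight_size, uint32_t smss)
    {
        const double alpha = 1.0 / 8.0, beta = 1.0 / 4.0; /* RFC 6298 gains */

        /* Roughly one sample per two segments in flight (delayed ACKs). */
        double expected_samples = ceil((double)flight_size / (smss * 2.0));
        double alpha_p = alpha / expected_samples;
        double beta_p  = beta / expected_samples;

        *rttvar = (1.0 - beta_p) * *rttvar + beta_p * fabs(*srtt - r);
        *srtt   = (1.0 - alpha_p) * *srtt + alpha_p * r;
    }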
Appendix H. Changes from RFC 1323
   Several important updates and clarifications to the specification in RFC 1323 are made in this document. The [important] technical changes are summarized below:

   (d) The description of which TSecr values can be used to update the measured RTT has been clarified. Specifically, with timestamps, the Karn algorithm [Karn87] is disabled. The Karn algorithm disables all RTT measurements during retransmission, since it is ambiguous whether the <ACK> is for the original segment or the retransmitted segment. With timestamps, that ambiguity is removed since the TSecr in the <ACK> will contain the TSval from whichever data segment made it to the destination.

   (e) RTTM update processing explicitly excludes segments not updating SND.UNA. The original text could be interpreted to allow taking RTT samples when SACK acknowledges some new, non-continuous data.

   (f) In RFC 1323, Section 3.4, step (2) of the algorithm to control which timestamp is echoed was incorrect in two regards: (1) It failed to update TS.Recent for a retransmitted segment that resulted from a lost <ACK>. (2) It failed if SEG.LEN = 0. In the new algorithm, the case of SEG.TSval >= TS.Recent is included for consistency with the PAWS test.

   (g) It is now recommended that the Timestamps option be included in <RST> segments if the incoming segment contained a Timestamps option.

   (h) <RST> segments are explicitly excluded from PAWS processing.

   (j) Snd.TSoffset and Snd.TSclock variables have been added. Snd.TSclock is the sum of my.TSclock and Snd.TSoffset. This allows the starting points for timestamp values to be randomized on a per-connection basis. Setting Snd.TSoffset to zero yields the same results as [RFC1323]. Text was added to guide implementers to the proper selection of these offsets, as entirely random offsets for each new connection will conflict with PAWS.

Congestion Window Validation (CWV):
http://www.ietf.org/proceedings/69/slides/tcpm-7.pdf
https://tools.ietf.org/html/rfc7661

Provides a mechanism to address issues that arise when TCP is used for traffic that exhibits periods where the sending rate is limited by the application rather than the congestion window. This RFC provides an experimental update to TCP that allows a TCP sender to restart quickly following a rate-limited interval. This method is expected to benefit applications that send rate-limited traffic using TCP while also providing an appropriate response if congestion is experienced.

Motivation: Standard TCP states that a TCP sender SHOULD set cwnd to no more than the Restart Window (RW) before beginning transmission if the TCP sender has not sent data in an interval exceeding the retransmission timeout, i.e., when an application becomes idle [RFC5681]. [RFC2861] notes that this TCP behaviour was not always observed in current implementations, and experiments confirm this to still be the case (see [Bis08]). Congestion Window Validation (CWV) [RFC2861] introduced the term "application-limited period" for the time when the sender sends less than is allowed by the congestion or receiver windows. Standard TCP does not impose additional restrictions on the growth of the congestion window when a TCP sender is unable to send at the maximum rate allowed by the cwnd. In this case, the rate-limited sender may grow a cwnd far beyond that corresponding to the current transmit rate, resulting in a value that does not reflect current information about the state of the network path the flow is using.
Use of such an invalid cwnd may result in reduced application performance and/or could significantly contribute to network congestion.

Active Queue Management (AQM):

Active Queue Management is an effort to avoid the latency increases (and the increase in time spent in the feedback loop) and the bursty losses caused by naive tail drop in intermediate buffering. The concept was introduced, along with a discussion of the queue management algorithm "RED" (Random Early Detect/Drop), by RFC 2309. The most current RFC is 7567. The usual mix of long high-throughput and short low-latency flows places conflicting demands on the queue occupancy of a switch:

   o The queue must be short enough that it does not impose excessive latency on short flows.
   o The queue must be long enough to buffer sufficient data for the long flows to saturate the path capacity.
   o The queue must be short enough to absorb incast bursts without excessive packet loss.

RED: The RED algorithm itself consists of two main parts: estimation of the average queue size and the decision of whether or not to drop an incoming packet.

   (a) Estimation of Average Queue Size
       RED estimates the average queue size, either in the forwarding path using a simple exponentially weighted moving average (such as presented in Appendix A of [Jacobson88]), or in the background (i.e., not in the forwarding path) using a similar mechanism.

   (b) Packet Drop Decision
       In the second portion of the algorithm, RED decides whether or not to drop an incoming packet. It is RED's particular algorithm for dropping that results in performance improvement for responsive flows. Two RED parameters, minth (minimum threshold) and maxth (maximum threshold), figure prominently in this decision process. Minth specifies the average queue size *below which* no packets will be dropped, while maxth specifies the average queue size *above which* all packets will be dropped. As the average queue size varies from minth to maxth, packets will be dropped with a probability that varies linearly from 0 to maxp.

Recommendations on Queue Management and Congestion Avoidance in the Internet
https://tools.ietf.org/html/rfc2309
IETF Recommendations Regarding Active Queue Management
https://tools.ietf.org/html/rfc7567
https://en.wikipedia.org/wiki/Active_queue_management
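As a rough sketch of the two RED components just described (queue-average estimation and the linear minth/maxth drop decision) - omitting refinements such as idle-time handling and the packet count since the last drop, and not taken from any switch's or FreeBSD's actual code:

    #include <stdbool.h>
    #include <stdlib.h>

    struct red {
        double avg;    /* EWMA of the queue size, in packets */
        double wq;     /* EWMA weight, e.g. 0.002 */
        double minth;  /* no drops below this average occupancy */
        double maxth;  /* all packets dropped above this average occupancy */
        double maxp;   /* drop probability as avg approaches maxth */
    };

    /* Returns true if the incoming packet should be dropped (or CE-marked
     * when the same rule is used as an ECN marking decision). */
    static bool
    red_enqueue(struct red *r, unsigned int qlen)
    {
        double p;

        r->avg = (1.0 - r->wq) * r->avg + r->wq * qlen;

        if (r->avg < r->minth)
            return false;
        if (r->avg >= r->maxth)
            return true;

        /* Probability grows linearly from 0 at minth to maxp at maxth. */
        p = r->maxp * (r->avg - r->minth) / (r->maxth - r->minth);
        return ((double)rand() / RAND_MAX) < p;
    }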
Explicit Congestion Notification (ECN):

At its core, ECN in TCP allows compliant routers to provide compliant senders with notification of "virtual drops" - a congestion indicator telling the sender to halve its congestion window. This allows the sender to learn of a congestion event without waiting for the retransmit timeout or repeated ACKs, and allows the receiver to avoid the latency induced by drop/retransmit. ECN relies on some form of AQM in the intermediate routers/switches to decide when to mark the CE (Congestion Encountered) bits in the IP header; it is then the receiver's responsibility to set ECE (ECN-Echo) in the TCP header of the subsequent ACK. The receiver will continue to send packets marked with the ECE bit until it receives a packet with the CWR (Congestion Window Reduced) bit set. Note that although this last design decision makes it robust in the presence of ACK loss (the original version of ECN specifies that ACKs / SYNs / SYN-ACKs not be marked as ECN capable and thus are not eligible for marking), it limits the use of ECN to once per RTT. As we'll see later this leads to interoperability issues with DCTCP.

ECN is negotiated at connection time. In FreeBSD it is configured by a sysctl defaulting to off for all connections; enabling the sysctl enables it for all connections. The last time a survey was done, 2.7% of the Internet would not respond to a SYN negotiating ECN. This isn't fatal, as subsequent SYNs will switch to not requesting ECN - it just adds the default RTO to connection establishment (3s in FreeBSD, 1s per RFC 6298 - discussed later). Linux has some very common-sense configurability improvements. Its ECN knob takes on _3_ values: 0) no request / no accept, 1) no request / accept, 2) request / accept. The default is (1), supporting ECN for those adventurous enough to request it. The route command can specify ECN by subnet, in effect allowing servers / clients to only use it within a data center or between compliant data centers.

ECN sees very little usage due to continued compatibility concerns. Although the difficulty of correctly tuning maxth and minth in RED and many other AQM mechanisms is not specific to ECN, RED et al. are necessary to use ECN and thus further add to the difficulties associated with its use.

Talks:

More Accurate ECN Feedback in TCP (AccECN)
 - https://www.ietf.org/proceedings/90/slides/slides-90-tcpm-10.pdf
ECN is slow and does not report the extent of congestion, just its existence. It lacks interoperability with DCTCP. There is a need for a mechanism for negotiating finer-grained, adaptive congestion notification.

RFCs:

A Proposal to add Explicit Congestion Notification (ECN) to IP
 - https://tools.ietf.org/html/rfc2481
Initial proposal.

The Addition of Explicit Congestion Notification (ECN) to IP
 - https://tools.ietf.org/html/rfc3168
Elaboration and further specification of how to tie it in to TCP.

Adding Explicit Congestion Notification (ECN) Capability to TCP's SYN/ACK Packets
 - https://tools.ietf.org/html/rfc5562
Sometimes referred to as ECN+. This extends ECN to SYN/ACK packets. Note that SYN packets are still not covered, being considered a potential security hole.

Accurate ECN (AccECN)

Problem Statement and Requirements for Increased Accuracy in Explicit Congestion Notification (ECN) Feedback
 - https://tools.ietf.org/html/rfc7560
"A primary motivation for this document is to intervene before each proprietary implementation invents its own non-interoperable handshake, which could lead to _de facto_ consumption of the few flags or codepoints that remain available for standardizing capability negotiation."

Incast:

The term was coined in [PANFS] for the case of increasing the number of simultaneously initiated, effectively barrier-synchronized, fan-in flows into a single port to the point where the instantaneous switch / NIC buffering capacity is exceeded, causing a decline in aggregate bandwidth as the need for retransmits increases. This is further exacerbated by tail-drop behavior in the switch, whereby multiple losses within individual streams exceed the recovery abilities of duplicate ACKs or SACK, leading to RTOs before the flow is resumed.

The Panasas ActiveScale Storage Cluster - Delivering Scalable High Bandwidth Storage [PANFS]
 - http://acm.supercomputing.org/sc2004/schedule/pdfs/pap207.pdf

Focuses on the Object-based Storage Device (OSD) component backing the PanFS distributed file system. PanFS runs on the client; backend storage consists of networked block devices (OSDs).
The intelligence consists in how stripes are laid out across OSDs. PanFS relies on a Metadata Server (MDS) to control the interaction of clients with the objects on OSDs and to maintain cache coherency.

Scalable bandwidth is achieved through aggregation by striping data across many OSDs. Although in principle it would be desirable to stripe files as widely as possible, in practice, in their 1Gbps testbed (this is 2004), bandwidth scaled linearly from 3 to 7 OSDs but after 14 OSDs aggregate bandwidth actually decreased. With a 10ms disk access latency, if just one OSD experienced enough packet loss to result in one 200ms RTO, the system would suffer a 10x decrease in performance.

Changes to address the incast problem:
 - Reduce the minRTO from 200ms to 50ms.
 - Tune the _individual_ socket buffer size. While a client must have a large aggregate receive buffer size, each individual stream's receive buffer should be relatively small. Thus they reduced the clients' (per-OSD) receive socket buffer to under 64K.
 - To reduce the size of a single synchronized incast response, PanFS implements a two-level striping pattern. The first level is optimized for RAID's parity update performance and read overhead. The second level of striping is designed to resist incast-induced bandwidth penalties by stacking successive parity stripes in the same subset of objects. They call N sequential parity stripes stacked in the same set of objects a 'visit', because a client repeatedly fetches data from just a few OSDs (whose number is controlled by the parity stripe width) for a while, then moves on to the next set of OSDs. This striping pattern minimizes simultaneous fan-in and thus the potential for incast. Typically PanFS stripes about 1GB of data per visit, using a round-robin layout algorithm of visits across all OSDs.

Measurement and Analysis of TCP Throughput Collapse in Cluster-based Storage Systems
 - https://www.usenix.org/legacy/event/fast08/tech/full_papers/phanishayee/phanishayee_html/

Attempts a more general analysis of incast than [PANFS]. The analysis is based on the model of a cluster-based storage system with data blocks striped over a number of servers. They refer to a single block fragmented over multiple servers as a Server Request Unit (SRU). A subsequent block request will only be made after the client has received all the data for the current block. They refer to such reads as 'synchronized reads'. The paper makes three contributions to the literature:
 - Explores the root causes of incast, characterizing it under a variety of conditions (buffer space, varying number of servers, etc.). Buffer space can delay the onset of incast, but any particular switch configuration will have some maximum number of servers that can send simultaneously before throughput collapse occurs.
 - Reproduces incast collapse on 3 different models of switches. In some cases disabling QoS can help delay incast by freeing up packet buffers for general switching.
 - Demonstrates the applicability of simulation by showing that the throughput collapse curve produced by ns-2 with a simulated 32KB buffer closely matches that shown by the HP Procurve 2848 with QoS disabled.
 - Analysis of TCP traces obtained from simulation reveals that TCP retransmission timeouts are the primary cause of incast.
 - Displays the effect of varying the switch buffer size:
   Doubling the size of the switch's output port buffer doubles the number of servers that can be supported before the system experiences incast.
 - TCP performs well in settings without synchronized reads, which can be modelled by an infinite SRU size. Running netperf across many servers does not induce incast. With larger SRU sizes servers can use the spare link capacity made available by any stalled flow waiting for a timeout event.
 - Examines the effectiveness of existing TCP variants (e.g. Reno, NewReno, SACK, and limited transmit). Although the move from Reno to NewReno improves performance, none of the additional improvements help. When TCP loses all packets in its window or loses retransmissions, no clever loss recovery algorithms can help.
 - Examines a set of techniques that are moderately effective in masking incast, such as drastically reducing TCP's retransmission timeout timer. None of these techniques are without drawbacks.
   - Reducing RTOmin from 200ms to 200us improves throughput by an order of magnitude for 8-32 servers. However, at the time of the paper Linux and BSD TCP implementations were unable to provide a timer of sufficient granularity to calculate RTT at less than the system clock frequency.

Understanding TCP Incast Throughput Collapse in Datacenter Networks
 - http://conferences.sigcomm.org/sigcomm/2009/workshops/wren/papers/p73.pdf

Proposes an analytical model of limited generality based on the results observed in two test beds.
 - Observed little benefit from disabling delayed ACKs.
 - Observed a much shallower decline in throughput after 4 servers with 1ms minRTO vs 200ms minRTO. No benefit was shown for 200us over 1ms. [The next paper concludes that this was because the calculated RTO never went below 5ms, so a 200us minRTO was equivalent to disabling minRTO in this setting.]
 - For large RTO timer values, reducing the RTO timer value is a first-order mitigation. For smaller RTO timer values, intelligently controlling the inter-packet wait time [pacing] becomes crucial.
 - Observes two regions of throughput increase. Following the initial throughput decline there is an increasing region. They reason that as the number of senders increases, 'T' increases, and there is less overlap in the RTO periods for different senders. This means the impact of RTO events is less severe - a mitigating effect. (Prob(enter RTO at t) = { 1/T : d < t < d + T, 0 : otherwise }, where d is the delay for congestion info to propagate back to the sender and T is the width of the uniform distribution in time.)
 - The smaller the RTO timer values, the faster the rate of recovery between the throughput minimum and the second-order throughput maximum. For smaller RTO timer values, the same increase in 'T' will have a larger mitigating effect. Hence, as the number of senders increases, the same increase in 'T' will result in a faster increase in goodput for smaller RTO timer values.
 - After the second-order goodput maximum, the slope of throughput decrease is the same for different RTO timer values. When 'T' becomes comparable to or larger than the RTO timer value, the amount of interference between retransmits after RTO and transmissions before RTO no longer depends on the value of the RTO timer.
Safe and Effective Fine-grained TCP Retransmissions for Datacenter Communication
 - https://www.cs.cmu.edu/afs/cs.cmu.edu/Web/People/ekrevat/docs/SIGCOMMIncast.pdf

Effectively makes the case for using high-resolution timers to enable microsecond-granularity TCP timeouts. They claim to demonstrate that this technique is effective in avoiding TCP incast collapse in both simulation and real-world experiments.
 - Prototype uses Linux's high-resolution kernel timers.
 - Demonstrates that this change prevents incast collapse in practice for up to 47 senders.
 - Demonstrates that simply reducing RTOmin in today's [2009] TCP implementations without also improving the timing granularity does not prevent TCP incast.
 - Even without incast patterns, the RTO can determine observed performance. Simple example: they started ten bulk-data transfer TCP flows from ten clients to one server, then had another client issue small request packets for 1KB of data from the server, waiting for the response before sending the next request. Approximately 1% of these requests experienced a TCP timeout, delaying the response by at least 200ms. Finer-grained retransmission handling can improve the performance of latency-sensitive applications.

Evaluating Throughput with Fine-Grained RTO:
 - To be maximally effective, timers must operate on a granularity close to the RTT of the network.
 - Jacobson RTO Estimation:
   - The standard RTO estimator [V. Jacobson, 88] tracks a smoothed estimate of the round-trip time, and sets the timeout to this RTT estimate plus 4 times the mean deviation (a simpler calculation than the standard deviation; given a normal distribution of prediction errors, mdev = sqrt(pi/2)*sdev).
   - RTO = SRTT + 4*RTTMDEV
   - Two factors set lower bounds on the value that the RTO can achieve:
     - the explicit configuration parameter RTOmin
     - the implicit effects of the granularity with which the RTT is measured and with which the kernel sets and checks timers. Most implementations track RTTs and timers at a granularity of 1ms or larger, so the minimum achievable RTO is 5ms.
 - In Simulation (one client with multiple servers connected through a single switch with an unloaded RTT of 100us, each node with a 1Gbps link, switch buffers of 32KB per output port, and a random timer scheduling delay of up to 20us to account for real-world variance):
   - With an RTOmin of 200ms, throughput drops by an order of magnitude with 8 concurrent senders.
   - Reducing RTOmin to 1ms is effective for 8-16 concurrent senders, fully utilizing the client's link. However, throughput declines as the number of servers is increased; 128 concurrent senders use only 50% of the available link bandwidth even with a 1ms RTOmin.
 - In Real Clusters (sixteen-node cluster w/ HP Procurve 2848 & 48-node cluster w/ Force10 S50 switch - all nodes 1Gbps and a client-to-server RTT of ~100us):
   - Modified the Linux 2.6.28 kernel to use 'microsecond-accurate' timers with microsecond-granularity RTT estimation.
   - For all configurations, throughput drops with increasing RTOmin above 1ms. For 8 and 16 concurrent senders, the default RTOmin of 200ms results in nearly 2 orders of magnitude drop in throughput.
   - Results show identical performance for RTOmin values of 200us and 1ms.
     Although the baseline RTTs can be between 50-100us, increased congestion causes RTTs to rise to 400us on average, with spikes as high as 850us. The higher RTTs combined with increased RTT variance cause the RTO estimator to set timeouts of 1-3ms, so an RTOmin below 1ms will not lead to shorter retransmission times. In effect, specifying an RTOmin <= 1ms is equivalent to eliminating RTOmin.
 - Next-Generation Datacenters:
   - 10Gbps networks have smaller RTTs than 1Gbps - port-to-port latency can be as low as 10us. In a sampling of an active storage node at LANL, 20% of RTTs are below 100us even when accounting for kernel scheduling.
   - Smaller RTO values are required to avoid idle link time.
 - Scaling to Thousands [simulating large numbers of servers on a 10Gbps network] (reduce baseline RTTs from 100us to 20us, eliminate the 20us timer scheduling variance, increase link capacity to 10Gbps, set per-port buffer size to 32KB, increase blocksize to 80MB to ensure each flow can saturate a 10Gbps link, vary the number of servers from 32 to 2048):
   - Having an artificial bound of either 1ms or 200us results in low throughput in a network whose RTTs are 20us - underscoring the requirement that retransmission timeouts be on the same timescale as network latency to avoid incast collapse.
   - Eliminating a lower bound on the RTO performs well for up to 512 concurrent senders. For 1024 servers and beyond, even the aggressively low RTO configuration sees up to a 50% reduction in throughput resulting from significant periods of link idle time caused by repeated, simultaneous, successive timeouts.
   - For incast communication the standard exponential backoff increase of the RTO can overshoot some portion of the time the link is actually idle. Because only one flow must overshoot to delay the entire transfer, the probability of overshooting increases with an increased number of flows.
   - Decreased throughput for a large number of flows can be attributed to many flows timing out simultaneously, backing off deterministically, and retransmitting at the same time. While some flows are successful on this retransmission, a majority of flows lose their retransmitted packet and back off by another factor of two, sometimes far beyond when the link becomes idle.
 - Desynchronizing Retransmissions:
   - Adding some randomness to the RTO will desynchronize retransmissions.
   - Adding an adaptive randomized RTO to the scheduled timeout:

        timeout = (RTO + (rand(0.5) x RTO)) x 2^backoff

     performs well regardless of the number of concurrent senders. Nonetheless, real-world variances may be large enough to avoid the need for explicit randomization in practice. (A sketch of this backoff appears after this paper's summary below.)
   - Does not evaluate the impact on wide-area flows.
 - Implementing fine-grained retransmissions. Three changes to the Linux TCP stack were required:
   - microsecond-resolution time accounting to track RTTs with greater precision - store microseconds in the TCP timestamp option [timestamp resolution can go as high as 57ns without violating the requirements of PAWS]
   - redefinition of TCP constants - timer constants formerly defined in terms of jiffies [ticks] are converted to absolute values (e.g.
1ms instead of 1 jiffy)
   - replacement of low-resolution timers with hrtimers - replace the standard timer objects in the socket structure with the hrtimer structure, ensuring that all calls to set, reset, or clear timers use the hrtimer functions.
 - Results:
   - Using the default 200ms RTOmin, throughput plummets beyond 8 concurrent senders on both testbeds.
   - On the 16-server testbed, with a 5ms jiffy-based RTOmin, throughput begins to drop at 8 servers to ~70% of link capacity and slowly decreases thereafter. On the 47-server testbed [Force10 switch] the 5ms RTOmin kernel obtained 70-80% throughput with a substantial decline after 40 servers.
   - The TCP hrtimer implementation / microsecond RTO kernel is able to saturate the link for up to 16/47 servers [the total number in both testbeds].
 - Implications of Fine-Grained TCP Retransmissions:
   - A receiver's delayed ACK timer should always fire before the sender's retransmission timer fires, to prevent the sender from timing out waiting for an ACK that is merely delayed. Current systems protect against this by setting the delayed ACK timer to a value (40ms) that is safely under the RTOmin (200ms).
   - A host with microsecond-granularity retransmissions would periodically experience an unnecessary timeout when communicating with unmodified hosts in environments where the RTO is below 40ms (e.g., in the data center and for short flows in the WAN), because the sender incorrectly assumes that a loss has occurred. In practice the two consequences are mitigated by newer TCP features and the limited circumstances in which they occur (and bulk data transfer is essentially unimpacted by the issue).
   - The major potential effect of a spurious timeout is a loss of performance: a flow that experiences a timeout will reduce its slow-start threshold (ssthresh) by half, reduce its window to one, and attempt to rediscover link capacity. It is important to understand that spurious timeouts do not endanger network stability through increased congestion [On estimating end-to-end network path properties, SIGCOMM 99]. Spurious timeouts occur not when the network path drops packets, but rather when the path observes a sudden, higher delay.
   - Several algorithms have been proposed to undo the effects of spurious timeouts; one of them, F-RTO [Forward RTO-Recovery, RFC 4138], has been adopted in the Linux TCP implementation.
   - When seeding torrents over a WAN there was no observable difference in performance between the 200us and 200ms RTOmin [no penalty].
   - Interaction with Delayed ACK in the Datacenter: For servers using a reduced RTO in a datacenter environment, the server's retransmission timer may expire long before an unmodified client's 40ms delayed ACK timer expires. As a result, the server will time out and resend the unacked packet, cutting ssthresh in half and rediscovering link capacity using slow start. Because the client acknowledges the retransmitted segment immediately, the server does not observe a coarse-grained 40ms delay, only an unnecessary timeout.
   - Although for full performance delayed ACKs should be disabled, unmodified clients still achieve good performance and avoid incast when only the servers implement fine-grained retransmissions.
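As flagged in the Desynchronizing Retransmissions notes above, here is a minimal sketch of the randomized exponential backoff, timeout = (RTO + rand(0.5) x RTO) x 2^backoff. It only illustrates the formula; it is not code from the paper or from FreeBSD:

    #include <stdlib.h>

    /* Desynchronized retransmission backoff.  Times are in microseconds;
     * "rand(0.5)" denotes a uniform value in [0, 0.5). */
    static double
    rto_backoff_timeout(double rto_us, unsigned int backoff_shift)
    {
        double jitter = 0.5 * ((double)rand() / ((double)RAND_MAX + 1.0));

        return (rto_us + jitter * rto_us) * (double)(1u << backoff_shift);
    }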
Data Center Transmission Control Protocol (DCTCP):

The Microsoft- and Stanford-developed CC protocol uses simplified switch RED/ECN CE marking to provide fine-grained congestion notification to senders. RED is enabled in the switch, but with minth=maxth=K, where K is an empirically determined constant that is a function of bandwidth and desired switch utilization vs. rate of convergence. Common values for K are 5 for 1Gbps and 60 for 10Gbps; the value for 40Gbps is presumably on the order of 240. The sender's congestion window is scaled back once per RTT as a function of (#ECE/(#segments in window))/2. In the degenerate case of all segments being marked, the window is scaled back a la a loss in Reno. In the steady state, latencies are much lower than in Reno due to considerably reduced switch occupancy.

There is currently no mechanism for negotiating CC protocols, and DCTCP's reliance on continuous ECE notifications is incompatible with ECN's continuous repeating of the same ECE until a CWR is received. In effect, ECN support has to be successfully negotiated when establishing the connection, but the receiver has to instead provide one ECE per new CE seen.

RFC:

Datacenter TCP (DCTCP): TCP Congestion Control for Datacenters
https://tools.ietf.org/pdf/draft-ietf-tcpm-dctcp-00.pdf

The window scaling constant is referred to as 'alpha'. Alpha=0 corresponds to no congestion; alpha=1 corresponds to a loss event in Reno or an ECE mark in standard ECN - resulting in a halving of the congestion window. 'g' is the feedback gain; 'M' is the fraction of bytes marked to bytes sent. Alpha and the congestion window 'cwnd' are calculated as follows:

    alpha = alpha * (1 - g) + g * M
    cwnd = cwnd * (1 - alpha/2)

To cope with delayed ACKs, DCTCP specifies the following state machine - CE refers to DCTCP.CE, a new Boolean TCP state variable, "DCTCP Congestion Encountered", which is initialized to false and stored in the Transmission Control Block (TCB):

                     Send immediate
                     ACK with ECE=0
            .----.   .--------------.   .----.
 Send 1 ACK |    v   v              |   |    | Send 1 ACK
 for every  |  .------.          .------.    | for every
 m packets  |  | CE=0 |          | CE=1 |    | m packets
 with ECE=0 |  '------'          '------'    | with ECE=1
            |    |  |              ^  ^      |
            '----'  '--------------'  '-----'
                    Send immediate
                    ACK with ECE=1

The clear implication of this is that if the ACK is delayed by more than m packets, as with differing assumptions between peers or dropped ACKs, the signal can underestimate the level of encountered congestion. None of the literature suggests that this has been a problem in practice.

[Section 3.4 of RFC] Handling of SYN, SYN-ACK, RST Packets
   [RFC3168] requires that a compliant TCP MUST NOT set ECT on SYN or SYN-ACK packets. [RFC5562] proposes setting ECT on SYN-ACK packets, but maintains the restriction of no ECT on SYN packets. Both these RFCs prohibit ECT in SYN packets due to security concerns regarding malicious SYN packets with ECT set. These RFCs, however, are intended for general Internet use, and do not directly apply to a controlled datacenter environment. The switching fabric can drop TCP packets that do not have ECT set in the IP header. If SYN and SYN-ACK packets for DCTCP connections do not have ECT set, they will be dropped with high probability. For DCTCP connections, the sender SHOULD set ECT for SYN, SYN-ACK and RST packets.
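A minimal sketch of the per-RTT alpha/cwnd update described above (floating point for clarity; real implementations, including FreeBSD's, use fixed point, and the bookkeeping of marked vs. acked bytes is assumed to happen elsewhere):

    #include <stdint.h>

    struct dctcp {
        double   alpha;        /* fraction of the window considered congested */
        double   g;            /* feedback gain, e.g. 1.0/16.0 */
        uint64_t bytes_acked;  /* bytes acknowledged this observation window */
        uint64_t bytes_marked; /* of those, bytes whose ACKs carried ECE */
    };

    /* Called once per RTT (once per observation window). */
    static uint32_t
    dctcp_window_update(struct dctcp *d, uint32_t cwnd)
    {
        double m = 0.0;

        if (d->bytes_acked > 0)
            m = (double)d->bytes_marked / (double)d->bytes_acked;

        /* alpha <- (1 - g)*alpha + g*M */
        d->alpha = (1.0 - d->g) * d->alpha + d->g * m;

        d->bytes_acked = d->bytes_marked = 0;

        /* cwnd <- cwnd * (1 - alpha/2); once alpha has converged to 1
         * (every byte marked) this is the Reno-like halving noted above. */
        if (m > 0.0)
            cwnd = (uint32_t)(cwnd * (1.0 - d->alpha / 2.0));
        return cwnd;
    }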
[Section 4] Implementation Issues
 - The implementation must choose a suitable estimation gain (feedback gain).
   - [DCTCP10] provides a theoretical basis for its selection; in practice it is more practical to select it empirically per network/workload.
   - The Microsoft implementation uses a fixed estimation gain of 1/16.
 - The implementation must decide when to use DCTCP. DCTCP may not be suitable or supported for all peers.
 - It is RECOMMENDED that the implementation deal with loss episodes in the same way as conventional TCP.
 - To prevent incast throughput collapse, the minimum RTO (MinRTO) should be lowered significantly. The default value of MinRTO in Windows is 300ms, Linux 200ms, and FreeBSD 233ms. A lower MinRTO requires a correspondingly lower delayed ACK timeout on the receiver. Thus, it is RECOMMENDED that an implementation allow configuration of lower timeouts for DCTCP connections.
 - It is also RECOMMENDED that an implementation allow configuration of restarting the congestion window (cwnd) of idle DCTCP connections as described in [RFC5681].
 - [RFC3168] forbids the ECN-marking of pure ACK packets, because of the inability of TCP to mitigate ACK-path congestion and protocol-wise preferential treatment by routers. However, dropping pure ACKs - rather than ECN marking them - has disadvantages for typical datacenter traffic patterns. Dropping of ACKs causes subsequent retransmissions. It is RECOMMENDED that an implementation provide a configuration knob that forces ECT to be set on pure ACKs.

[Section 5] Deployment Issues
 - DCTCP and conventional TCP congestion control do not coexist well in the same network. In DCTCP, the marking threshold is set to a very low value to reduce queueing delay, and a relatively small amount of congestion will exceed the marking threshold. During such periods of congestion, conventional TCP will suffer packet loss and quickly and drastically reduce cwnd. DCTCP, on the other hand, will use the fraction of marked packets to reduce cwnd more gradually. Thus, the rate reduction in DCTCP will be much slower than that of conventional TCP, and DCTCP traffic will gain a larger share of the capacity compared to conventional TCP traffic traversing the same path. It is RECOMMENDED that DCTCP traffic be segregated from conventional TCP traffic. [MORGANSTANLEY] describes a deployment that uses the IP DSCP bits to segregate the network such that AQM is applied to DCTCP traffic, whereas TCP traffic is managed via drop-tail queueing.
 - Since DCTCP relies on congestion marking by the switches, DCTCP can only be deployed in datacenters where the entire network infrastructure supports ECN. The switches may also support configuration of the congestion threshold used for marking. The proposed parameterization can be configured with switches that implement RED. [DCTCP10] provides a theoretical basis for selecting the congestion threshold, but as with the estimation gain, it may be more practical to rely on experimentation or simply to use the default configuration of the device. DCTCP will degrade to loss-based congestion control when transiting a congested drop-tail link.
 - DCTCP requires changes on both the sender and the receiver, so both endpoints must support DCTCP. Furthermore, DCTCP provides no mechanism for negotiating its use, so both endpoints must be configured through some out-of-band mechanism to use DCTCP.
   A variant of DCTCP that can be deployed unilaterally and only requires standard ECN behavior has been described in [ODCTCP][BSDCAN], but requires additional experimental evaluation.

[Section 6] Known Issues
 - DCTCP relies on the sender's ability to reconstruct the stream of CE codepoints received by the remote endpoint. To accomplish this, DCTCP avoids using a single ACK packet to acknowledge segments received both with and without the CE codepoint set. However, if one or more ACK packets are dropped, it is possible that a subsequent ACK will cumulatively acknowledge a mix of CE and non-CE segments. This will, of course, result in a less accurate congestion estimate.
   o Even with an inaccurate congestion estimate, DCTCP may still perform better than [RFC3168].
   o If the estimation gain is small relative to the packet loss rate, the estimate may not be too inaccurate.
   o If packet loss mostly occurs under heavy congestion, most drops will occur during an unbroken string of CE packets, and the estimate will be unaffected.
 - The effect of packet drops on DCTCP under real-world conditions has not been analyzed.
 - Much like standard TCP, DCTCP is biased against flows with longer RTTs. A method for improving the fairness of DCTCP has been proposed in [ADCTCP], but requires additional experimental evaluation.

Papers:

Data Center TCP [DCTCP10]
 - http://research.microsoft.com/en-us/um/people/padhye/publications/dctcp-sigcomm2010.pdf

The original DCTCP SIGCOMM paper by Stanford and Microsoft Research. It is very accessible, even for those of us not well versed in CC protocols.
 - reduce minRTO to 10ms.
 - suggest that K > (RTT * C)/7, where C is the sending rate in packets per second.

Attaining the Promise and Avoiding the Pitfalls of TCP in the Datacenter [MORGANSTANLEY]
 - https://www.usenix.org/system/files/conference/nsdi15/nsdi15-paper-judd.pdf

Real-world experience deploying DCTCP on Linux at Morgan Stanley.
 - reduce minRTO to 5ms.
 - reduce delayed ACK to 1ms.
 - Only ToR switches support ECN marking; higher-level switches are purely tail-drop. Tests show that DCTCP successfully resorts to loss-based congestion control when transiting a congested drop-tail link.
 - Find that setting ECT on SYN and SYN-ACK is critical for the practical deployment of DCTCP. Under load, DCTCP would fail to establish network connections in the absence of ECT in SYN and SYN-ACK packets. (DCTCP+)
 - Without correct receive buffer tuning DCTCP will converge _faster_ than TCP, rather than the theoretical 1.4 x TCP.

    Per-packet latency in ms
                TCP       DCTCP+
    Mean        4.01      0.0422
    Median      4.06      0.0395
    Maximum     4.20      0.0850
    Minimum     3.32      0.0280
    sigma       0.167     0.0106

Extensions to FreeBSD Datacenter TCP for Incremental Deployment Support [BSDCAN]
 - https://www.bsdcan.org/2015/schedule/attachments/315_dctcp-bsdcan2015-paper.pdf

Proposes a variant of DCTCP that can be deployed on only one endpoint of a connection, provided the peer is ECN-capable.

ODCTCP changes:
 - In order to facilitate one-sided deployment, a DCTCP sender should set the CWR mark after receiving an ECE-marked ACK once per RTT. This is safe in two-sided deployments, because a regular DCTCP receiver will simply ignore the CWR mark.
 - A one-sided DCTCP receiver should always delay an ACK for incoming packets marked with CWR, which is the only indication of recovery exit.

DCTCP improvements:
 - ECE processing: Under standard ECN an ACK with an ECE mark will trigger congestion recovery.
   When this happens a sender stops increasing cwnd for one RTT. For DCTCP there is no reason for this response: ECEs are used not for detecting congestion events, but to quantify the extent of congestion and react proportionally. Thus, there is no need to stop cwnd from increasing.
 - Set the initial value of alpha to 0 (i.e. don't halve cwnd on the first ECE seen).
 - Idle Periods: The same tradeoffs regarding "slow-start restart" apply to alpha. The FreeBSD implementation re-initializes alpha after an idle period longer than the RTO.
 - Timeouts and Packet Loss: The DCTCP specification defines the update interval for alpha as one RTT. To track this, DCTCP compares received ACKs against the sequence numbers of outgoing packets. This is not robust in the face of packet loss. The FreeBSD implementation addresses this by updating alpha when it detects duplicate ACKs or timeouts.

Data Center TCP (DCTCP)
 - http://www.ietf.org/proceedings/80/slides/iccrg-3.pdf

Case studies, workloads, latency and flow completion time of TCP vs DCTCP. Interesting set of slides worth skimming.
 - Small (10-100KB & 100KB-1MB) background flows complete in ~45% less time than TCP.
 - 99th %ile & 99.9th %ile query flow completion times are 2/3rds and 4/7ths, respectively.
 - Large (1-10MB & > 10MB) flows are unchanged.
 - Query completion time with 10-to-1 background incast is unchanged with DCTCP, ~5x slower with TCP.

Analysis of DCTCP: Stability, Convergence, and Fairness [ADCTCP]
 - http://sedcl.stanford.edu/files/dctcp-analysis.pdf

Follow-up mathematical analysis of DCTCP using a fluid model. Contains interesting graphs showing how the gain factor affects the convergence rate between two flows.
 - Analyzes the convergence of DCTCP sources to their fair share, obtaining an explicit characterization of the convergence rate.
 - Proposes a simple change to DCTCP suggested by the fluid model which significantly improves DCTCP's RTT-fairness. It suggests updating the congestion window continuously rather than once per RTT.
 - Finds that with a marking threshold, K, of about 17% of the bandwidth-delay product, DCTCP achieves 100% throughput, and that even for values of K as small as 1% of the bandwidth-delay product, its throughput is at least 94%.
 - Shows that DCTCP's convergence rate is no more than a factor of 1.4 slower than TCP.

Using Data Center TCP (DCTCP) in the Internet [ODCTCP]
 - http://www.ikr.uni-stuttgart.de/Content/Publications/Archive/Wa_GLOBECOM_14_40260.pdf

Investigates what would be needed to deploy DCTCP incrementally outside the data center.
 - Proposes a finer resolution for the alpha value.
 - Allows the congestion window to grow in the CWR state (similar to [BSDCAN]).
 - Continuous update of alpha: defines a smaller gain factor (1/2^8 instead of 1/2^4) to permit an EWMA updated on every packet. However, g should actually be a function of the number of packets in flight.
 - Progressive congestion window reduction: similar to [ADCTCP], reduce the congestion window on the reception of each ECE.
 - Develops a formula for AQM RED parameters that always results in equal sharing between DCTCP and non-DCTCP.

Incast Transmission Control Protocol (ICTCP):

In ICTCP the receiver plays a direct role in estimating the per-flow available bandwidth and actively re-sizes each connection's receive window accordingly.
 - http://research.microsoft.com/pubs/141115/ictcp.pdf

Quantum Congestion Notification (QCN):

Congestion control in Ethernet.
Introduced as part of the IEEE 802.1 Standards Body discussions for Data Center Bridging [DCB], motivated by the needs of FCoE. The initial congestion control protocol was standardized as 802.1Qau. Unlike the single bit of congestion information per packet in TCP, QCN uses 6 bits.

The algorithm is composed of two main parts: Switch or Control Point (CP) Dynamics and Rate Limiter or Reaction Point (RP) Dynamics.

 - The CP Algorithm runs at the network nodes. Its objective is to maintain the node's buffer occupancy at the operating point 'Beq'. It computes a congestion measure Fb and randomly samples an incoming packet with a probability proportional to the severity of the congestion. The node sends a 6-bit quantized value of Fb back to the source of the sampled packet.

    B: value of the current queue length
    Bold: value of the buffer occupancy when the last feedback message was generated
    w: a non-negative constant, equal to 2 for the baseline implementation
    Boff = B - Beq
    Bd = B - Bold
    Fb = Boff + w*Bd

   This is essentially equivalent to the PI AQM: the first term is the offset from the target operating point and the second term is proportional to the rate at which the queue size is changing. When Fb < 0, there is no congestion, and no feedback messages are sent. When Fb >= 0, either the buffers or the link is oversubscribed, and control action needs to be taken.

 - The RP algorithm runs on end systems (NICs) and controls the rate at which Ethernet packets are transmitted. Unlike TCP, the RP algorithm does not get positive ACKs from the network and thus needs alternative mechanisms for increasing its sending rate.

    Current Rate (Rc): the transmission rate of the source
    Target Rate (Rt): the transmission rate of the source just before the arrival of the last feedback message
    Gain (Gd): a constant chosen so that Gd*|Fbmax| = 1/2 - that is to say, the rate can decrease by at most 50%. Only 6 bits are available for feedback, so Fbmax = 64, and thus Gd = 1/128.
    Byte Counter: a counter at the RP for counting transmitted bytes; used to time rate increases
    Timer: a clock at the RP used for timing rate increases

   Rate Decreases: A rate decrease is only done when a feedback message is received:

    Rt <- Rc
    Rc <- Rc*(1 - Gd*|Fb|)

   Rate Increases: Rate increase is done in two phases: Fast Recovery and Active Increase.

   Fast Recovery (FR): The source enters the FR state immediately after a rate decrease event, at which point the Byte Counter is reset. FR consists of 5 cycles, in each of which 150KB of data (assuming full-sized regular frames) are transmitted (100 packets of 1500 bytes each), as counted by the Byte Counter. At the end of each cycle, Rt remains unchanged, and Rc is updated as follows:

    Rc <- (Rc + Rt)/2

   The rationale is that, when congested, Rate Decrease messages are sent by the CP once every 100 packets; thus the absence of a Rate Decrease message during this interval indicates that the CP is no longer congested.

   Active Increase (AI): After 5 cycles of FR, the source enters the AI state, where it probes for extra bandwidth. AI consists of multiple cycles of 50 packets each. Rt and Rc are updated as follows:

    Rt <- Rt + Rai
    Rc <- (Rc + Rt)/2

    Rai: a constant set to 5Mbps by default.

   When Rc is extremely small after a rate decrease, the time required to send out 150 KB can be excessive.
   To increase the rate of increase, the source also uses a timer, which works as follows:

    1) reset the timer when a rate decrease message arrives
    2) the source enters FR and counts out 5 cycles of T ms duration
       (T = 10ms in the baseline implementation); in the AI state,
       each cycle is T/2 ms long
    3) in the AI state, Rc is updated when _either_ the Byte Counter
       or the Timer completes a cycle
    4) the source is in the AI state iff either the Byte Counter
       or the Timer is in the AI state
    5) if _both_ the Byte Counter and the Timer are in AI, the source is
       said to be in Hyper-Active Increase (HAI). In this case, at the
       completion of the ith Byte Counter and Timer cycle, Rt and Rc
       are updated:

        Rt <- Rt + i*Rhai
        Rc <- (Rc + Rt) / 2

        Rhai: 50Mbps in the baseline

[Taken from "Internet Congestion Control" by Subir Varma, ch. 8]

Performance of Quantized Congestion Notification in TCP Incast Scenarios of Data Centers
 - http://eprints.networks.imdea.org/131/1/Performance_of_Quantized_Congestion_Notification_-_2010_EN.pdf

Using the QCN pseudocode released by Rong Pan [IEEE EDCS-608482], simulated the performance of QCN at 1Gbps under a number of incast scenarios, reaching the conclusion that the default QCN behaviors will not scale to a large number of flows with full link utilization. It goes on to propose a small number of changes to the QCN algorithm that _will_ support a large number of flows at full link utilization. However, there is no indication in the literature that these ideas have been taken any further in practice. A survey paper written in 2014 [A Survey on TCP Incast in Data Center Networks] indicates that these problems still exist. It is unclear what the current state of the art is in shipping hardware.
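Pulling the CP and RP descriptions together, here is a compact sketch using the variable names above. It is illustrative only, not the 802.1Qau reference pseudocode:

    /* Control Point (switch): compute the congestion measure for a sampled
     * packet.  Returns a quantized |Fb| (0 means "no feedback message"). */
    static unsigned int
    qcn_cp_feedback(int b, int b_old, int b_eq, int w /* 2 in the baseline */)
    {
        int b_off = b - b_eq;       /* offset from the operating point */
        int b_d = b - b_old;        /* rate of change of the queue */
        int fb = b_off + w * b_d;

        if (fb < 0)
            return 0;               /* not congested: no feedback */
        return (fb > 63) ? 63 : (unsigned int)fb;  /* fit the 6-bit field */
    }

    /* Reaction Point (NIC): rate decrease on a feedback message, plus the
     * Fast Recovery / Active Increase averaging toward the target rate. */
    struct qcn_rp {
        double rc;                  /* current rate */
        double rt;                  /* target rate */
    };

    static void
    qcn_rp_decrease(struct qcn_rp *rp, unsigned int fb)
    {
        const double gd = 1.0 / 128.0;   /* so that Gd * |Fbmax| = 1/2 */

        rp->rt = rp->rc;
        rp->rc = rp->rc * (1.0 - gd * fb);
    }

    /* One Fast Recovery cycle (150KB sent without further feedback). */
    static void
    qcn_rp_fr_cycle(struct qcn_rp *rp)
    {
        rp->rc = (rp->rc + rp->rt) / 2.0;
    }

    /* One Active Increase cycle: probe for bandwidth (Rai = 5Mbps baseline). */
    static void
    qcn_rp_ai_cycle(struct qcn_rp *rp, double rai)
    {
        rp->rt += rai;
        rp->rc = (rp->rc + rp->rt) / 2.0;
    }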
http://www.ieee802.org/3/ar/public/0505/bergamasco_1_0505.pdf
http://www.ieee802.org/1/files/public/docs2007/au-bergamasco-ecm-v0.1.pdf
http://www.cs.wustl.edu/~jain/papers/ftp/bcn.pdf
http://www.cse.wustl.edu/~jain/papers/ftp/icc08.pdf

Recommendations:

RFC 6298:
 - change the starting RTO from 3s to 1s (in /dctcp)				D4294
 - DO NOT round the RTO up to 1s, contrary to the suggestion here (long done)
 - simplify setting of the minRTO sysctl to eliminate the "slop" component
   (in /dctcp)									D4294

RFC 6928:
 - increase the initial / idle window to 10 segments when connecting to	(done by hiren)
   data center peers

RFC 7323:
 - stop truncating SRTT prematurely on low-latency connections;		D4293
   see Appendix G - this reduces potentially detrimental fluctuations in
   the calculated RTO

Incast:
 - do SW TSO only
 - add rudimentary pacing by interleaving streams
 - fine-grained timers								D4292
 - scale the RTO down to the same granularity as the RTT (patch in progress)

ECN:
 - change the default to allow ECN on incoming connections
 - set ECT on _ALL_ packets sent by a host using a DCTCP connection
 - add a facility to enable ECN by subnet

DCTCP:
 - add a facility to enable DCTCP by subnet
 - set ECT on _ALL_ packets sent by a host using a DCTCP connection
 - update TCP to use microsecond-granularity timers and timestamps (patch in progress)
 - when using the current coarse-grained timers, reduce minRTO to 3ms when	D4294
   using DCTCP; if fine-grained timers are available, disable minRTO when
   using DCTCP
 - reduce delack to 1/5th of min(minRTO, RTO) (reduced to 1/2 in /dctcp)	D4294

ICTCP:
 - if there is time, investigate its use and the ability to use socket buffer
   sizing to communicate the amount of anticipated data, for purposes of TCBs
   sharing the port's connection optimally