From: Matthew Macy <mmacy@nextbsd.org>
To: "freebsd-net@freebsd.org"
Date: Thu, 26 Nov 2015 17:57:35 -0800
Subject: TCP notes and incast recommendations

In an effort to be somewhat current on the state of TCP I've collected a small bibliography. I've tried to summarize the RFCs and papers that I believe to be important and to provide some general background for others who do not have a deeper familiarity with TCP or congestion control - in particular as it impacts DCTCP. The recommendations reference Phabricator changes.

Table Of Contents:

I)    - A Roadmap for Transmission Control Protocol (TCP) Specification Documents (RFC 7414)
II)   - Metrics for the Evaluation of Congestion Control Mechanisms (RFC 5166)
III)  - TCP Congestion Control (RFC 5681)
IV)   - Computing TCP's Retransmission Timer (RFC 6298)
V)    - Increasing TCP's Initial Window (RFC 6928)
VI)   - TCP Extensions for High Performance [RTO updates and changes to RFC 1323] (RFC 7323)
VII)  - Updating TCP to Support Rate-Limited Traffic [Congestion Window Validation] (RFC 7661)
VIII) - Active Queue Management (AQM)
IX)   - Explicit Congestion Notification (ECN)
X)    - Accurate ECN (AccECN)
XI)   - Incast Causes and Solutions
XII)  - Data Center Transmission Control Protocol (DCTCP)
XIII) - Incast TCP (ICTCP)
XIV)  - Quantized Congestion Notification (QCN)
XV)   - Recommendations


A Roadmap for Transmission Control Protocol (TCP) Specification Documents [important]:
https://tools.ietf.org/html/rfc7414

A correct and efficient implementation of the Transmission Control Protocol (TCP) is a critical part of the software of most Internet hosts. As TCP has evolved over the years, many distinct documents have become part of the accepted standard for TCP. At the same time, a large number of experimental modifications to TCP have also been published in the RFC series, along with informational notes, case studies, and other advice. As an introduction to newcomers and an attempt to organize the plethora of information for old hands, this document contains a roadmap to the TCP-related RFCs. It provides a brief summary of the RFC documents that define TCP. This should provide guidance to implementers on the relevance and significance of the standards-track extensions, informational notes, and best current practices that relate to TCP.
This roadmap includes a brief description of the contents of each TCP-related RFC [N.B. I only include an excerpt of the summary for those that I consider interesting or important]. In some cases, we simply supply the abstract or a key summary sentence from the text as a terse description.

In addition, a letter code after an RFC number indicates its category in the RFC series (see BCP 9 [RFC2026] for explanation of these categories):

    S - Standards Track (Proposed Standard, Draft Standard, or Internet Standard)
    E - Experimental
    I - Informational
    H - Historic
    B - Best Current Practice
    U - Unknown (not formally defined)

[2.] Core Functionality

A small number of documents compose the core specification of TCP. These define the required core functionalities of TCP's header parsing, state machine, congestion control, and retransmission timeout computation. These base specifications must be correctly followed for interoperability.

RFC 793 S: "Transmission Control Protocol", STD 7 (September 1981) (Errata)

This is the fundamental TCP specification document [RFC793]. Written by Jon Postel as part of the Internet protocol suite's core, it describes the TCP packet format, the TCP state machine and event processing, and TCP's semantics for data transmission, reliability, flow control, multiplexing, and acknowledgment.

RFC 1122 S: "Requirements for Internet Hosts - Communication Layers" (October 1989)

This document [RFC1122] updates and clarifies RFC 793 (see above in Section 2), fixing some specification bugs and oversights. It also explains some features such as keep-alives and Karn's and Jacobson's RTO estimation algorithms [KP87][Jac88][JK92]. ICMP interactions are mentioned, and some tips are given for efficient implementation. RFC 1122 is an Applicability Statement, listing the various features that MUST, SHOULD, MAY, SHOULD NOT, and MUST NOT be present in standards-conforming TCP implementations. Unlike a purely informational roadmap, this Applicability Statement is a standards document and gives formal rules for implementation.

RFC 2460 S: "Internet Protocol, Version 6 (IPv6) Specification" (December 1998) (Errata)

This document [RFC2460] is of relevance to TCP because it defines how the pseudo-header for TCP's checksum computation is derived when 128-bit IPv6 addresses are used instead of 32-bit IPv4 addresses. Additionally, RFC 2675 (see Section 3.1 of this document) describes TCP changes required to support IPv6 jumbograms.

RFC 2873 S: "TCP Processing of the IPv4 Precedence Field" (June 2000) (Errata)

This document [RFC2873] removes from the TCP specification all processing of the precedence bits of the TOS byte of the IP header. This resolves a conflict over the use of these bits between RFC 793 (see above in Section 2) and Differentiated Services [RFC2474].

RFC 5681 S: "TCP Congestion Control" (August 2009)

Although RFC 793 (see above in Section 2) did not contain any congestion control mechanisms, today congestion control is a required component of TCP implementations. This document [RFC5681] defines the congestion avoidance and control mechanisms for TCP, based on Van Jacobson's 1988 SIGCOMM paper [Jac88].

A number of behaviors that together constitute what the community refers to as "Reno TCP" are described in RFC 5681. The name "Reno" comes from the Net/2 release of the 4.3 BSD operating system. This is generally regarded as the least common denominator among TCP flavors currently found running on Internet hosts.
Reno TCP includes the congestion control features of slow start, congestion avoidance, fast retransmit, and fast recovery. RFC 5681 details the currently accepted congestion control mechanism, while RFC 1122 (see above in Section 2) mandates that such a congestion control mechanism must be implemented. RFC 5681 differs slightly from the other documents listed in this section, as it does not affect the ability of two TCP endpoints to communicate; RFCs 2001 and 2581 are the conceptual precursors of RFC 5681. The most important changes relative to RFC 2581 are:

(a) The initial window requirements were changed to allow larger Initial Windows as standardized in [RFC3390] (see Section 3.2 of this document).

(b) During slow start and congestion avoidance, the usage of Appropriate Byte Counting [RFC3465] (see Section 3.2 of this document) is explicitly recommended.

(c) The use of Limited Transmit [RFC3042] (see Section 3.3 of this document) is now recommended.

RFC 6093 S: "On the Implementation of the TCP Urgent Mechanism" (January 2011)

This document [RFC6093] analyzes how current TCP stacks process TCP urgent indications, ... and recommends against the use of the urgent mechanism.

RFC 6298 S: "Computing TCP's Retransmission Timer" (June 2011)

Abstract of RFC 6298 [RFC6298]: "This document defines the standard algorithm that Transmission Control Protocol (TCP) senders are required to use to compute and manage their retransmission timer. It expands on the discussion in Section 4.2.3.1 of RFC 1122 and upgrades the requirement of supporting the algorithm from a SHOULD to a MUST." RFC 6298 updates RFC 2988 by _changing_ the initial RTO from _3s_ to _1s_ [emphasis mine].

RFC 6691 I: "TCP Options and Maximum Segment Size (MSS)" (July 2012)

This document [RFC6691] clarifies what value to use with the TCP Maximum Segment Size (MSS) option when IP and TCP options are in use.

[3.] Strongly Encouraged Enhancements

This section describes recommended TCP modifications that improve performance and security. Section 3.1 represents fundamental changes to the protocol. Sections 3.2 and 3.3 list improvements over the congestion control and loss recovery mechanisms as specified in RFC 5681 (see Section 2). Section 3.4 describes algorithms that allow a TCP sender to detect whether it has entered loss recovery spuriously. Section 3.5 comprises Path MTU Discovery mechanisms. Schemes for TCP/IP header compression are listed in Section 3.6. Finally, Section 3.7 deals with the problem of preventing acceptance of forged segments and flooding attacks.

[3.1.] Fundamental Changes

RFCs 2675 and 7323 represent fundamental changes to TCP by redefining how parts of the basic TCP header and options are interpreted. RFC 7323 defines the Window Scale option, which reinterprets the advertised receive window. RFC 2675 specifies that MSS option and urgent pointer fields with a value of 65,535 are to be treated specially.

RFC 2675 S: "IPv6 Jumbograms" (August 1999) (Errata)

RFC 7323 S: "TCP Extensions for High Performance" (September 2014)

This document [RFC7323] defines TCP extensions for window scaling, timestamps, and protection against wrapped sequence numbers, for efficient and safe operation over paths with large bandwidth-delay products. These extensions are commonly found in currently used systems. The predecessor of this document, RFC 1323, was published in 1992, and is deployed in most TCP implementations. This document includes fixes and clarifications based on the gained deployment experience.
One specific issue addressed in this specification is a recommendation on how to modify the algorithm for estimating the mean RTT when timestamps are used. RFCs 1072, 1185, and 1323 are the conceptual precursors of RFC 7323.

[3.2.] Congestion Control Extensions

Two of the most important aspects of TCP are its congestion control and loss recovery features. TCP treats lost packets as indicating congestion-related loss and cannot distinguish between congestion-related loss and loss due to transmission errors. Even when ECN is in use, there is a rather intimate coupling between congestion control and loss recovery mechanisms. There are several extensions to both features, and more often than not, a particular extension applies to both. In these two subsections, we group enhancements to TCP's congestion control, while the next subsection focuses on TCP's loss recovery.

RFC 3168 S: "The Addition of Explicit Congestion Notification (ECN) to IP" (September 2001)

This document [RFC3168] defines a means for end hosts to detect congestion before congested routers are forced to discard packets. Although congestion notification takes place at the IP level, ECN requires support at the transport level (e.g., in TCP) to echo the bits and adapt the sending rate. This document updates RFC 793 (see Section 2 of this document) to define two previously unused flag bits in the TCP header for ECN support.

RFC 3390 S: "Increasing TCP's Initial Window" (October 2002)

This document [RFC3390] specifies an increase in the permitted initial window for TCP from one segment to three or four segments during the slow start phase, depending on the segment size.

RFC 3465 E: "TCP Congestion Control with Appropriate Byte Counting (ABC)" (February 2003)

This document [RFC3465] suggests that congestion control use the number of bytes acknowledged instead of the number of acknowledgments received. This change improves the performance of TCP in situations where there is no one-to-one relationship between data segments and acknowledgments (e.g., delayed ACKs or ACK loss). ABC is recommended by RFC 5681 (see Section 2).

RFC 6633 S: "Deprecation of ICMP Source Quench Messages" (May 2012)

This document [RFC6633] formally deprecates the use of ICMP Source Quench messages by transport protocols and recommends against the implementation of [RFC1016].

[3.3.] Loss Recovery Extensions

For the typical implementation of the TCP fast recovery algorithm described in RFC 5681 (see Section 2 of this document), a TCP sender only retransmits a segment after a retransmit timeout has occurred, or after three duplicate ACKs have arrived triggering the fast retransmit. A single RTO might result in the retransmission of several segments, while the fast retransmit algorithm in RFC 5681 leads only to a single retransmission. Hence, multiple losses from a single window of data can lead to a performance degradation. Documents listed in this section aim to improve the overall performance of TCP's standard loss recovery algorithms. In particular, some of them allow TCP senders to recover more effectively when multiple segments are lost from a single flight of data.

RFC 2018 S: "TCP Selective Acknowledgment Options" (October 1996) (Errata)

When more than one packet is lost during one RTT, TCP may experience poor performance since a TCP sender can only learn about a single lost packet per RTT from cumulative acknowledgments. This document [RFC2018] defines the basic selective acknowledgment (SACK) mechanism for TCP, which can help to overcome these limitations.
The receiving TCP returns SACK blocks to inform the sender which data has been received. The sender can then retransmit only the missing data segments.

RFC 3042 S: "Enhancing TCP's Loss Recovery Using Limited Transmit" (January 2001)

Abstract of RFC 3042 [RFC3042]: "This document proposes a new Transmission Control Protocol (TCP) mechanism that can be used to more effectively recover lost segments when a connection's congestion window is small, or when a large number of segments are lost in a single transmission window." The algorithm described in RFC 3042 is called "Limited Transmit". Limited Transmit is recommended by RFC 5681 (see Section 2 of this document).

RFC 6582 S: "The NewReno Modification to TCP's Fast Recovery Algorithm" (April 2012)

This document [RFC6582] specifies a modification to the standard Reno fast recovery algorithm, whereby a TCP sender can use partial acknowledgments to make inferences determining the next segment to send in situations where SACK would be helpful but isn't available. Although it is only a slight modification, the NewReno behavior can make a significant difference in performance when multiple segments are lost from a single window of data.

RFC 6675 S: "A Conservative Loss Recovery Algorithm Based on Selective Acknowledgment (SACK) for TCP" (August 2012)

This document [RFC6675] describes a conservative loss recovery algorithm for TCP that is based on the use of the selective acknowledgment (SACK) TCP option [RFC2018] (see above in Section 3.3). The algorithm conforms to the spirit of the congestion control specification in RFC 5681 (see Section 2 of this document), but allows TCP senders to recover more effectively when multiple segments are lost from a single flight of data. RFC 6675 is a revision of RFC 3517 to address several situations that are not handled explicitly before. In particular,

(a) it improves the loss detection in the event that the sender has outstanding segments that are smaller than Sender Maximum Segment Size (SMSS).

(b) it modifies the definition of a "duplicate acknowledgment" to utilize the SACK information in detecting loss.

(c) it maintains the ACK clock under certain circumstances involving loss at the end of the window.

3.4. Detection and Prevention of Spurious Retransmissions

Spurious retransmission timeouts are harmful to TCP performance and multiple algorithms have been defined for detecting when spurious retransmissions have occurred, but they respond differently with regard to their manners of recovering performance. The IETF defined multiple algorithms because there are trade-offs in whether or not certain TCP options need to be implemented and concerns about IPR status. The Standards Track RFCs in this section are closely related to the Experimental RFCs in Section 4.5 also addressing this topic.

RFC 2883 S: "An Extension to the Selective Acknowledgement (SACK) Option for TCP" (July 2000)

This document [RFC2883] extends RFC 2018 (see Section 3.3 of this document). It enables use of the SACK option to acknowledge duplicate packets. With this extension, called DSACK, the sender is able to infer the order of packets received at the receiver and, therefore, to infer when it has unnecessarily retransmitted a packet. A TCP sender could then use this information to detect spurious retransmissions (see [RFC3708]).
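[To make the DSACK inference above concrete, here is a minimal, hypothetical sketch of what a sender can conclude from a DSACK block; the types and helper name are invented for illustration and are not taken from the FreeBSD SACK code. The idea: a SACK block lying entirely at or below the cumulative ACK is a DSACK, and if it overlaps a range the sender recently retransmitted, the original copy evidently arrived, so the retransmission was spurious.]

#include <stdbool.h>
#include <stdint.h>

/* Modular sequence-number comparisons, as used throughout TCP. */
#define SEQ_LT(a, b)    ((int32_t)((a) - (b)) < 0)
#define SEQ_LEQ(a, b)   ((int32_t)((a) - (b)) <= 0)

struct sack_block {
    uint32_t start;     /* first sequence number covered by the block */
    uint32_t end;       /* sequence number after the last one covered */
};

/*
 * Return true if 'sb' is a DSACK block reporting a duplicate of data in
 * [rexmit_start, rexmit_end), i.e. our retransmission was likely spurious.
 */
bool
dsack_indicates_spurious_rexmit(const struct sack_block *sb, uint32_t cum_ack,
    uint32_t rexmit_start, uint32_t rexmit_end)
{
    bool is_dsack = SEQ_LEQ(sb->end, cum_ack);  /* below the cumulative ACK */
    bool overlaps_rexmit = SEQ_LT(sb->start, rexmit_end) &&
        SEQ_LT(rexmit_start, sb->end);

    return (is_dsack && overlaps_rexmit);
}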
RFC 4015 S: "The Eifel Response Algorithm for TCP" (February 2005) Abstract of RFC 4015 [RFC4015]: "Based on an appropriate detection algorithm, the Eifel response algorithm provides a way for a TCP sender to respond to a detected spurious timeout. RFC 5682 S: "Forward RTO-Recovery (F-RTO): An Algorithm for Detecting Spurious Retransmission Timeouts with TCP" (September 2009) The F-RTO detection algorithm [RFC5682], originally described in RFC 4138, provides an option for inferring spurious retransmission timeouts. Unlike some similar detection methods (e.g., RFCs 3522 and 3708, both listed in Section 4.5 of this document), F-RTO does not rely on the use of any TCP options. The basic idea is to send previously unsent data after the first retransmission after a RTO. If the ACKs advance the window, the RTO may be declared spurious. [3.5.] Path MTU Discovery RFC 1191 S: "Path MTU Discovery" (November 1990) RFC 1981 S: "Path MTU Discovery for IP version 6" (August 1996) RFC 4821 S: "Packetization Layer Path MTU Discovery" (March 2007) Abstract of RFC 4821 [RFC4821]: "This document describes a robust method for Path MTU Discovery (PMTUD) that relies on TCP or some other Packetization Layer to probe an Internet path with progressively larger packets. [3.6.] Header Compression Especially in streaming applications, the overhead of TCP/IP headers could correspond to more than 50% of the total amount of data sent. Such large overheads may be tolerable in wired LANs where capacity is often not an issue, but are excessive for WANs and wireless systems where bandwidth is scarce. Header compression schemes for TCP/IP like RObust Header Compression (ROHC) can significantly compress this overhead. It performs well over links with significant error rates and long round-trip times. RFC 1144 S: "Compressing TCP/IP Headers for Low-Speed Serial Links" (February 1990) RFC 6846 S: "RObust Header Compression (ROHC): A Profile for TCP/IP (ROHC-TCP)" (January 2013) 3.7. Defending Spoofing and Flooding Attacks By default, TCP lacks any cryptographic structures to differentiate legitimate segments from those spoofed from malicious hosts. Spoofing valid segments requires correctly guessing a number of fields. The documents in this subsection describe ways to make that guessing harder or to prevent it from being able to affect a connection negatively. RFC 4953 I: "Defending TCP Against Spoofing Attacks" (July 2007) RFC 4987 I: "TCP SYN Flooding Attacks and Common Mitigations" (August 2007) RFC 5925 S: "The TCP Authentication Option" (June 2010) RFC 5926 S: "Cryptographic Algorithms for the TCP Authentication Option (TCP-AO)" (June 2010) RFC 5927 I: "ICMP Attacks against TCP" (July 2010) RFC 5961 S: "Improving TCP's Robustness to Blind In-Window Attacks" (August 2010) RFC 6528 S: "Defending against Sequence Number Attacks" (February 2012) [4.] Experimental Extensions The RFCs in this section are either Experimental and may become Proposed Standards in the future or are Proposed Standards (or Informational), but can be considered experimental due to lack of wide deployment. At least part of the reason that they are still experimental is to gain more wide-scale experience with them before a standards track decision is made. [4.1.] Architectural Guidelines As multiple flows may share the same paths, sections of paths, or other resources, the TCP implementation may benefit from sharing information across TCP connections or other flows. 
Some experimental proposals have been documented and some implementations have included the concepts.

RFC 2140 I: "TCP Control Block Interdependence" (April 1997)

RFC 3124 S: "The Congestion Manager" (June 2001)

This document [RFC3124] is a related proposal to RFC 2140 (see above in Section 4.1). The idea behind the Congestion Manager, moving congestion control outside of individual TCP connections, represents a modification to the core of TCP, which supports sharing information among TCP connections. Although a Proposed Standard, some pieces of the Congestion Manager support architecture have not been specified yet, and it has not achieved use or implementation beyond experimental stacks, so it is not listed among the standard TCP enhancements in this roadmap.

[4.2.] Fundamental Changes

Like the Standards Track documents listed in Section 3.1, there also exist new Experimental RFCs that specify fundamental changes to TCP. At the time of writing, the only example so far is TCP Fast Open that deviates from the standard TCP semantics of [RFC793].

RFC 7413 E: "TCP Fast Open" (December 2014)

This document [RFC7413] describes TCP Fast Open that allows data to be carried in the SYN and SYN-ACK packets and consumed by the receiver during the initial connection handshake.

[4.3.] Congestion Control Extensions

TCP congestion control has been an extremely active research area for many years (see RFC 5783 discussed in Section 7.6 of this document), as it determines the performance of many applications that use TCP. A number of Experimental RFCs address issues with flow start up, overshoot, and steady-state behavior in the basic algorithms of RFC 5681 (see Section 2 of this document). In these subsections, enhancements to TCP's congestion control are listed.

RFC 2861 E: "TCP Congestion Window Validation" (June 2000)
RFC 3540 E: "Robust Explicit Congestion Notification (ECN) Signaling with Nonces" (June 2003)
RFC 3649 E: "HighSpeed TCP for Large Congestion Windows" (December 2003)
RFC 3742 E: "Limited Slow-Start for TCP with Large Congestion Windows" (March 2004)
RFC 4782 E: "Quick-Start for TCP and IP" (January 2007) (Errata)
RFC 5562 E: "Adding Explicit Congestion Notification (ECN) Capability to TCP's SYN/ACK Packets" (June 2009)
RFC 5690 I: "Adding Acknowledgement Congestion Control to TCP" (February 2010)

RFC 6928 E: "Increasing TCP's Initial Window" (April 2013)

This document [RFC6928] proposes to increase the TCP initial window from between 2 and 4 segments, as specified in RFC 3390 (see Section 3.2 of this document), to 10 segments with a fallback to the existing recommendation when performance issues are detected.

[4.4.] Loss Recovery Extensions

RFC 5827 E: "Early Retransmit for TCP and Stream Control Transmission Protocol (SCTP)" (April 2010)

This document [RFC5827] proposes the "Early Retransmit" mechanism for TCP (and SCTP) that can be used to recover lost segments when a connection's congestion window is small. In certain special circumstances, Early Retransmit reduces the number of duplicate acknowledgments required to trigger fast retransmit to recover segment losses without waiting for a lengthy retransmission timeout.
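[As a rough sketch of what "reduces the number of duplicate acknowledgments required" means in practice: the segment-based variant of Early Retransmit lowers the duplicate-ACK threshold when the outstanding window is too small to produce three duplicate ACKs and there is nothing new to send. The helper below is illustrative only - my paraphrase of the RFC 5827 idea, not an excerpt from it or from any stack.]

#include <stdbool.h>

/*
 * Illustrative duplicate-ACK threshold selection in the spirit of Early
 * Retransmit (RFC 5827), segment-based variant.  'oseg' is the number of
 * outstanding (unacknowledged) segments.
 */
int
dupack_threshold(int oseg, bool new_data_available)
{
    /*
     * With fewer than four segments outstanding and nothing new to send
     * (so Limited Transmit cannot generate the missing duplicate ACKs),
     * lower the threshold to oseg - 1, but never below 1.
     */
    if (oseg < 4 && !new_data_available)
        return (oseg > 2 ? oseg - 1 : 1);

    return (3);     /* classic fast retransmit threshold */
}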
RFC 6069 E: "Making TCP More Robust to Long Connectivity Disruptions (TCP-LCD)" (December 2010)

RFC 6937 E: "Proportional Rate Reduction for TCP" (May 2013)

This document [RFC6937] describes an experimental Proportional Rate Reduction (PRR) algorithm as an alternative to the widely deployed Fast Recovery algorithm, to improve the accuracy of the amount of data sent by TCP during loss recovery.

[4.5.] Detection and Prevention of Spurious Retransmissions

In addition to the Standards Track extensions to deal with spurious retransmissions in Section 3.4, Experimental proposals have also been documented.

RFC 3522 E: "The Eifel Detection Algorithm for TCP" (April 2003)
RFC 3708 E: "Using TCP Duplicate Selective Acknowledgement (DSACKs) and Stream Control Transmission Protocol (SCTP) Duplicate Transmission Sequence Numbers (TSNs) to Detect Spurious Retransmissions" (February 2004)
RFC 4653 E: "Improving the Robustness of TCP to Non-Congestion Events" (August 2006)

[4.6.] TCP Timeouts

RFC 5482 S: "TCP User Timeout Option" (March 2009)

[4.7.] Multipath TCP

MultiPath TCP (MPTCP) is an ongoing effort within the IETF that allows a TCP connection to simultaneously use multiple IP addresses / interfaces to spread its data across several subflows, while presenting a regular TCP interface to applications. Benefits of this include better resource utilization, better throughput and smoother reaction to failures. The documents listed in this section specify the Multipath TCP scheme, while the documents in Sections 7.2, 7.4, and 7.5 provide some additional background information.

RFC 6356 E: "Coupled Congestion Control for Multipath Transport Protocols" (October 2011)
RFC 6824 E: "TCP Extensions for Multipath Operation with Multiple Addresses" (January 2013) (Errata)

[5.] TCP Parameters at IANA

RFC 2780 B: "IANA Allocation Guidelines For Values In the Internet Protocol and Related Headers" (March 2000)
RFC 4727 S: "Experimental Values in IPv4, IPv6, ICMPv4, ICMPv6, UDP, and TCP Headers" (November 2006)
RFC 6335 B: "Internet Assigned Numbers Authority (IANA) Procedures for the Management of the Service Name and Transport Protocol Port Number Registry" (August 2011)
RFC 6994 S: "Shared Use of Experimental TCP Options" (August 2013)

[7.] Support Documents

This section contains several classes of documents that do not necessarily define current protocol behaviors but that are nevertheless of interest to TCP implementers. Section 7.1 describes several foundational RFCs that give modern readers a better understanding of the principles underlying TCP's behaviors and development over the years. Section 7.2 contains architectural guidelines and principles for TCP architects and designers. The documents listed in Section 7.3 provide advice on using TCP in various types of network situations that pose challenges above those of typical wired links. Guidance for developing, analyzing, and evaluating TCP is given in Section 7.4. Some implementation notes and implementation advice can be found in Section 7.5. RFCs that describe tools for testing and debugging TCP implementations or that contain high-level tutorials on the protocol are listed in Section 7.6. The TCP Management Information Bases are described in Section 7.7, and Section 7.8 lists a number of case studies that have explored TCP performance.

7.4. Guidance for Developing, Analyzing, and Evaluating TCP

Documents in this section give general guidance for developing, analyzing, and evaluating TCP.
Some of the documents discuss, for example, the properties of congestion control protocols that are "safe" for Internet deployment as well as how to measure the properties of congestion control mechanisms and transport protocols.

RFC 5033 B: "Specifying New Congestion Control Algorithms" (August 2007)

This document [RFC5033] considers the evaluation of suggested congestion control algorithms that differ from the principles outlined in RFC 2914 (see Section 7.2 of this document). It is useful for authors of such algorithms as well as for IETF members reviewing the associated documents.

RFC 5166 I: "Metrics for the Evaluation of Congestion Control Mechanisms" (March 2008)

This document [RFC5166] discusses metrics that need to be considered when evaluating new or modified congestion control mechanisms for the Internet. Among other topics, the document discusses throughput, delay, loss rates, response times, fairness, and robustness for challenging environments.

RFC 6077 I: "Open Research Issues in Internet Congestion Control" (February 2011)

This document [RFC6077] summarizes the main open problems in the domain of Internet congestion control. As a good starting point for newcomers, the document describes several new challenges that are becoming important as the network grows, as well as some issues that have been known for many years.

RFC 6181 I: "Threat Analysis for TCP Extensions for Multipath Operation with Multiple Addresses" (March 2011)

This document [RFC6181] describes a threat analysis for Multipath TCP (MPTCP) (see Section 4.7 of this document). The document discusses several types of attacks and provides recommendations for MPTCP designers on how to create an MPTCP specification that is as secure as the current (single-path) TCP.

RFC 6349 I: "Framework for TCP Throughput Testing" (August 2011)

From the Abstract of RFC 6349 [RFC6349]: "This framework describes a practical methodology for measuring end-to-end TCP Throughput in a managed IP network. The goal is to provide a better indication in regard to user experience. In this framework, TCP and IP parameters are specified to optimize TCP Throughput."

7.5. Implementation Advice

RFC 794 U: "PRE-EMPTION" (September 1981)

This document [RFC794] clarifies that operating systems need to manage their limited resources, which may include TCP connection state, and that these decisions can be made with application input, but they do not need to be part of the TCP protocol specification itself.

RFC 879 U: "The TCP Maximum Segment Size and Related Topics" (November 1983)
RFC 1071 U: "Computing the Internet Checksum" (September 1988) (Errata)
RFC 1624 I: "Computation of the Internet Checksum via Incremental Update" (May 1994)
RFC 1936 I: "Implementing the Internet Checksum in Hardware" (April 1996)
RFC 2525 I: "Known TCP Implementation Problems" (March 1999)
RFC 2923 I: "TCP Problems with Path MTU Discovery" (September 2000)
RFC 3493 I: "Basic Socket Interface Extensions for IPv6" (February 2003)
RFC 6056 B: "Recommendations for Transport-Protocol Port Randomization" (December 2010)
RFC 6191 B: "Reducing the TIME-WAIT State Using TCP Timestamps" (April 2011)
RFC 6429 I: "TCP Sender Clarification for Persist Condition" (December 2011)
RFC 6897 I: "Multipath TCP (MPTCP) Application Interface Considerations" (March 2013)

7.6. Tools and Tutorials

RFC 1180 I: "TCP/IP Tutorial" (January 1991) (Errata)

This document [RFC1180] is an extremely brief overview of the TCP/IP protocol suite as a whole.
It gives some explanation as to how and where TCP fits in. RFC 1470 I: "FYI on a Network Management Tool Catalog: Tools for Monitoring and Debugging TCP/IP Internets and Interconnected Devices" (June 1993) A few of the tools that this document [RFC1470] describes are still maintained and in use today, for example, ttcp and tcpdump. However, many of the tools described do not relate specifically to TCP and are no longer used or easily available. RFC 2398 I: "Some Testing Tools for TCP Implementors" (August 1998) This document [RFC2398] describes a number of TCP packet generation and analysis tools. Although some of these tools are no longer readily available or widely used, for the most part they are still relevant and usable. RFC 5783 I: "Congestion Control in the RFC Series" (February 2010) This document [RFC5783] provides an overview of RFCs related to congestion control that had been published at the time. The focus of the document is on end-host-based congestion control. 8. Undocumented TCP Features There are a few important implementation tactics for the TCP that have not yet been described in any RFC. Although this roadmap is primarily concerned with mapping the TCP RFCs, this section is included because an implementer needs to be aware of these important issues. Header Prediction Header prediction is a trick to speed up the processing of segments. Van Jacobson and Mike Karels developed the technique in the late 1980s. The basic idea is that some processing time can be saved when most of a segment's fields can be predicted from previous segments. A good description of this was sent to the TCP-IP mailing list by Van Jacobson on March 9, 1988 (see [Jacobson] for the full message): Quite a bit of the speedup comes from an algorithm that we ('we' refers to collaborator Mike Karels and myself) are calling "header prediction". The idea is that if you're in the middle of a bulk data transfer and have just seen a packet, you know what the next packet is going to look like: It will look just like the current packet with either the sequence number or ack number updated (depending on whether you're the sender or receiver). Combining this with the "Use hints" epigram from Butler Lampson's classic "Epigrams for System Designers", you start to think of the tcp state (rcv.nxt, snd.una, etc.) as "hints" about what the next packet should look like. If you arrange those "hints" so they match the layout of a tcp packet header, it takes a single 14-byte compare to see if your prediction is correct (3 longword compares to pick up the send & ack sequence numbers, header length, flags and window, plus a short compare on the length). If the prediction is correct, there's a single test on the length to see if you're the sender or receiver followed by the appropriate processing. E.g., if the length is non-zero (you're the receiver), checksum and append the data to the socket buffer then wake any process that's sleeping on the buffer. Update rcv.nxt by the length of this packet (this updates your "prediction" of the next packet). Check if you can handle another packet the same size as the current one. If not, set one of the unused flag bits in your header prediction to guarantee that the prediction will fail on the next packet and force you to go through full protocol processing. Otherwise, you're done with this packet. So, the *total* tcp protocol processing, exclusive of checksumming, is on the order of 6 compares and an add. 
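[To make the quoted description a bit more concrete, here is a simplified sketch of the fast-path check, loosely in the style of a BSD tcp_input(). The structure and field names are illustrative simplifications, not the actual FreeBSD code; the real implementation packs the "hints" so a single wide compare does most of this work, as Jacobson describes.]

#include <stdint.h>

#define TH_PUSH 0x08
#define TH_ACK  0x10

#define SEQ_GT(a, b)    ((int32_t)((a) - (b)) > 0)
#define SEQ_LEQ(a, b)   ((int32_t)((a) - (b)) <= 0)

struct tcb_hints {
    uint32_t rcv_nxt;   /* next sequence number expected */
    uint32_t snd_una;   /* oldest unacknowledged byte */
    uint32_t snd_nxt;   /* next byte to send */
    uint32_t snd_max;   /* highest byte ever sent */
    uint32_t snd_wnd;   /* last advertised window we saw */
};

/*
 * Returns nonzero when the incoming segment matches the prediction: only
 * ACK (and possibly PUSH) set, exactly the expected sequence number, an
 * unchanged window, no retransmission in progress, and the segment is
 * either a pure ACK for new data (we are the sender) or pure in-sequence
 * data carrying no new ACK information (we are the receiver).
 */
int
header_predicted(const struct tcb_hints *tp, uint32_t seq, uint32_t ack,
    uint8_t flags, uint32_t wnd, int len)
{
    if ((flags & ~(TH_ACK | TH_PUSH)) != 0 || !(flags & TH_ACK))
        return (0);
    if (seq != tp->rcv_nxt || wnd != tp->snd_wnd)
        return (0);
    if (tp->snd_nxt != tp->snd_max)     /* retransmission outstanding */
        return (0);

    if (len == 0)   /* pure ACK: must newly acknowledge data we sent */
        return (SEQ_GT(ack, tp->snd_una) && SEQ_LEQ(ack, tp->snd_max));

    /* pure data: the ACK field must carry nothing new */
    return (ack == tp->snd_una);
}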
Forward Acknowledgement (FACK)

FACK [MM96] includes an alternate algorithm for triggering fast retransmit [RFC5681], based on the extent of the SACK scoreboard. Its goal is to trigger fast retransmit as soon as the receiver's reassembly queue is larger than the duplicate ACK threshold, as indicated by the difference between the forward-most SACK block edge and SND.UNA. This algorithm quickly and reliably triggers fast retransmit in the presence of burst losses -- often on the first SACK following such a loss. Such a threshold-based algorithm also triggers fast retransmit immediately in the presence of any reordering with extent greater than the duplicate ACK threshold. FACK is implemented in Linux and turned on by default.
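[A sketch of the trigger condition just described, with illustrative names - snd_fack standing for the forward-most SACKed sequence number; this is not the Linux or FreeBSD code.]

#include <stdint.h>

#define DUPACK_THRESH   3

/*
 * FACK-style fast retransmit trigger: retransmit as soon as the SACK
 * scoreboard shows that more than dupthresh segments beyond SND.UNA
 * have already reached the receiver.
 */
int
fack_fast_retransmit_due(uint32_t snd_fack, uint32_t snd_una, uint32_t smss)
{
    return ((snd_fack - snd_una) > DUPACK_THRESH * smss);
}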
Congestion Control for High Rate Flows

In the last decade significant research effort has been put into experimental TCP congestion control modifications for obtaining high throughput with reduced startup and recovery times. Only a few RFCs have been published on some of these modifications, including HighSpeed TCP [RFC3649], Limited Slow-Start [RFC3742], and Quick-Start [RFC4782] (see Section 4.3 of this document for more information on each), but high-rate congestion control mechanisms are still considered an open issue in congestion control research. Some other schemes have been published as Internet-Drafts, e.g. CUBIC [CUBIC] (the standard TCP congestion control algorithm in Linux), Compound TCP [CTCP], and H-TCP [HTCP], or have been discussed a little by the IETF, but much of the work in this area has not been adopted within the IETF yet, so the majority of this work is outside the RFC series and may be discussed in other products of the IRTF Internet Congestion Control Research Group (ICCRG).


Metrics for the Evaluation of Congestion Control Mechanisms
https://tools.ietf.org/html/rfc5166

Discusses the metrics to be considered in an evaluation of new or modified congestion control mechanisms for the Internet. These include metrics for the evaluation of new transport protocols, of proposed modifications to TCP, of application-level congestion control, and of Active Queue Management (AQM) mechanisms in the router. This document is the first in a series of documents aimed at improving the models that we use in the evaluation of transport protocols.

Types Of Metrics:

- Throughput, Delay, and Loss Rates

  - Throughput: can be measured as
    - a router-based metric of aggregate link utilization
    - a flow-based metric of per-connection transfer times
    - a user-based metric of utility functions or user wait times

  - Goodput: sometimes distinguished from throughput, where throughput is the link utilization or flow rate in bytes per second; goodput is the subset of throughput (also measured in bytes/s) consisting of useful traffic [i.e. excluding duplicate packets].

  - Delay: Like throughput, delay can be measured as a router-based metric of queueing delay over time, or as a flow-based metric in terms of per-packet transfer times. Per-packet delay can also include delay at the sender waiting for the transport protocol to send the packet. For reliable transfer, the per-packet transfer time seen by the application includes the possible delay of retransmitting a lost packet.

  - Packet Loss Rates: can be measured as a network-based or as a flow-based metric. One network-related reason to avoid high steady-state packet loss rates is to avoid congestion collapse in environments containing paths with multiple congested links.

- Response Times and Minimizing Oscillations

  - Response to Changes: One of the key concerns in the design of congestion control mechanisms has been the response times to sudden congestion in the network. On the one hand, congestion control mechanisms should respond reasonably promptly to sudden congestion from routing or bandwidth changes or from a burst of competing traffic. At the same time, congestion control mechanisms should not respond too severely to transient changes, e.g., to a sudden increase in delay that will dissipate in less than the connection's round-trip time.

  - Minimizing Oscillations: One goal is that of stability, in terms of minimizing oscillations of queueing delay or of throughput. In practice, stability is frequently associated with rate fluctuations or variance. Rate variations can result in fluctuations in router queue size and therefore in queue overflows. These queue overflows can cause loss synchronizations across coexisting flows and periodic under-utilization of link capacity, both of which are considered to be general signs of network instability. Thus, measuring the rate variations of flows is often used to measure the stability of transport protocols. To measure rate variations, [JWL04], [RX05], and [FHPW00] use the coefficient of variation (CoV) of per-flow transmission rates, and [WCL05] suggests the use of standard deviations of per-flow rates. Since rate variations are a function of time scales, it makes sense to measure these rate variations over various time scales.

- Fairness and Convergence

  - Fairness between Flows: let x_i be the throughput for the i-th connection.

    - Jain's fairness index: The fairness index in [JCH84] is:

        (( sum_i x_i )^2) / (n * sum_i ( (x_i)^2 )),

      where there are n users. This fairness index ranges from 0 to 1, and it is maximum when all users receive the same allocation. This index is k/n when k users equally share the resource, and the other n-k users receive zero allocation.

    - The product measure:

        product_i x_i

      the product of the throughput of the individual connections, is also used as a measure of fairness. (In some contexts x_i is taken as the power of the i-th connection, and the product measure is referred to as network power.) The product measure is particularly sensitive to segregation; the product measure is zero if any connection receives zero throughput. [N.B. If one normalizes to actual bandwidth by taking the Nth root of the product, where N = number of connections, this is the geometric mean. The geometric mean will be less than the arithmetic mean unless all flows have equivalent throughput.]

    - Epsilon-fairness: A rate allocation is defined as epsilon-fair if

        (min_i x_i) / (max_i x_i) >= 1 - epsilon

      Epsilon-fairness measures the worst-case ratio between any two throughput rates [ZKL04]. Epsilon-fairness is related to max-min fairness.
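[To make the three single-resource fairness definitions above concrete, here are toy helpers that compute them over an array of per-flow throughputs (n > 0 assumed). These are purely illustrative and not part of any measurement tool.]

#include <math.h>
#include <stddef.h>

/* Jain's fairness index: (sum x_i)^2 / (n * sum x_i^2); 1.0 == perfectly fair. */
double
jain_index(const double *x, size_t n)
{
    double sum = 0.0, sumsq = 0.0;

    for (size_t i = 0; i < n; i++) {
        sum += x[i];
        sumsq += x[i] * x[i];
    }
    return ((sum * sum) / ((double)n * sumsq));
}

/* Nth root of the product measure, i.e. the geometric mean rate. */
double
geometric_mean_rate(const double *x, size_t n)
{
    double logsum = 0.0;

    for (size_t i = 0; i < n; i++)
        logsum += log(x[i]);    /* tends to -inf (mean 0) if any flow is starved */
    return (exp(logsum / (double)n));
}

/* Smallest epsilon for which the allocation is epsilon-fair: 1 - min/max. */
double
epsilon_unfairness(const double *x, size_t n)
{
    double min = x[0], max = x[0];

    for (size_t i = 1; i < n; i++) {
        if (x[i] < min)
            min = x[i];
        if (x[i] > max)
            max = x[i];
    }
    return (1.0 - min / max);
}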
  - Fairness between Flows with Different Resource Requirements

    - Max-min fairness: In order to satisfy the max-min fairness criteria, the smallest throughput rate must be as large as possible. Given this condition, the next-smallest throughput rate must be as large as possible, and so on. Thus, the max-min fairness gives absolute priority to the smallest flows. (Max-min fairness can be explained by the progressive filling algorithm, where all flow rates start at zero, and the rates all grow at the same pace. Each flow rate stops growing only when one or more links on the path reach link capacity.)

    - Proportional fairness: A feasible allocation, x, is defined as proportionally fair if, for any other feasible allocation x*, the aggregate of proportional changes is zero or negative:

        sum_i ( (x*_i - x_i)/x_i ) <= 0.

      "This criterion favours smaller flows, but less emphatically than max-min fairness" [K01]. (Using the language of utility functions, proportional fairness can be achieved by using logarithmic utility functions, and maximizing the sum of the per-flow utility functions; see [KMT98] for a fuller explanation.)

    - Minimum potential delay fairness: Minimum potential delay fairness has been shown to model TCP [KS03], and is a compromise between max-min fairness and proportional fairness. An allocation, x, is defined as having minimum potential delay fairness if

        sum_i (1/x_i)

      is smaller than for any other feasible allocation. That is, it would minimize the average download time if each flow was an equal-sized file.

  - Comments on Fairness

    - Trade-offs between fairness and throughput: The fairness measures in the section above generally measure both fairness and throughput, giving different weights to each. Potential trade-offs between fairness and throughput are also discussed by Tang, et al. in [TWL06], for a framework where max-min fairness is defined as the most fair. In particular, [TWL06] shows that in some topologies, throughput is proportional to fairness, while in other topologies, throughput is inversely proportional to fairness.

    - Fairness and the number of congested links: Some of these fairness metrics are discussed in more detail in [F91]. We note that there is not a clear consensus for the fairness goals, in particular for fairness between flows that traverse different numbers of congested links [F91]. Utility maximization provides one framework for describing this trade-off in fairness.

    - Fairness and round-trip times: One goal cited in a number of new transport protocols has been that of fairness between flows with different round-trip times [KHR02] [XHR04]. We note that there is not a consensus in the networking community about the desirability of this goal, or about the implications and interactions between this goal and other metrics [FJ92] (Section 3.3). One common argument against the goal of fairness between flows with different round-trip times has been that flows with long round-trip times consume more resources; this aspect is covered by the previous paragraph. Researchers have also noted the difference between the RTT-unfairness of standard TCP, and the greater RTT-unfairness of some proposed modifications to TCP [LLS05].

    - Fairness and packet size: One fairness issue is that of the relative fairness for flows with different packet sizes. Many file transfer applications will use the maximum packet size possible; in contrast, low-bandwidth VoIP flows are likely to send small packets, sending a new packet every 10 to 40 ms, to limit delay. Should a small-packet VoIP connection receive the same sending rate in *bytes* per second as a large-packet TCP connection in the same environment, or should it receive the same sending rate in *packets* per second?
      This fairness issue has been discussed in more detail in [RFC3714], with [RFC4828] also describing the ways that packet size can affect the packet drop rate experienced by a flow.

  - Convergence times: Convergence times concern the time for convergence to fairness between an existing flow and a newly starting one, and are a special concern for environments with high-bandwidth long-delay flows. Convergence times also concern the time for convergence to fairness after a sudden change such as a change in the network path, the competing cross-traffic, or the characteristics of a wireless link. As with fairness, convergence times can matter both between flows of the same protocol, and between flows using different protocols [SLFK03]. One metric used for convergence times is the delta-fair convergence time, defined as the time taken for two flows with the same round-trip time to go from shares of 100/101-th and 1/101-th of the link bandwidth, to having close to fair sharing with shares of (1+delta)/2 and (1-delta)/2 of the link bandwidth [BBFS01]. A similar metric for convergence times measures the convergence time as the number of round-trip times for two flows to reach epsilon-fairness, when starting from a maximally-unfair state [ZKL04].


TCP Congestion Control (RFC 5681):
http://www.rfc-editor.org/rfc/rfc5681.txt

Specifies four TCP congestion control algorithms: slow start, congestion avoidance, fast retransmit and fast recovery. They were devised in [Jac88] and [Jac90]. Their use with TCP is standardized in [RFC1122]. In addition, the document specifies what TCP connections should do after a relatively long idle period, as well as clarifying some of the issues pertaining to TCP ACK generation. Obsoletes [RFC2581], which in turn obsoleted [RFC2001].

The slow start and congestion avoidance algorithms MUST be used by the TCP sender to control the amount of outstanding data being injected into the network. These algorithms add three state variables:

- Congestion Window (cwnd): a sender-side limit on the amount of data the sender can transmit before receiving an ACK.
- Receiver's Advertised Window (rwnd): a receiver-side limit on the amount of outstanding data.
- Slow Start Threshold (ssthresh): used to determine whether the slow start or congestion avoidance algorithm is used to control data transmission.

Slow Start: Used to determine available link capacity at the beginning of a transfer, after repairing loss detected by the retransmission timer, or [potentially] after a long idle period. It is additionally used to start the "ACK clock".

- SMSS: Sender Maximum Segment Size
- IW: Initial Window, the initial value of cwnd, MUST be set using the following guidelines as an upper bound:

    If SMSS > 2190 bytes:
        IW = 2 * SMSS bytes and MUST NOT be more than 2 segments
    If (SMSS > 1095 bytes) and (SMSS <= 2190 bytes):
        IW = 3 * SMSS bytes and MUST NOT be more than 3 segments
    If SMSS <= 1095 bytes:
        IW = 4 * SMSS bytes and MUST NOT be more than 4 segments

- Ssthresh:
  - SHOULD be set arbitrarily high (e.g., to the size of the largest possible advertised window), but ssthresh MUST be reduced in response to congestion.
  - The slow start algorithm is used when cwnd < ssthresh, while the congestion avoidance algorithm is used when cwnd > ssthresh. When cwnd and ssthresh are equal, the sender may use either slow start or congestion avoidance.
- When a TCP sender detects segment loss using the retransmission timer and the given segment has not yet been resent once by way of the retransmission timer, the value of ssthresh MUST be set to no more than the value given in equation (4):

    ssthresh = max (FlightSize / 2, 2*SMSS)            (4)

  where FlightSize is the amount of outstanding data in the network.

- Growing cwnd: During slow start, a TCP increments cwnd by at most SMSS bytes for each ACK received that cumulatively acknowledges new data. Slow start ends when cwnd reaches or exceeds ssthresh.
  - While traditionally TCP implementations have increased cwnd by precisely SMSS bytes upon receipt of an ACK covering new data, we RECOMMEND that TCP implementations increase cwnd, per:

        cwnd += min (N, SMSS)                          (2)

    where N is the number of previously unacknowledged bytes acknowledged in the incoming ACK.

Congestion Avoidance: during congestion avoidance, cwnd is incremented by roughly 1 full-sized segment per RTT. Congestion avoidance continues until congestion is detected. The basic guidelines for incrementing cwnd are:

- MAY increment cwnd by SMSS bytes
- SHOULD increment cwnd per equation (2) once per RTT
- MUST NOT increment cwnd by more than SMSS bytes

[RFC3465] allows for cwnd increases of more than SMSS bytes for incoming acknowledgments during slow start on an experimental basis; however, such behavior is not allowed as part of the standard.

Another common formula that a TCP MAY use to update cwnd during congestion avoidance is given in equation (3):

    cwnd += SMSS*SMSS/cwnd                             (3)

This adjustment is executed on every incoming ACK that acknowledges new data. Equation (3) provides an acceptable approximation to the underlying principle of increasing cwnd by 1 full-sized segment per RTT.

Upon a timeout (as specified in [RFC2988]) cwnd MUST be set to no more than the loss window, LW, which equals 1 full-sized segment (regardless of the value of IW). Therefore, after retransmitting the dropped segment the TCP sender uses the slow start algorithm to increase the window from 1 full-sized segment to the new value of ssthresh, at which point congestion avoidance again takes over.
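[A minimal sketch of how equations (2) and (3) fit together on the ACK path. This is a simplification - no ABC limit beyond eq. (2), no SACK or recovery interaction - with illustrative names, not the FreeBSD NewReno module.]

#include <stdint.h>

/*
 * Apply the RFC 5681 window-growth rules for one incoming ACK that newly
 * acknowledges 'acked' bytes.
 */
void
cwnd_ack_received(uint32_t *cwnd, uint32_t ssthresh, uint32_t acked,
    uint32_t smss)
{
    if (*cwnd < ssthresh) {
        /* Slow start, eq. (2): grow by min(N, SMSS) per ACK. */
        *cwnd += (acked < smss) ? acked : smss;
    } else {
        /* Congestion avoidance, eq. (3): roughly one SMSS per RTT. */
        uint32_t incr = (uint32_t)(((uint64_t)smss * smss) / *cwnd);

        *cwnd += (incr > 0) ? incr : 1;
    }
}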
Fast Retransmit/Fast Recovery: A TCP receiver SHOULD send an immediate duplicate ACK when an out-of-order segment arrives. The purpose of this ACK is to inform the sender that a segment was received out-of-order and which sequence number is expected. In addition, a TCP receiver SHOULD send an immediate ACK when the incoming segment fills in all or part of a gap in the sequence space. This will generate more timely information for a sender recovering from a loss through a retransmission timeout, a fast retransmit, or an advanced loss recovery algorithm.

The TCP sender SHOULD use the "fast retransmit" algorithm to detect and repair loss, based on incoming duplicate ACKs. The fast retransmit algorithm uses the arrival of 3 duplicate ACKs as an indication that a segment has been lost. TCP then performs a retransmission of what appears to be the missing segment, without waiting for the retransmission timer to expire.

The fast retransmit and fast recovery algorithms are implemented together as follows.

1. On the first and second duplicate ACKs received at a sender, a TCP SHOULD send a segment of previously unsent data per [RFC3042] provided that the receiver's advertised window allows, the total FlightSize would remain less than or equal to cwnd plus 2*SMSS, and that new data is available for transmission. Further, the TCP sender MUST NOT change cwnd to reflect these two segments [RFC3042]. Note that a sender using SACK [RFC2018] MUST NOT send new data unless the incoming duplicate acknowledgment contains new SACK information.

2. When the third duplicate ACK is received, a TCP MUST set ssthresh to no more than the value given in equation (4). When [RFC3042] is in use, additional data sent in limited transmit MUST NOT be included in this calculation.

3. The lost segment starting at SND.UNA MUST be retransmitted and cwnd set to ssthresh plus 3*SMSS. This artificially "inflates" the congestion window by the number of segments (three) that have left the network and which the receiver has buffered.

4. For each additional duplicate ACK received (after the third), cwnd MUST be incremented by SMSS. This artificially inflates the congestion window in order to reflect the additional segment that has left the network.

   Note: [SCWA99] discusses a receiver-based attack whereby many bogus duplicate ACKs are sent to the data sender in order to artificially inflate cwnd and cause a higher than appropriate sending rate to be used. A TCP MAY therefore limit the number of times cwnd is artificially inflated during loss recovery to the number of outstanding segments (or, an approximation thereof).

   Note: When an advanced loss recovery mechanism (such as outlined in section 4.3) is not in use, this increase in FlightSize can cause equation (4) to slightly inflate cwnd and ssthresh, as some of the segments between SND.UNA and SND.NXT are assumed to have left the network but are still reflected in FlightSize.

5. When previously unsent data is available and the new value of cwnd and the receiver's advertised window allow, a TCP SHOULD send 1*SMSS bytes of previously unsent data.

6. When the next ACK arrives that acknowledges previously unacknowledged data, a TCP MUST set cwnd to ssthresh (the value set in step 2). This is termed "deflating" the window. This ACK should be the acknowledgment elicited by the retransmission from step 3, one RTT after the retransmission (though it may arrive sooner in the presence of significant out-of-order delivery of data segments at the receiver). Additionally, this ACK should acknowledge all the intermediate segments sent between the lost segment and the receipt of the third duplicate ACK, if none of these were lost.

Note: This algorithm is known to generally not recover efficiently from multiple losses in a single flight of packets.
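[The numbered steps above map onto a small amount of per-connection state. The sketch below covers steps 2-4 and 6 only; limited transmit (step 1) and SACK are omitted, and the state layout and helper name are illustrative rather than the FreeBSD implementation.]

#include <stdint.h>

struct cc_state {
    uint32_t cwnd, ssthresh, smss;
    uint32_t flight_size;   /* outstanding data, in bytes */
    uint32_t snd_una;
    int      dupacks;
};

/* Assumed helper: queue a retransmission of the segment starting at 'seq'. */
void retransmit_segment(struct cc_state *s, uint32_t seq);

void
cc_duplicate_ack(struct cc_state *s)
{
    s->dupacks++;
    if (s->dupacks == 3) {
        /* Step 2: ssthresh = max(FlightSize / 2, 2*SMSS), eq. (4). */
        uint32_t half = s->flight_size / 2;

        s->ssthresh = (half > 2 * s->smss) ? half : 2 * s->smss;

        /* Step 3: retransmit SND.UNA and inflate cwnd by 3 segments. */
        retransmit_segment(s, s->snd_una);
        s->cwnd = s->ssthresh + 3 * s->smss;
    } else if (s->dupacks > 3) {
        /* Step 4: one more segment has left the network. */
        s->cwnd += s->smss;
    }
}

void
cc_new_ack(struct cc_state *s)
{
    if (s->dupacks >= 3)
        s->cwnd = s->ssthresh;  /* Step 6: deflate the window. */
    s->dupacks = 0;
}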
RTO:
https://tools.ietf.org/html/rfc6298

Does not modify the behaviour in RFC 5681. The RTO is a function of two state variables, SRTT and RTTVAR. The following constants are used for calculations:

    G <- clock granularity in seconds
    K <- 4

[(2.1)] Until a round-trip time (RTT) measurement has been made for a segment sent between the sender and the receiver, the sender SHOULD set RTO <- 1 second [i.e. not the outdated 3s currently in FreeBSD] - the "backing off" on repeated retransmission still applies.

[(2.2)] When the first RTT measurement R is made, the host MUST set

    SRTT <- R
    RTTVAR <- R/2
    RTO <- SRTT + max (G, K*RTTVAR)

[(2.3)] When a subsequent RTT measurement R' is made, a host must set

    RTTVAR <- (1 - beta)*RTTVAR + beta * |SRTT - R'|
    SRTT <- (1 - alpha)*SRTT + alpha*R'

The value of SRTT used in updating RTTVAR is the one prior to the update in the second assignment - i.e. the updates are done RTTVAR first, then SRTT. The above calculation SHOULD be done with alpha = 1/8 and beta = 1/4 (as suggested in [JK88]). [N.B. Should these values be smaller in the data center so that the SRTT maintains a longer memory and isn't compromised by a transient microburst?]

[(2.4)] Whenever RTO is computed, if it is less than 1 second, then the RTO SHOULD be rounded up to 1 second. [See the incast section for why this is unequivocally wrong in the data center.]

Traditionally, TCP implementations use coarse grain clocks to measure the RTT and trigger the RTO, which imposes a large minimum value on the RTO. Research suggests that a large minimum RTO is needed to keep TCP conservative and avoid spurious retransmissions [AP99]. Therefore, this specification requires a large minimum RTO as a conservative approach, while at the same time acknowledging that at some future point, research may show that a smaller minimum RTO is acceptable or superior. [Vasudevan09 (incast section) clearly shows this to be the case.]

Note that a TCP implementation MAY clear SRTT and RTTVAR after backing off the timer multiple times as it is likely that the current SRTT and RTTVAR are bogus in this situation. Once SRTT and RTTVAR are cleared, they should be initialized with the next RTT sample taken per (2.2) rather than using (2.3).

[(7)] Changes from RFC 2988

This document reduces the initial RTO from the previous 3 seconds [PA00] to 1 second, unless the SYN or the ACK of the SYN is lost, in which case the default RTO is reverted to 3 seconds before data transmission begins.
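[Pulling rules (2.2)-(2.4) together, an estimator ends up looking roughly like the sketch below (floating point for clarity; a kernel would use fixed point). The 1-second floor from (2.4) is kept here even though, as noted above and in the incast discussion, it is far too coarse for data-center RTTs. Names are illustrative.]

#include <math.h>

#define RTO_K       4.0
#define RTO_ALPHA   (1.0 / 8.0)
#define RTO_BETA    (1.0 / 4.0)
#define RTO_MIN     1.0     /* seconds, per (2.4) */

struct rtt_estimator {
    double srtt;
    double rttvar;
    double rto;
    int    have_sample;
};

void
rtt_sample(struct rtt_estimator *e, double r, double granularity)
{
    if (!e->have_sample) {
        /* (2.2): the first measurement seeds the estimator. */
        e->srtt = r;
        e->rttvar = r / 2.0;
        e->have_sample = 1;
    } else {
        /* (2.3): update RTTVAR using the old SRTT, then update SRTT. */
        e->rttvar = (1.0 - RTO_BETA) * e->rttvar +
            RTO_BETA * fabs(e->srtt - r);
        e->srtt = (1.0 - RTO_ALPHA) * e->srtt + RTO_ALPHA * r;
    }
    e->rto = e->srtt + fmax(granularity, RTO_K * e->rttvar);
    if (e->rto < RTO_MIN)   /* (2.4): questionable in the data center */
        e->rto = RTO_MIN;
}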
Increasing TCP's initial window:
http://www.rfc-editor.org/rfc/rfc3390.txt
http://www.rfc-editor.org/rfc/rfc6928.txt

Proposes an experiment to increase the permitted TCP initial window (IW) from between 2 and 4 segments, as specified in RFC 3390, to 10 segments, with a fallback to the existing recommendation when performance issues are detected. It discusses the motivation behind the increase, the advantages and disadvantages of the higher initial window, and presents results from several large-scale experiments showing that the higher initial window improves the overall performance of many web services without resulting in a congestion collapse.

TCP Modification:
 - The upper bound for the initial window will be:

    min (10*MSS, max (2*MSS, 14600))

 - This change applies to the initial window of the connection in the first round-trip time (RTT) of data transmission during or following the TCP three-way handshake.
 - All the test results described in this document were based on the regular Ethernet MTU of 1500 bytes. Future study of the effect of a different MTU may be needed to fully validate (1) above.
 - [In contrast to RFC 3390 and RFC 5681] The proposed change to reduce the default retransmission timeout (RTO) to 1 second [RFC6298] increases the chance for spurious SYN or SYN/ACK retransmission, thus unnecessarily penalizing connections with RTT > 1 second if their initial window is reduced to 1 segment. For this reason, it is RECOMMENDED that implementations refrain from resetting the initial window to 1 segment unless there has been more than one SYN or SYN/ACK retransmission or true loss detection has been made.
 - TCP implementations use slow start in as many as three different ways: (1) to start a new connection (the initial window); (2) to restart transmission after a long idle period (the restart window); and (3) to restart transmission after a retransmit timeout (the loss window). The change specified in this document affects the value of the initial window. Optionally, a TCP MAY set the restart window to the minimum of the value used for the initial window and the current value of cwnd (in other words, using a larger value for the restart window should never increase the size of cwnd). These changes do NOT change the loss window, which must remain 1 segment of MSS bytes (to permit the lowest possible window size in the case of severe congestion).
 - To limit any negative effect that a larger initial window may have on links with limited bandwidth or buffer space, implementations SHOULD fall back to RFC 3390 for the restart window (RW) if any packet loss is detected during either the initial window or a restart window, and more than 4 KB of data is sent.

4. Background
 - According to the latest report from Akamai [AKAM10], global broadband (> 2 Mbps) adoption has surpassed 50%, propelling the average connection speed to 1.7 Mbps, while narrowband (< 256 Kbps) usage has dropped to 5%. In contrast, TCP's initial window has remained 4 KB for a decade [RFC2414], corresponding to a bandwidth utilization of less than 200 Kbps per connection, assuming an RTT of 200 ms.
 - A large proportion of flows on the Internet are short web transactions over TCP and complete before exiting TCP slow start.
 - Applications have responded to TCP's "slow" start. Web sites use multiple subdomains [Bel10] to circumvent the HTTP 1.1 limit of two connections per physical host [RFC2616]. As of today, major web browsers open multiple connections to the same site (up to six connections per domain [Ste08], and the number is growing). This trend is a remedy for HTTP's serialized downloads, achieving parallelism and higher performance, but it also implies that today most access links are severely under-utilized, hence having multiple TCP connections improves performance most of the time.
 - Persistent connections and pipelining are designed to address some of the above issues with HTTP [RFC2616]. Their presence does not diminish the need for a larger initial window; e.g., data from the Chrome browser shows that 35% of HTTP requests are made on new TCP connections. Our test data also shows significant latency reduction with the large initial window even in conjunction with these two HTTP features [Duk10].

5. Advantages of Larger Initial Windows
 - Reducing Latency: An increase of the initial window from 3 segments to 10 segments reduces the total transfer time for data sets greater than 4 KB by up to 4 round trips. The table below compares the number of round trips between IW=3 and IW=10 for different transfer sizes, assuming infinite bandwidth, no packet loss, and the standard delayed ACKs with a large delayed-ACK timer.
    ---------------------------------------
    | total segments |  IW=3   |  IW=10   |
    ---------------------------------------
    |        3       |    1    |    1     |
    |        6       |    2    |    1     |
    |       10       |    3    |    1     |
    |       12       |    3    |    2     |
    |       21       |    4    |    2     |
    |       25       |    5    |    2     |
    |       33       |    5    |    3     |
    |       46       |    6    |    3     |
    |       51       |    6    |    4     |
    |       78       |    7    |    4     |
    |       79       |    8    |    4     |
    |      120       |    8    |    5     |
    |      127       |    9    |    5     |
    ---------------------------------------

   For example, with the larger initial window, a transfer of 32 segments of data will require only 2 rather than 5 round trips to complete.

 - Recovering Faster from Loss on Under-Utilized or Wireless Links: A greater-than-3-segment initial window increases the chance to recover from packet loss through Fast Retransmit rather than the lengthy initial RTO [RFC5681], because the fast retransmit algorithm requires three duplicate ACKs as an indication that a segment has been lost rather than reordered. While newer loss recovery techniques such as Limited Transmit [RFC3042] and Early Retransmit [RFC5827] have been proposed to help speed up loss recovery from a smaller window, both algorithms can still benefit from the larger initial window because of a better chance to receive more ACKs.

8. Mitigation of Negative Impact
   Much of the negative impact from an increase in the initial window is likely to be felt by users behind slow links with limited buffers. The negative impact can be mitigated by hosts directly connected to a low-speed link advertising an initial receive window smaller than 10 segments. This can be achieved either through manual configuration by the users or through the host stack auto-detecting the low-bandwidth links. Additional suggestions to improve the end-to-end performance of slow links can be found in RFC 3150 [RFC3150].

RTO & High Performance:
https://tools.ietf.org/html/rfc7323

Updates the venerable RFC 1323. [Also in RFC 1323] An additional mechanism could be added to the TCP, a per-host cache of the last timestamp received from any connection. This value could then be used in the PAWS mechanism to reject old duplicate segments from earlier incarnations of the connection, if the timestamp clock can be guaranteed to have ticked at least once since the old connection was open. This would require that the TIME-WAIT delay plus the RTT together must be at least one tick of the sender's timestamp clock. Such an extension is not part of the proposal of this RFC.

Appendix G. RTO Calculation Modification
   Taking multiple RTT samples per window would shorten the history calculated by the RTO mechanism in [RFC6298], and the below algorithm aims to maintain a similar history as originally intended by [RFC6298]. It is roughly known how many samples a congestion window's worth of data will yield, not accounting for ACK compression and ACK losses. Such events will result in more history of the path being reflected in the final value for RTO, and are uncritical. This modification will ensure that a similar amount of time is taken into account for the RTO estimation, regardless of how many samples are taken per window:

    ExpectedSamples = ceiling(FlightSize / (SMSS * 2))
    alpha' = alpha / ExpectedSamples
    beta' = beta / ExpectedSamples

   Note that the factor 2 in ExpectedSamples is due to "Delayed ACKs". Instead of using alpha and beta in the algorithm of [RFC6298], use alpha' and beta' instead:

    RTTVAR <- (1 - beta') * RTTVAR + beta' * |SRTT - R'|
    SRTT <- (1 - alpha') * SRTT + alpha' * R'
    (for each sample R')
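A minimal sketch of the Appendix G gain scaling, assuming per-ACK RTT sampling as timestamps allow; the function and parameter names are invented for the example and this is not the FreeBSD implementation:

    #include <math.h>
    #include <stdint.h>

    /* RFC 7323 Appendix G: scale the RFC 6298 gains by the number of RTT
     * samples expected per window, so that sampling on (almost) every ACK
     * does not shorten the estimator's memory.  Illustrative only. */
    static void
    rtt_sample_scaled(double *srtt, double *rttvar, double r,
        uint32_t flight_size, uint32_t smss)
    {
        const double alpha = 1.0 / 8.0, beta = 1.0 / 4.0; /* RFC 6298 gains */

        /* Roughly one sample per two segments in flight (delayed ACKs). */
        double expected_samples = ceil((double)flight_size / (smss * 2.0));
        double alpha_p = alpha / expected_samples;
        double beta_p  = beta / expected_samples;

        *rttvar = (1.0 - beta_p) * *rttvar + beta_p * fabs(*srtt - r);
        *srtt   = (1.0 - alpha_p) * *srtt + alpha_p * r;
    }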
Appendix H. Changes from RFC 1323
   Several important updates and clarifications to the specification in RFC 1323 are made in this document. The [important] technical changes are summarized below:

   (d) The description of which TSecr values can be used to update the measured RTT has been clarified. Specifically, with timestamps, the Karn algorithm [Karn87] is disabled. The Karn algorithm disables all RTT measurements during retransmission, since it is ambiguous whether the <ACK> is for the original segment or the retransmitted segment. With timestamps, that ambiguity is removed since the TSecr in the <ACK> will contain the TSval from whichever data segment made it to the destination.

   (e) RTTM update processing explicitly excludes segments not updating SND.UNA. The original text could be interpreted to allow taking RTT samples when SACK acknowledges some new, non-continuous data.

   (f) In RFC 1323, Section 3.4, step (2) of the algorithm to control which timestamp is echoed was incorrect in two regards: (1) It failed to update TS.Recent for a retransmitted segment that resulted from a lost <ACK>. (2) It failed if SEG.LEN = 0. In the new algorithm, the case of SEG.TSval >= TS.Recent is included for consistency with the PAWS test.

   (g) It is now recommended that the Timestamps option be included in <RST> segments if the incoming segment contained a Timestamps option.

   (h) <RST> segments are explicitly excluded from PAWS processing.

   (j) Snd.TSoffset and Snd.TSclock variables have been added. Snd.TSclock is the sum of my.TSclock and Snd.TSoffset. This allows the starting points for timestamp values to be randomized on a per-connection basis. Setting Snd.TSoffset to zero yields the same results as [RFC1323]. Text was added to guide implementers to the proper selection of these offsets, as entirely random offsets for each new connection will conflict with PAWS.

Congestion Window Validation (CWV):
http://www.ietf.org/proceedings/69/slides/tcpm-7.pdf
https://tools.ietf.org/html/rfc7661

Provides a mechanism to address issues that arise when TCP is used for traffic that exhibits periods where the sending rate is limited by the application rather than the congestion window. This RFC provides an experimental update to TCP that allows a TCP sender to restart quickly following a rate-limited interval. This method is expected to benefit applications that send rate-limited traffic using TCP while also providing an appropriate response if congestion is experienced.

Motivation: Standard TCP states that a TCP sender SHOULD set cwnd to no more than the Restart Window (RW) before beginning transmission if the TCP sender has not sent data in an interval exceeding the retransmission timeout, i.e., when an application becomes idle [RFC5681]. [RFC2861] notes that this TCP behaviour was not always observed in current implementations, and experiments confirm this to still be the case (see [Bis08]). Congestion Window Validation (CWV) [RFC2861] introduced the term "application-limited period" for the time when the sender sends less than is allowed by the congestion or receiver windows. Standard TCP does not impose additional restrictions on the growth of the congestion window when a TCP sender is unable to send at the maximum rate allowed by the cwnd. In this case, the rate-limited sender may grow a cwnd far beyond that corresponding to the current transmit rate, resulting in a value that does not reflect current information about the state of the network path the flow is using.
Use of such an invalid cwnd may result in reduced application performance and/or could significantly contribute to network congestion.

Active Queue Management (AQM):

Active Queue Management is an effort to avoid the latency increases (and the increase in time spent in the feedback loop) and the bursty losses caused by naive tail drop in intermediate buffering. The concept was introduced, along with a discussion of the queue management algorithm "RED" (Random Early Detect/Drop), by RFC 2309. The most current RFC is 7567. The usual mix of long high-throughput and short low-latency flows places conflicting demands on the queue occupancy of a switch:

   o The queue must be short enough that it does not impose excessive latency on short flows.
   o The queue must be long enough to buffer sufficient data for the long flows to saturate the path capacity.
   o The queue must be short enough to absorb incast bursts without excessive packet loss.

RED: The RED algorithm itself consists of two main parts: estimation of the average queue size and the decision of whether or not to drop an incoming packet.

   (a) Estimation of Average Queue Size
       RED estimates the average queue size, either in the forwarding path using a simple exponentially weighted moving average (such as presented in Appendix A of [Jacobson88]), or in the background (i.e., not in the forwarding path) using a similar mechanism.

   (b) Packet Drop Decision
       In the second portion of the algorithm, RED decides whether or not to drop an incoming packet. It is RED's particular algorithm for dropping that results in performance improvement for responsive flows. Two RED parameters, minth (minimum threshold) and maxth (maximum threshold), figure prominently in this decision process. Minth specifies the average queue size *below which* no packets will be dropped, while maxth specifies the average queue size *above which* all packets will be dropped. As the average queue size varies from minth to maxth, packets will be dropped with a probability that varies linearly from 0 to maxp.

Recommendations on Queue Management and Congestion Avoidance in the Internet
https://tools.ietf.org/html/rfc2309
IETF Recommendations Regarding Active Queue Management
https://tools.ietf.org/html/rfc7567
https://en.wikipedia.org/wiki/Active_queue_management
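As a rough sketch of the two RED components just described (queue-average estimation and the linear minth/maxth drop decision) - omitting refinements such as idle-time handling and the packet count since the last drop, and not taken from any switch's or FreeBSD's actual code:

    #include <stdbool.h>
    #include <stdlib.h>

    struct red {
        double avg;    /* EWMA of the queue size, in packets */
        double wq;     /* EWMA weight, e.g. 0.002 */
        double minth;  /* no drops below this average occupancy */
        double maxth;  /* all packets dropped above this average occupancy */
        double maxp;   /* drop probability as avg approaches maxth */
    };

    /* Returns true if the incoming packet should be dropped (or CE-marked
     * when the same rule is used as an ECN marking decision). */
    static bool
    red_enqueue(struct red *r, unsigned int qlen)
    {
        double p;

        r->avg = (1.0 - r->wq) * r->avg + r->wq * qlen;

        if (r->avg < r->minth)
            return false;
        if (r->avg >= r->maxth)
            return true;

        /* Probability grows linearly from 0 at minth to maxp at maxth. */
        p = r->maxp * (r->avg - r->minth) / (r->maxth - r->minth);
        return ((double)rand() / RAND_MAX) < p;
    }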
Explicit Congestion Notification (ECN):

At its core, ECN in TCP allows compliant routers to provide compliant senders with notification of "virtual drops" - a congestion indicator telling the sender to halve its congestion window. This allows the sender to learn of a congestion event without waiting for the retransmit timeout or repeated ACKs, and allows the receiver to avoid the latency induced by drop/retransmit. ECN relies on some form of AQM in the intermediate routers/switches to decide when to mark the CE (Congestion Encountered) bits in the IP header; it is then the receiver's responsibility to set ECE (ECN-Echo) in the TCP header of the subsequent ACK. The receiver will continue to send packets marked with the ECE bit until it receives a packet with the CWR (Congestion Window Reduced) bit set. Note that although this last design decision makes it robust in the presence of ACK loss (the original version of ECN specifies that ACKs / SYNs / SYN-ACKs not be marked as ECN capable and thus are not eligible for marking), it limits the use of ECN to once per RTT. As we'll see later this leads to interoperability issues with DCTCP.

ECN is negotiated at connection time. In FreeBSD it is configured by a sysctl defaulting to off for all connections; enabling the sysctl enables it for all connections. The last time a survey was done, 2.7% of the Internet would not respond to a SYN negotiating ECN. This isn't fatal, as subsequent SYNs will switch to not requesting ECN - it just adds the default RTO to connection establishment (3s in FreeBSD, 1s per RFC 6298 - discussed later). Linux has some very common-sense configurability improvements. Its ECN knob takes on _3_ values: 0) no request / no accept, 1) no request / accept, 2) request / accept. The default is (1), supporting ECN for those adventurous enough to request it. The route command can specify ECN by subnet, in effect allowing servers / clients to only use it within a data center or between compliant data centers.

ECN sees very little usage due to continued compatibility concerns. Although the difficulty of correctly tuning maxth and minth in RED and many other AQM mechanisms is not specific to ECN, RED et al. are necessary to use ECN and thus further add to the difficulties associated with its use.

Talks:

More Accurate ECN Feedback in TCP (AccECN)
 - https://www.ietf.org/proceedings/90/slides/slides-90-tcpm-10.pdf
ECN is slow and does not report the extent of congestion, just its existence. It lacks interoperability with DCTCP. There is a need for a mechanism for negotiating finer-grained, adaptive congestion notification.

RFCs:

A Proposal to add Explicit Congestion Notification (ECN) to IP
 - https://tools.ietf.org/html/rfc2481
Initial proposal.

The Addition of Explicit Congestion Notification (ECN) to IP
 - https://tools.ietf.org/html/rfc3168
Elaboration and further specification of how to tie it in to TCP.

Adding Explicit Congestion Notification (ECN) Capability to TCP's SYN/ACK Packets
 - https://tools.ietf.org/html/rfc5562
Sometimes referred to as ECN+. This extends ECN to SYN/ACK packets. Note that SYN packets are still not covered, being considered a potential security hole.

Accurate ECN (AccECN)

Problem Statement and Requirements for Increased Accuracy in Explicit Congestion Notification (ECN) Feedback
 - https://tools.ietf.org/html/rfc7560
"A primary motivation for this document is to intervene before each proprietary implementation invents its own non-interoperable handshake, which could lead to _de facto_ consumption of the few flags or codepoints that remain available for standardizing capability negotiation."

Incast:

The term was coined in [PANFS] for the case of increasing the number of simultaneously initiated, effectively barrier-synchronized, fan-in flows into a single port to the point where the instantaneous switch / NIC buffering capacity is exceeded, causing a decline in aggregate bandwidth as the need for retransmits increases. This is further exacerbated by tail-drop behavior in the switch, whereby multiple losses within individual streams exceed the recovery abilities of duplicate ACKs or SACK, leading to RTOs before the flow is resumed.

The Panasas ActiveScale Storage Cluster - Delivering Scalable High Bandwidth Storage [PANFS]
 - http://acm.supercomputing.org/sc2004/schedule/pdfs/pap207.pdf

Focuses on the Object-based Storage Device (OSD) component backing the PanFS distributed file system. PanFS runs on the client; backend storage consists of networked block devices (OSDs).
The intelligence consists in how stripes are laid out across OSDs. PanFS relies on a Metadata Server (MDS) to control the interaction of clients with the objects on OSDs and to maintain cache coherency.

Scalable bandwidth is achieved through aggregation by striping data across many OSDs. Although in principle it would be desirable to stripe files as widely as possible, in practice, in their 1Gbps testbed (this is 2004), bandwidth scaled linearly from 3 to 7 OSDs but after 14 OSDs aggregate bandwidth actually decreased. With a 10ms disk access latency, if just one OSD experienced enough packet loss to result in one 200ms RTO, the system would suffer a 10x decrease in performance.

Changes to address the incast problem:
 - Reduce the minRTO from 200ms to 50ms.
 - Tune the _individual_ socket buffer size. While a client must have a large aggregate receive buffer size, each individual stream's receive buffer should be relatively small. Thus they reduced the clients' (per-OSD) receive socket buffer to under 64K.
 - To reduce the size of a single synchronized incast response, PanFS implements a two-level striping pattern. The first level is optimized for RAID's parity update performance and read overhead. The second level of striping is designed to resist incast-induced bandwidth penalties by stacking successive parity stripes in the same subset of objects. They call N sequential parity stripes stacked in the same set of objects a 'visit', because a client repeatedly fetches data from just a few OSDs (whose number is controlled by the parity stripe width) for a while, then moves on to the next set of OSDs. This striping pattern minimizes simultaneous fan-in and thus the potential for incast. Typically PanFS stripes about 1GB of data per visit, using a round-robin layout algorithm of visits across all OSDs.

Measurement and Analysis of TCP Throughput Collapse in Cluster-based Storage Systems
 - https://www.usenix.org/legacy/event/fast08/tech/full_papers/phanishayee/phanishayee_html/

Attempts a more general analysis of incast than [PANFS]. The analysis is based on the model of a cluster-based storage system with data blocks striped over a number of servers. They refer to a single block fragmented over multiple servers as a Server Request Unit (SRU). A subsequent block request will only be made after the client has received all the data for the current block. They refer to such reads as 'synchronized reads'. The paper makes three contributions to the literature:
 - Explores the root causes of incast, characterizing it under a variety of conditions (buffer space, varying number of servers, etc.). Buffer space can delay the onset of incast, but any particular switch configuration will have some maximum number of servers that can send simultaneously before throughput collapse occurs.
 - Reproduces incast collapse on 3 different models of switches. In some cases disabling QoS can help delay incast by freeing up packet buffers for general switching.
 - Demonstrates the applicability of simulation by showing that the throughput collapse curve produced by ns-2 with a simulated 32KB buffer closely matches that shown by the HP Procurve 2848 with QoS disabled.
 - Analysis of TCP traces obtained from simulation reveals that TCP retransmission timeouts are the primary cause of incast.
 - Displays the effect of varying the switch buffer size:
   Doubling the size of the switch's output port buffer doubles the number of servers that can be supported before the system experiences incast.
 - TCP performs well in settings without synchronized reads, which can be modelled by an infinite SRU size. Running netperf across many servers does not induce incast. With larger SRU sizes servers can use the spare link capacity made available by any stalled flow waiting for a timeout event.
 - Examines the effectiveness of existing TCP variants (e.g. Reno, NewReno, SACK, and limited transmit). Although the move from Reno to NewReno improves performance, none of the additional improvements help. When TCP loses all packets in its window or loses retransmissions, no clever loss recovery algorithms can help.
 - Examines a set of techniques that are moderately effective in masking incast, such as drastically reducing TCP's retransmission timeout timer. None of these techniques are without drawbacks.
   - Reducing RTOmin from 200ms to 200us improves throughput by an order of magnitude for 8-32 servers. However, at the time of the paper Linux and BSD TCP implementations were unable to provide a timer of sufficient granularity to calculate RTT at less than the system clock frequency.

Understanding TCP Incast Throughput Collapse in Datacenter Networks
 - http://conferences.sigcomm.org/sigcomm/2009/workshops/wren/papers/p73.pdf

Proposes an analytical model of limited generality based on the results observed in two test beds.
 - Observed little benefit from disabling delayed ACKs.
 - Observed a much shallower decline in throughput after 4 servers with 1ms minRTO vs 200ms minRTO. No benefit was shown for 200us over 1ms. [The next paper concludes that this was because the calculated RTO never went below 5ms, so a 200us minRTO was equivalent to disabling minRTO in this setting.]
 - For large RTO timer values, reducing the RTO timer value is a first-order mitigation. For smaller RTO timer values, intelligently controlling the inter-packet wait time [pacing] becomes crucial.
 - Observes two regions of throughput increase. Following the initial throughput decline there is an increasing region. They reason that as the number of senders increases, 'T' increases, and there is less overlap in the RTO periods for different senders. This means the impact of RTO events is less severe - a mitigating effect. (Prob(enter RTO at t) = { 1/T : d < t < d + T, 0 : otherwise }, where d is the delay for congestion info to propagate back to the sender and T is the width of the uniform distribution in time.)
 - The smaller the RTO timer values, the faster the rate of recovery between the throughput minimum and the second-order throughput maximum. For smaller RTO timer values, the same increase in 'T' will have a larger mitigating effect. Hence, as the number of senders increases, the same increase in 'T' will result in a faster increase in goodput for smaller RTO timer values.
 - After the second-order goodput maximum, the slope of throughput decrease is the same for different RTO timer values. When 'T' becomes comparable to or larger than the RTO timer value, the amount of interference between retransmits after RTO and transmissions before RTO no longer depends on the value of the RTO timer.
Safe and Effective Fine-grained TCP Retransmissions for Datacenter Communication
 - https://www.cs.cmu.edu/afs/cs.cmu.edu/Web/People/ekrevat/docs/SIGCOMMIncast.pdf

Effectively makes the case for using high-resolution timers to enable microsecond-granularity TCP timeouts. They claim to demonstrate that this technique is effective in avoiding TCP incast collapse in both simulation and real-world experiments.
 - Prototype uses Linux's high-resolution kernel timers.
 - Demonstrates that this change prevents incast collapse in practice for up to 47 senders.
 - Demonstrates that simply reducing RTOmin in today's [2009] TCP implementations without also improving the timing granularity does not prevent TCP incast.
 - Even without incast patterns, the RTO can determine observed performance. Simple example: they started ten bulk-data transfer TCP flows from ten clients to one server, then had another client issue small request packets for 1KB of data from the server, waiting for the response before sending the next request. Approximately 1% of these requests experienced a TCP timeout, delaying the response by at least 200ms. Finer-grained retransmission handling can improve the performance of latency-sensitive applications.

Evaluating Throughput with Fine-Grained RTO:
 - To be maximally effective, timers must operate on a granularity close to the RTT of the network.
 - Jacobson RTO Estimation:
   - The standard RTO estimator [V. Jacobson, 88] tracks a smoothed estimate of the round-trip time, and sets the timeout to this RTT estimate plus 4 times the mean deviation (a simpler calculation than the standard deviation; given a normal distribution of prediction errors, mdev = sqrt(pi/2)*sdev).
   - RTO = SRTT + 4*RTTMDEV
   - Two factors set lower bounds on the value that the RTO can achieve:
     - the explicit configuration parameter RTOmin
     - the implicit effects of the granularity with which the RTT is measured and with which the kernel sets and checks timers. Most implementations track RTTs and timers at a granularity of 1ms or larger, so the minimum achievable RTO is 5ms.
 - In Simulation (one client with multiple servers connected through a single switch with an unloaded RTT of 100us, each node with a 1Gbps link, switch buffers of 32KB per output port, and a random timer scheduling delay of up to 20us to account for real-world variance):
   - With an RTOmin of 200ms, throughput drops by an order of magnitude with 8 concurrent senders.
   - Reducing RTOmin to 1ms is effective for 8-16 concurrent senders, fully utilizing the client's link. However, throughput declines as the number of servers is increased; 128 concurrent senders use only 50% of the available link bandwidth even with a 1ms RTOmin.
 - In Real Clusters (sixteen-node cluster w/ HP Procurve 2848 & 48-node cluster w/ Force10 S50 switch - all nodes 1Gbps and a client-to-server RTT of ~100us):
   - Modified the Linux 2.6.28 kernel to use 'microsecond-accurate' timers with microsecond-granularity RTT estimation.
   - For all configurations, throughput drops with increasing RTOmin above 1ms. For 8 and 16 concurrent senders, the default RTOmin of 200ms results in nearly 2 orders of magnitude drop in throughput.
   - Results show identical performance for RTOmin values of 200us and 1ms.
     Although the baseline RTTs can be between 50-100us, increased congestion causes RTTs to rise to 400us on average, with spikes as high as 850us. The higher RTTs combined with increased RTT variance cause the RTO estimator to set timeouts of 1-3ms, so an RTOmin below 1ms will not lead to shorter retransmission times. In effect, specifying an RTOmin <= 1ms is equivalent to eliminating RTOmin.
 - Next-Generation Datacenters:
   - 10Gbps networks have smaller RTTs than 1Gbps - port-to-port latency can be as low as 10us. In a sampling of an active storage node at LANL, 20% of RTTs are below 100us even when accounting for kernel scheduling.
   - Smaller RTO values are required to avoid idle link time.
 - Scaling to Thousands [simulating large numbers of servers on a 10Gbps network] (reduce baseline RTTs from 100us to 20us, eliminate the 20us timer scheduling variance, increase link capacity to 10Gbps, set per-port buffer size to 32KB, increase blocksize to 80MB to ensure each flow can saturate a 10Gbps link, vary the number of servers from 32 to 2048):
   - Having an artificial bound of either 1ms or 200us results in low throughput in a network whose RTTs are 20us - underscoring the requirement that retransmission timeouts be on the same timescale as network latency to avoid incast collapse.
   - Eliminating a lower bound on the RTO performs well for up to 512 concurrent senders. For 1024 servers and beyond, even the aggressively low RTO configuration sees up to a 50% reduction in throughput resulting from significant periods of link idle time caused by repeated, simultaneous, successive timeouts.
   - For incast communication the standard exponential backoff increase of the RTO can overshoot some portion of the time the link is actually idle. Because only one flow must overshoot to delay the entire transfer, the probability of overshooting increases with an increased number of flows.
   - Decreased throughput for a large number of flows can be attributed to many flows timing out simultaneously, backing off deterministically, and retransmitting at the same time. While some flows are successful on this retransmission, a majority of flows lose their retransmitted packet and back off by another factor of two, sometimes far beyond when the link becomes idle.
 - Desynchronizing Retransmissions:
   - Adding some randomness to the RTO will desynchronize retransmissions.
   - Adding an adaptive randomized RTO to the scheduled timeout:

        timeout = (RTO + (rand(0.5) x RTO)) x 2^backoff

     performs well regardless of the number of concurrent senders. Nonetheless, real-world variances may be large enough to avoid the need for explicit randomization in practice. (A sketch of this backoff appears after this paper's summary below.)
   - Does not evaluate the impact on wide-area flows.
 - Implementing fine-grained retransmissions. Three changes to the Linux TCP stack were required:
   - microsecond-resolution time accounting to track RTTs with greater precision - store microseconds in the TCP timestamp option [timestamp resolution can go as high as 57ns without violating the requirements of PAWS]
   - redefinition of TCP constants - timer constants formerly defined in terms of jiffies [ticks] are converted to absolute values (e.g.
1ms instead of 1 jiffy)
   - replacement of low-resolution timers with hrtimers - replace the standard timer objects in the socket structure with the hrtimer structure, ensuring that all calls to set, reset, or clear timers use the hrtimer functions.
 - Results:
   - Using the default 200ms RTOmin, throughput plummets beyond 8 concurrent senders on both testbeds.
   - On the 16-server testbed, with a 5ms jiffy-based RTOmin, throughput begins to drop at 8 servers to ~70% of link capacity and slowly decreases thereafter. On the 47-server testbed [Force10 switch] the 5ms RTOmin kernel obtained 70-80% throughput with a substantial decline after 40 servers.
   - The TCP hrtimer implementation / microsecond RTO kernel is able to saturate the link for up to 16/47 servers [the total number in both testbeds].
 - Implications of Fine-Grained TCP Retransmissions:
   - A receiver's delayed ACK timer should always fire before the sender's retransmission timer fires, to prevent the sender from timing out waiting for an ACK that is merely delayed. Current systems protect against this by setting the delayed ACK timer to a value (40ms) that is safely under the RTOmin (200ms).
   - A host with microsecond-granularity retransmissions would periodically experience an unnecessary timeout when communicating with unmodified hosts in environments where the RTO is below 40ms (e.g., in the data center and for short flows in the WAN), because the sender incorrectly assumes that a loss has occurred. In practice the two consequences are mitigated by newer TCP features and the limited circumstances in which they occur (and bulk data transfer is essentially unimpacted by the issue).
   - The major potential effect of a spurious timeout is a loss of performance: a flow that experiences a timeout will reduce its slow-start threshold (ssthresh) by half, reduce its window to one, and attempt to rediscover link capacity. It is important to understand that spurious timeouts do not endanger network stability through increased congestion [On estimating end-to-end network path properties, SIGCOMM 99]. Spurious timeouts occur not when the network path drops packets, but rather when the path observes a sudden, higher delay.
   - Several algorithms have been proposed to undo the effects of spurious timeouts; one of them, F-RTO [Forward RTO-Recovery, RFC 4138], has been adopted in the Linux TCP implementation.
   - When seeding torrents over a WAN there was no observable difference in performance between the 200us and 200ms RTOmin [no penalty].
   - Interaction with Delayed ACK in the Datacenter: For servers using a reduced RTO in a datacenter environment, the server's retransmission timer may expire long before an unmodified client's 40ms delayed ACK timer expires. As a result, the server will time out and resend the unacked packet, cutting ssthresh in half and rediscovering link capacity using slow start. Because the client acknowledges the retransmitted segment immediately, the server does not observe a coarse-grained 40ms delay, only an unnecessary timeout.
   - Although for full performance delayed ACKs should be disabled, unmodified clients still achieve good performance and avoid incast when only the servers implement fine-grained retransmissions.
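As flagged in the Desynchronizing Retransmissions notes above, here is a minimal sketch of the randomized exponential backoff, timeout = (RTO + rand(0.5) x RTO) x 2^backoff. It only illustrates the formula; it is not code from the paper or from FreeBSD:

    #include <stdlib.h>

    /* Desynchronized retransmission backoff.  Times are in microseconds;
     * "rand(0.5)" denotes a uniform value in [0, 0.5). */
    static double
    rto_backoff_timeout(double rto_us, unsigned int backoff_shift)
    {
        double jitter = 0.5 * ((double)rand() / ((double)RAND_MAX + 1.0));

        return (rto_us + jitter * rto_us) * (double)(1u << backoff_shift);
    }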
Data Center Transmission Control Protocol (DCTCP):

The Microsoft- and Stanford-developed CC protocol uses simplified switch RED/ECN CE marking to provide fine-grained congestion notification to senders. RED is enabled in the switch, but with minth=maxth=K, where K is an empirically determined constant that is a function of bandwidth and desired switch utilization vs. rate of convergence. Common values for K are 5 for 1Gbps and 60 for 10Gbps; the value for 40Gbps is presumably on the order of 240. The sender's congestion window is scaled back once per RTT as a function of (#ECE/(#segments in window))/2. In the degenerate case of all segments being marked, the window is scaled back a la a loss in Reno. In the steady state, latencies are much lower than in Reno due to considerably reduced switch occupancy.

There is currently no mechanism for negotiating CC protocols, and DCTCP's reliance on continuous ECE notifications is incompatible with ECN's continuous repeating of the same ECE until a CWR is received. In effect, ECN support has to be successfully negotiated when establishing the connection, but the receiver has to instead provide one ECE per new CE seen.

RFC:

Datacenter TCP (DCTCP): TCP Congestion Control for Datacenters
https://tools.ietf.org/pdf/draft-ietf-tcpm-dctcp-00.pdf

The window scaling constant is referred to as 'alpha'. Alpha=0 corresponds to no congestion; alpha=1 corresponds to a loss event in Reno or an ECE mark in standard ECN - resulting in a halving of the congestion window. 'g' is the feedback gain; 'M' is the fraction of bytes marked to bytes sent. Alpha and the congestion window 'cwnd' are calculated as follows:

    alpha = alpha * (1 - g) + g * M
    cwnd = cwnd * (1 - alpha/2)

To cope with delayed ACKs, DCTCP specifies the following state machine - CE refers to DCTCP.CE, a new Boolean TCP state variable, "DCTCP Congestion Encountered", which is initialized to false and stored in the Transmission Control Block (TCB):

                     Send immediate
                     ACK with ECE=0
            .----.   .--------------.   .----.
 Send 1 ACK |    v   v              |   |    | Send 1 ACK
 for every  |  .------.          .------.    | for every
 m packets  |  | CE=0 |          | CE=1 |    | m packets
 with ECE=0 |  '------'          '------'    | with ECE=1
            |    |  |              ^  ^      |
            '----'  '--------------'  '-----'
                    Send immediate
                    ACK with ECE=1

The clear implication of this is that if the ACK is delayed by more than m packets, as with differing assumptions between peers or dropped ACKs, the signal can underestimate the level of encountered congestion. None of the literature suggests that this has been a problem in practice.

[Section 3.4 of RFC] Handling of SYN, SYN-ACK, RST Packets
   [RFC3168] requires that a compliant TCP MUST NOT set ECT on SYN or SYN-ACK packets. [RFC5562] proposes setting ECT on SYN-ACK packets, but maintains the restriction of no ECT on SYN packets. Both these RFCs prohibit ECT in SYN packets due to security concerns regarding malicious SYN packets with ECT set. These RFCs, however, are intended for general Internet use, and do not directly apply to a controlled datacenter environment. The switching fabric can drop TCP packets that do not have ECT set in the IP header. If SYN and SYN-ACK packets for DCTCP connections do not have ECT set, they will be dropped with high probability. For DCTCP connections, the sender SHOULD set ECT for SYN, SYN-ACK and RST packets.
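A minimal sketch of the per-RTT alpha/cwnd update described above (floating point for clarity; real implementations, including FreeBSD's, use fixed point, and the bookkeeping of marked vs. acked bytes is assumed to happen elsewhere):

    #include <stdint.h>

    struct dctcp {
        double   alpha;        /* fraction of the window considered congested */
        double   g;            /* feedback gain, e.g. 1.0/16.0 */
        uint64_t bytes_acked;  /* bytes acknowledged this observation window */
        uint64_t bytes_marked; /* of those, bytes whose ACKs carried ECE */
    };

    /* Called once per RTT (once per observation window). */
    static uint32_t
    dctcp_window_update(struct dctcp *d, uint32_t cwnd)
    {
        double m = 0.0;

        if (d->bytes_acked > 0)
            m = (double)d->bytes_marked / (double)d->bytes_acked;

        /* alpha <- (1 - g)*alpha + g*M */
        d->alpha = (1.0 - d->g) * d->alpha + d->g * m;

        d->bytes_acked = d->bytes_marked = 0;

        /* cwnd <- cwnd * (1 - alpha/2); once alpha has converged to 1
         * (every byte marked) this is the Reno-like halving noted above. */
        if (m > 0.0)
            cwnd = (uint32_t)(cwnd * (1.0 - d->alpha / 2.0));
        return cwnd;
    }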
[Section 4] Implementation Issues
 - The implementation must choose a suitable estimation gain (feedback gain).
   - [DCTCP10] provides a theoretical basis for its selection; in practice it is more practical to select it empirically per network/workload.
   - The Microsoft implementation uses a fixed estimation gain of 1/16.
 - The implementation must decide when to use DCTCP. DCTCP may not be suitable or supported for all peers.
 - It is RECOMMENDED that the implementation deal with loss episodes in the same way as conventional TCP.
 - To prevent incast throughput collapse, the minimum RTO (MinRTO) should be lowered significantly. The default value of MinRTO in Windows is 300ms, Linux 200ms, and FreeBSD 233ms. A lower MinRTO requires a correspondingly lower delayed ACK timeout on the receiver. Thus, it is RECOMMENDED that an implementation allow configuration of lower timeouts for DCTCP connections.
 - It is also RECOMMENDED that an implementation allow configuration of restarting the congestion window (cwnd) of idle DCTCP connections as described in [RFC5681].
 - [RFC3168] forbids the ECN-marking of pure ACK packets, because of the inability of TCP to mitigate ACK-path congestion and protocol-wise preferential treatment by routers. However, dropping pure ACKs - rather than ECN marking them - has disadvantages for typical datacenter traffic patterns. Dropping of ACKs causes subsequent retransmissions. It is RECOMMENDED that an implementation provide a configuration knob that forces ECT to be set on pure ACKs.

[Section 5] Deployment Issues
 - DCTCP and conventional TCP congestion control do not coexist well in the same network. In DCTCP, the marking threshold is set to a very low value to reduce queueing delay, and a relatively small amount of congestion will exceed the marking threshold. During such periods of congestion, conventional TCP will suffer packet loss and quickly and drastically reduce cwnd. DCTCP, on the other hand, will use the fraction of marked packets to reduce cwnd more gradually. Thus, the rate reduction in DCTCP will be much slower than that of conventional TCP, and DCTCP traffic will gain a larger share of the capacity compared to conventional TCP traffic traversing the same path. It is RECOMMENDED that DCTCP traffic be segregated from conventional TCP traffic. [MORGANSTANLEY] describes a deployment that uses the IP DSCP bits to segregate the network such that AQM is applied to DCTCP traffic, whereas TCP traffic is managed via drop-tail queueing.
 - Since DCTCP relies on congestion marking by the switches, DCTCP can only be deployed in datacenters where the entire network infrastructure supports ECN. The switches may also support configuration of the congestion threshold used for marking. The proposed parameterization can be configured with switches that implement RED. [DCTCP10] provides a theoretical basis for selecting the congestion threshold, but as with the estimation gain, it may be more practical to rely on experimentation or simply to use the default configuration of the device. DCTCP will degrade to loss-based congestion control when transiting a congested drop-tail link.
 - DCTCP requires changes on both the sender and the receiver, so both endpoints must support DCTCP. Furthermore, DCTCP provides no mechanism for negotiating its use, so both endpoints must be configured through some out-of-band mechanism to use DCTCP.
   A variant of DCTCP that can be deployed unilaterally and only requires standard ECN behavior has been described in [ODCTCP][BSDCAN], but requires additional experimental evaluation.

[Section 6] Known Issues
 - DCTCP relies on the sender's ability to reconstruct the stream of CE codepoints received by the remote endpoint. To accomplish this, DCTCP avoids using a single ACK packet to acknowledge segments received both with and without the CE codepoint set. However, if one or more ACK packets are dropped, it is possible that a subsequent ACK will cumulatively acknowledge a mix of CE and non-CE segments. This will, of course, result in a less accurate congestion estimate.
   o Even with an inaccurate congestion estimate, DCTCP may still perform better than [RFC3168].
   o If the estimation gain is small relative to the packet loss rate, the estimate may not be too inaccurate.
   o If packet loss mostly occurs under heavy congestion, most drops will occur during an unbroken string of CE packets, and the estimate will be unaffected.
 - The effect of packet drops on DCTCP under real-world conditions has not been analyzed.
 - Much like standard TCP, DCTCP is biased against flows with longer RTTs. A method for improving the fairness of DCTCP has been proposed in [ADCTCP], but requires additional experimental evaluation.

Papers:

Data Center TCP [DCTCP10]
 - http://research.microsoft.com/en-us/um/people/padhye/publications/dctcp-sigcomm2010.pdf

The original DCTCP SIGCOMM paper by Stanford and Microsoft Research. It is very accessible, even for those of us not well versed in CC protocols.
 - reduce minRTO to 10ms.
 - suggest that K > (RTT * C)/7, where C is the sending rate in packets per second.

Attaining the Promise and Avoiding the Pitfalls of TCP in the Datacenter [MORGANSTANLEY]
 - https://www.usenix.org/system/files/conference/nsdi15/nsdi15-paper-judd.pdf

Real-world experience deploying DCTCP on Linux at Morgan Stanley.
 - reduce minRTO to 5ms.
 - reduce delayed ACK to 1ms.
 - Only ToR switches support ECN marking; higher-level switches are purely tail-drop. Tests show that DCTCP successfully resorts to loss-based congestion control when transiting a congested drop-tail link.
 - Find that setting ECT on SYN and SYN-ACK is critical for the practical deployment of DCTCP. Under load, DCTCP would fail to establish network connections in the absence of ECT in SYN and SYN-ACK packets. (DCTCP+)
 - Without correct receive buffer tuning DCTCP will converge _faster_ than TCP, rather than the theoretical 1.4 x TCP.

    Per-packet latency in ms
                TCP       DCTCP+
    Mean        4.01      0.0422
    Median      4.06      0.0395
    Maximum     4.20      0.0850
    Minimum     3.32      0.0280
    sigma       0.167     0.0106

Extensions to FreeBSD Datacenter TCP for Incremental Deployment Support [BSDCAN]
 - https://www.bsdcan.org/2015/schedule/attachments/315_dctcp-bsdcan2015-paper.pdf

Proposes a variant of DCTCP that can be deployed on only one endpoint of a connection, provided the peer is ECN-capable.

ODCTCP changes:
 - In order to facilitate one-sided deployment, a DCTCP sender should set the CWR mark after receiving an ECE-marked ACK once per RTT. This is safe in two-sided deployments, because a regular DCTCP receiver will simply ignore the CWR mark.
 - A one-sided DCTCP receiver should always delay an ACK for incoming packets marked with CWR, which is the only indication of recovery exit.

DCTCP improvements:
 - ECE processing: Under standard ECN an ACK with an ECE mark will trigger congestion recovery.
   When this happens a sender stops increasing cwnd for one RTT. For DCTCP there is no reason for this response: ECEs are used not for detecting congestion events, but to quantify the extent of congestion and react proportionally. Thus, there is no need to stop cwnd from increasing.
 - Set the initial value of alpha to 0 (i.e. don't halve cwnd on the first ECE seen).
 - Idle Periods: The same tradeoffs regarding "slow-start restart" apply to alpha. The FreeBSD implementation re-initializes alpha after an idle period longer than the RTO.
 - Timeouts and Packet Loss: The DCTCP specification defines the update interval for alpha as one RTT. To track this, DCTCP compares received ACKs against the sequence numbers of outgoing packets. This is not robust in the face of packet loss. The FreeBSD implementation addresses this by updating alpha when it detects duplicate ACKs or timeouts.

Data Center TCP (DCTCP)
 - http://www.ietf.org/proceedings/80/slides/iccrg-3.pdf

Case studies, workloads, latency and flow completion time of TCP vs DCTCP. Interesting set of slides worth skimming.
 - Small (10-100KB & 100KB-1MB) background flows complete in ~45% less time than TCP.
 - 99th %ile & 99.9th %ile query flow completion times are 2/3rds and 4/7ths, respectively.
 - Large (1-10MB & > 10MB) flows are unchanged.
 - Query completion time with 10-to-1 background incast is unchanged with DCTCP, ~5x slower with TCP.

Analysis of DCTCP: Stability, Convergence, and Fairness [ADCTCP]
 - http://sedcl.stanford.edu/files/dctcp-analysis.pdf

Follow-up mathematical analysis of DCTCP using a fluid model. Contains interesting graphs showing how the gain factor affects the convergence rate between two flows.
 - Analyzes the convergence of DCTCP sources to their fair share, obtaining an explicit characterization of the convergence rate.
 - Proposes a simple change to DCTCP suggested by the fluid model which significantly improves DCTCP's RTT-fairness. It suggests updating the congestion window continuously rather than once per RTT.
 - Finds that with a marking threshold, K, of about 17% of the bandwidth-delay product, DCTCP achieves 100% throughput, and that even for values of K as small as 1% of the bandwidth-delay product, its throughput is at least 94%.
 - Shows that DCTCP's convergence rate is no more than a factor of 1.4 slower than TCP.

Using Data Center TCP (DCTCP) in the Internet [ODCTCP]
 - http://www.ikr.uni-stuttgart.de/Content/Publications/Archive/Wa_GLOBECOM_14_40260.pdf

Investigates what would be needed to deploy DCTCP incrementally outside the data center.
 - Proposes a finer resolution for the alpha value.
 - Allows the congestion window to grow in the CWR state (similar to [BSDCAN]).
 - Continuous update of alpha: defines a smaller gain factor (1/2^8 instead of 1/2^4) to permit an EWMA updated on every packet. However, g should actually be a function of the number of packets in flight.
 - Progressive congestion window reduction: similar to [ADCTCP], reduce the congestion window on the reception of each ECE.
 - Develops a formula for AQM RED parameters that always results in equal sharing between DCTCP and non-DCTCP.

Incast Transmission Control Protocol (ICTCP):

In ICTCP the receiver plays a direct role in estimating the per-flow available bandwidth and actively re-sizes each connection's receive window accordingly.
 - http://research.microsoft.com/pubs/141115/ictcp.pdf

Quantum Congestion Notification (QCN):

Congestion control in Ethernet.
Introduced as part of the IEEE 802.1 Standards Body discussions for Data Center Bridging [DCB], motivated by the needs of FCoE. The initial congestion control protocol was standardized as 802.1Qau. Unlike the single bit of congestion information per packet in TCP, QCN uses 6 bits.

The algorithm is composed of two main parts: Switch or Control Point (CP) Dynamics and Rate Limiter or Reaction Point (RP) Dynamics.

 - The CP Algorithm runs at the network nodes. Its objective is to maintain the node's buffer occupancy at the operating point 'Beq'. It computes a congestion measure Fb and randomly samples an incoming packet with a probability proportional to the severity of the congestion. The node sends a 6-bit quantized value of Fb back to the source of the sampled packet.

    B: value of the current queue length
    Bold: value of the buffer occupancy when the last feedback message was generated
    w: a non-negative constant, equal to 2 for the baseline implementation
    Boff = B - Beq
    Bd = B - Bold
    Fb = Boff + w*Bd

   This is essentially equivalent to the PI AQM: the first term is the offset from the target operating point and the second term is proportional to the rate at which the queue size is changing. When Fb < 0, there is no congestion, and no feedback messages are sent. When Fb >= 0, either the buffers or the link is oversubscribed, and control action needs to be taken.

 - The RP algorithm runs on end systems (NICs) and controls the rate at which Ethernet packets are transmitted. Unlike TCP, the RP algorithm does not get positive ACKs from the network and thus needs alternative mechanisms for increasing its sending rate.

    Current Rate (Rc): the transmission rate of the source
    Target Rate (Rt): the transmission rate of the source just before the arrival of the last feedback message
    Gain (Gd): a constant chosen so that Gd*|Fbmax| = 1/2 - that is to say, the rate can decrease by at most 50%. Only 6 bits are available for feedback, so Fbmax = 64, and thus Gd = 1/128.
    Byte Counter: a counter at the RP for counting transmitted bytes; used to time rate increases
    Timer: a clock at the RP used for timing rate increases

   Rate Decreases: A rate decrease is only done when a feedback message is received:

    Rt <- Rc
    Rc <- Rc*(1 - Gd*|Fb|)

   Rate Increases: Rate increase is done in two phases: Fast Recovery and Active Increase.

   Fast Recovery (FR): The source enters the FR state immediately after a rate decrease event, at which point the Byte Counter is reset. FR consists of 5 cycles, in each of which 150KB of data (assuming full-sized regular frames) are transmitted (100 packets of 1500 bytes each), as counted by the Byte Counter. At the end of each cycle, Rt remains unchanged, and Rc is updated as follows:

    Rc <- (Rc + Rt)/2

   The rationale is that, when congested, Rate Decrease messages are sent by the CP once every 100 packets; thus the absence of a Rate Decrease message during this interval indicates that the CP is no longer congested.

   Active Increase (AI): After 5 cycles of FR, the source enters the AI state, where it probes for extra bandwidth. AI consists of multiple cycles of 50 packets each. Rt and Rc are updated as follows:

    Rt <- Rt + Rai
    Rc <- (Rc + Rt)/2

    Rai: a constant set to 5Mbps by default.

   When Rc is extremely small after a rate decrease, the time required to send out 150 KB can be excessive.
   To increase the rate of increase, the source also uses a timer, which works as follows:

    1) reset the timer when a rate decrease message arrives
    2) the source enters FR and counts out 5 cycles of T ms duration
       (T = 10ms in the baseline implementation); in the AI state,
       each cycle is T/2 ms long
    3) in the AI state, Rc is updated when _either_ the Byte Counter
       or the Timer completes a cycle
    4) the source is in the AI state iff either the Byte Counter
       or the Timer is in the AI state
    5) if _both_ the Byte Counter and the Timer are in AI, the source is
       said to be in Hyper-Active Increase (HAI). In this case, at the
       completion of the ith Byte Counter and Timer cycle, Rt and Rc
       are updated:

        Rt <- Rt + i*Rhai
        Rc <- (Rc + Rt) / 2

        Rhai: 50Mbps in the baseline

[Taken from "Internet Congestion Control" by Subir Varma, ch. 8]

Performance of Quantized Congestion Notification in TCP Incast Scenarios of Data Centers
 - http://eprints.networks.imdea.org/131/1/Performance_of_Quantized_Congestion_Notification_-_2010_EN.pdf

Using the QCN pseudocode released by Rong Pan [IEEE EDCS-608482], simulated the performance of QCN at 1Gbps under a number of incast scenarios, reaching the conclusion that the default QCN behaviors will not scale to a large number of flows with full link utilization. It goes on to propose a small number of changes to the QCN algorithm that _will_ support a large number of flows at full link utilization. However, there is no indication in the literature that these ideas have been taken any further in practice. A survey paper written in 2014 [A Survey on TCP Incast in Data Center Networks] indicates that these problems still exist. It is unclear what the current state of the art is in shipping hardware.
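Pulling the CP and RP descriptions together, here is a compact sketch using the variable names above. It is illustrative only, not the 802.1Qau reference pseudocode:

    /* Control Point (switch): compute the congestion measure for a sampled
     * packet.  Returns a quantized |Fb| (0 means "no feedback message"). */
    static unsigned int
    qcn_cp_feedback(int b, int b_old, int b_eq, int w /* 2 in the baseline */)
    {
        int b_off = b - b_eq;       /* offset from the operating point */
        int b_d = b - b_old;        /* rate of change of the queue */
        int fb = b_off + w * b_d;

        if (fb < 0)
            return 0;               /* not congested: no feedback */
        return (fb > 63) ? 63 : (unsigned int)fb;  /* fit the 6-bit field */
    }

    /* Reaction Point (NIC): rate decrease on a feedback message, plus the
     * Fast Recovery / Active Increase averaging toward the target rate. */
    struct qcn_rp {
        double rc;                  /* current rate */
        double rt;                  /* target rate */
    };

    static void
    qcn_rp_decrease(struct qcn_rp *rp, unsigned int fb)
    {
        const double gd = 1.0 / 128.0;   /* so that Gd * |Fbmax| = 1/2 */

        rp->rt = rp->rc;
        rp->rc = rp->rc * (1.0 - gd * fb);
    }

    /* One Fast Recovery cycle (150KB sent without further feedback). */
    static void
    qcn_rp_fr_cycle(struct qcn_rp *rp)
    {
        rp->rc = (rp->rc + rp->rt) / 2.0;
    }

    /* One Active Increase cycle: probe for bandwidth (Rai = 5Mbps baseline). */
    static void
    qcn_rp_ai_cycle(struct qcn_rp *rp, double rai)
    {
        rp->rt += rai;
        rp->rc = (rp->rc + rp->rt) / 2.0;
    }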
http://www.ieee802.org/3/ar/public/0505/bergamasco_1_0505.pdf
http://www.ieee802.org/1/files/public/docs2007/au-bergamasco-ecm-v0.1.pdf
http://www.cs.wustl.edu/~jain/papers/ftp/bcn.pdf
http://www.cse.wustl.edu/~jain/papers/ftp/icc08.pdf

Recommendations:

RFC 6298:
 - change the starting RTO from 3s to 1s (in /dctcp)				D4294
 - DO NOT round the RTO up to 1s, contrary to the suggestion here (long done)
 - simplify setting of the minRTO sysctl to eliminate the "slop" component
   (in /dctcp)									D4294

RFC 6928:
 - increase the initial / idle window to 10 segments when connecting to	(done by hiren)
   data center peers

RFC 7323:
 - stop truncating SRTT prematurely on low-latency connections;		D4293
   see Appendix G - this reduces potentially detrimental fluctuations in
   the calculated RTO

Incast:
 - do SW TSO only
 - add rudimentary pacing by interleaving streams
 - fine-grained timers								D4292
 - scale the RTO down to the same granularity as the RTT (patch in progress)

ECN:
 - change the default to allow ECN on incoming connections
 - set ECT on _ALL_ packets sent by a host using a DCTCP connection
 - add a facility to enable ECN by subnet

DCTCP:
 - add a facility to enable DCTCP by subnet
 - set ECT on _ALL_ packets sent by a host using a DCTCP connection
 - update TCP to use microsecond-granularity timers and timestamps (patch in progress)
 - when using the current coarse-grained timers, reduce minRTO to 3ms when	D4294
   using DCTCP; if fine-grained timers are available, disable minRTO when
   using DCTCP
 - reduce delack to 1/5th of min(minRTO, RTO) (reduced to 1/2 in /dctcp)	D4294

ICTCP:
 - if there is time, investigate its use and the ability to use socket buffer
   sizing to communicate the amount of anticipated data, for purposes of TCBs
   sharing the port's connection optimally