Date: Tue, 27 Aug 1996 16:04:48 +0200 (MET DST)
From: Luigi Rizzo <luigi@labinfo.iet.unipi.it>
To: hackers@freebsd.org
Cc: luigi@labinfo.iet.unipi.it (Luigi Rizzo), olah@freefall.freebsd.org, phk@freebsd.org, jkh@freebsd.org
Subject: SACK and other TCP modifications available
Message-ID: <199608271404.QAA06501@labinfo.iet.unipi.it>

It would be nice if someone could try some of the changes below and
send some feedback, and possibly arrange to include this in the
-current sources.

	Luigi
-----------

The file "sack.diffs", available from

	http://www.iet.unipi.it/~luigi/research.html

includes a number of modifications to TCP designed to improve
performance in the presence of losses, namely:

  - MODIFIED FAST RETRANSMIT
  - NEWRENO
  - SACK (Selective Acknowledgements)
  - TSACK (Selective Acknowledgements in RFC1323 timestamps)

The software is in alpha stage, although it has been running for a
couple of weeks in intermediate forms, and it has been running on a
couple of our systems since Aug. 26, 1996.

MODIFIED FAST RETRANSMIT is really helpful on lossy links, and does
not need modifications at the receive side. The same holds for NEWRENO.
SACK (and/or TSACK), especially when combined with MODIFIED FAST
RETRANSMIT, can improve throughput dramatically.

The diffs are against FreeBSD 2.1R, although they should be easy to
port to other BSD-derived systems. Most options must be compiled in
via the kernel config file with

	option SPECIFICOPTIONNAME

and must also be switched on via a sysctl variable in order to
activate them.

Since this code is evolving rapidly, please check the above URL to see
if there is a newer version. In particular, this code still has some
diagnostic output which goes to /var/log/messages.

Bugs, fixes and suggestions can be reported to me at luigi@iet.unipi.it

---------------------------

A brief description of all changes follows:

CLEANUP OF BSD CODE

+ BSD code has some strange ways of updating the count of duplicate
  acks. The count gets reset by some unexpected events (window
  updates, old segments) and does not get checked/reset properly in
  the header prediction code. A number of small fixes try to count
  dupacks more consistently.

+ added a flag, TF_FAST_RXMT, to indicate that we are in fast
  retransmit/fast recovery. This is needed to support a different
  fast retransmit policy, and makes the code somewhat easier to read.

+ the count of retransmitted and dup bytes is now accumulated per
  connection as well as globally. This is useful for statistical
  purposes, and can be used later to determine if a connection is
  experiencing losses or duplicate data.

+ additional variables are added to tcpstat, to count various events.

MODIFIED FAST RETRANSMIT

+ BSD enters fast retransmit when there are 3 consecutive duplicate
  acks; the number 3 was chosen to reduce the chance that a
  reordering of packets in the net is seen as a segment loss.
  However, in the presence of heavy losses, or when the amount of
  outstanding data is small, or the window is narrow, there are so
  few packets in transit that 3 dupacks cannot happen, and the chance
  of a reordering is low. In these situations, 1 or 2 dupacks almost
  certainly mean that a segment has been lost. Instead of waiting for
  a timeout, fast retransmit can be started earlier. This code
  identifies these cases, and lowers the threshold for fast
  retransmit to 2 or 1 dupacks.

  Note 1: in many cases (e.g. telnet, http), there are still a lot of
  timeouts which occur after 0 dupacks, because there is often only
  one segment in flight. We cannot do much about this.

  Note 2: since the tcp control block accumulates statistics on the
  amount of dup/retransmitted data, perhaps this behaviour can be
  made more adaptive if the connection shows a significant reordering
  of segments.

NEWRENO (following a suggestion by J. Hoe)

+ In Reno, after a fast retransmit, a non-dup ack causes exit from
  fast recovery. However, in case of multiple losses in the same
  window, three more dupacks might be needed to detect the next loss,
  and a subsequent fast retransmit would shrink the window even
  further. We save the value of snd_max in snd_max_rxmt at the time
  of the fast retransmit; then, if a non-dup ack does not advance
  snd_una to snd_max_rxmt, the segment at snd_una has been lost and
  can be retransmitted immediately.

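The two sender-side policies above can be illustrated with a small stand-alone sketch. This is not the code from sack.diffs: the message does not give the exact rule used to lower the dupack threshold, so dupack_threshold() below assumes a simple one (with N segments in flight and one of them lost, at most N-1 dupacks can arrive). The struct toy_tcpcb, the enum and the helper names are hypothetical; only snd_una, snd_max and snd_max_rxmt are names taken from the description.

```c
#include <assert.h>
#include <stdint.h>

/* Sequence-space "less than", tolerant of 32-bit wraparound. */
static int seq_lt(uint32_t a, uint32_t b)
{
    return (int32_t)(a - b) < 0;
}

/*
 * ASSUMED rule, not taken from the patch: with N segments in flight
 * and one of them lost, at most N-1 dupacks can come back, so the
 * threshold is lowered when fewer than 4 segments are outstanding.
 */
static int dupack_threshold(uint32_t bytes_in_flight, uint32_t mss)
{
    uint32_t segs;

    assert(mss > 0);
    segs = (bytes_in_flight + mss - 1) / mss;   /* round up */
    if (segs <= 2)
        return 1;
    if (segs == 3)
        return 2;
    return 3;                                   /* classic BSD value */
}

/* Hypothetical, heavily reduced control block. */
struct toy_tcpcb {
    uint32_t snd_una;       /* oldest unacknowledged byte */
    uint32_t snd_max;       /* highest sequence number sent */
    uint32_t snd_max_rxmt;  /* snd_max saved at fast retransmit */
    int      in_fast_rxmt;  /* stands in for the TF_FAST_RXMT flag */
};

enum ack_action { ACK_NORMAL, ACK_RETRANSMIT_UNA, ACK_EXIT_RECOVERY };

/* Called when the dupack threshold is reached. */
static void enter_fast_rxmt(struct toy_tcpcb *tp)
{
    tp->snd_max_rxmt = tp->snd_max;
    tp->in_fast_rxmt = 1;
}

/* Called for an ack that advances snd_una (a non-dup ack). */
static enum ack_action on_new_ack(struct toy_tcpcb *tp, uint32_t ack)
{
    tp->snd_una = ack;
    if (!tp->in_fast_rxmt)
        return ACK_NORMAL;
    if (seq_lt(tp->snd_una, tp->snd_max_rxmt))
        return ACK_RETRANSMIT_UNA;  /* partial ack: another loss in the window */
    tp->in_fast_rxmt = 0;
    return ACK_EXIT_RECOVERY;
}
```

A caller would retransmit the segment starting at snd_una whenever on_new_ack() returns ACK_RETRANSMIT_UNA, without waiting for three more dupacks.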
SACK

+ This is an implementation of the SACK options as described in the
  recent internet draft, with which it is fully compliant. The
  maximum lifetime of SACK information can be set to 0 or more
  timeouts. The retransmission strategy, during fast recovery, is as
  follows: if new data can be sent within snd_wnd and snd_cwnd, then
  do it. Otherwise, old blocks (up to, but not beyond, the last
  SACKed block) are sent again. There is currently no provision to
  resend the block at snd_una if this has been lost twice (a solution
  is in the works).

TSACK

+ This is a simplified version of SACK, which carries SACK
  information embedded in slightly modified RFC1323 timestamps. There
  are some tradeoffs in using TSACKs (almost no need for receiver
  support, less precise SACKs) instead of SACKs, but TSACKs have some
  advantage over SACKs in some cases.

ARTIFICIAL LOSSES

+ in order to test the behaviour of the above code, there is a new
  function, tcp_dropit(), which allows some incoming data and ack
  packets to be dropped. Currently the drop rate is 10% for data
  segments, 5% for pure acks. Segments are dropped using a repetitive
  pattern of 499 segments, in order to make results a bit more
  reproducible (they aren't fully reproducible anyway, because the
  actual generation of ACKs depends on the behaviour of the receiver
  process and there is some interaction with timeouts).

All the above mechanisms can be enabled by setting the variable
net.inet.tcp.sack as follows:

	SACK lifetime	0..15	(0 and 1 are equivalent)
	SACK		0x10	enables sack negotiation and processing
	TSACK		0x20	enables TSACK generation
	MODIFIED_FR	0x40	enables modified fast retransmit
	NEWRENO		0x80	enables newreno
	LOSSY		0x100	enables dropping incoming data/acks

The following kernel options are needed:

	option TSACK	enables TSACK generation
	option SACK	enables SACK code, TSACK processing, LOSSY

Newreno and modified fast retransmit are compiled in by default.

You might also need the following changes to sysctl and netstat.
The former needs to be recompiled with the new tcp_var.h. The patch
below just allows you to enter values as hex numbers instead of
decimal ones. The patch to netstat (which also needs to be recompiled)
is there to allow you to see the additional statistic variables in the
tcpstat structure. Since these variables are allocated at the bottom
of the structure, an older netstat will still work; it just won't
print all the available info.

diff -cbwr /usr.sbin/sysctl/sysctl.c ./sysctl.c
*** /cdrom/usr/src/usr.sbin/sysctl/sysctl.c	Sun Jun 11 06:32:58 1995
--- ./sysctl.c	Mon Aug 19 16:28:31 1996
***************
*** 342,348 ****
  	if (newsize > 0) {
  		switch (type) {
  		case CTLTYPE_INT:
! 			intval = atoi(newval);
  			newval = &intval;
  			newsize = sizeof intval;
  			break;
--- 342,349 ----
  	if (newsize > 0) {
  		switch (type) {
  		case CTLTYPE_INT:
! 			sscanf(newval, "%i", &intval); /* XXX */
! 			/* intval = atoi(newval); */
  			newval = &intval;
  			newsize = sizeof intval;
  			break;
diff -cbwr netstat/inet.c /usr/src/usr.bin/netstat/inet.c
*** netstat/inet.c	Sat Jul 29 11:42:54 1995
--- /usr/src/usr.bin/netstat/inet.c	Fri Aug 23 17:02:49 1996
***************
*** 227,233 ****
--- 227,243 ----
  	p(tcps_conndrops, "\t%d embryonic connection%s dropped\n");
  	p2(tcps_rttupdated, tcps_segstimed,
  		"\t%d segment%s updated rtt (of %d attempt%s)\n");
+ 	p(tcps_zerodupw, "\t%d invalid invalid dupack reset on window update\n");
  	p(tcps_rexmttimeo, "\t%d retransmit timeout%s\n");
+ 	p(tcps_rexmt[0], "\t\t%d retransmit timeout with 0 dup acks\n");
+ 	p(tcps_rexmt[1], "\t\t%d retransmit timeout with 1 dup acks\n");
+ 	p(tcps_rexmt[2], "\t\t%d retransmit timeout with 2 dup acks\n");
+ 	p(tcps_fastretransmit, "\t%d fast retransmit%s\n");
+ 	p(tcps_fastrexmt[0], "\t\t%d with 1 dup ack\n");
+ 	p(tcps_fastrexmt[1], "\t\t%d with 2 dup ack\n");
+ 	p(tcps_fastrexmt[2], "\t\t%d with 3 dup ack\n");
+ 	p(tcps_newreno, "\t%d newreno retrans\n");
+ 	p(tcps_fastrecovery, "\t%d fast recovery\n");
  	p(tcps_timeoutdrop, "\t\t%d connection%s dropped by rexmit timeout\n");
  	p(tcps_persisttimeo, "\t%d persist timeout%s\n");
  	p(tcps_persistdrop, "\t\t%d connection%s dropped by persist timeout\n");
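
To make the encoding of net.inet.tcp.sack concrete, here is a small stand-alone sketch that parses a value the same way the patched sysctl does (sscanf with "%i", which accepts decimal, 0x-prefixed hex and 0-prefixed octal) and splits it into the fields listed earlier. The bit values come from the list in this message; the macro names, struct sack_ctl and parse_sack_ctl() are hypothetical and do not appear in the patch.

```c
#include <assert.h>
#include <stdio.h>

/* Bit layout as described in the message; macro names are made up here. */
#define SACKCTL_LIFETIME_MASK 0x00f  /* SACK lifetime, 0..15 timeouts */
#define SACKCTL_SACK          0x010  /* sack negotiation and processing */
#define SACKCTL_TSACK         0x020  /* TSACK generation */
#define SACKCTL_MODIFIED_FR   0x040  /* modified fast retransmit */
#define SACKCTL_NEWRENO       0x080  /* newreno */
#define SACKCTL_LOSSY         0x100  /* drop incoming data/acks */

/* Hypothetical decoded view of the sysctl value. */
struct sack_ctl {
    int lifetime;
    int sack, tsack, modified_fr, newreno, lossy;
};

/*
 * Parse a decimal, hex ("0x...") or octal ("0...") string and decode it.
 * Returns 0 on success, -1 if no non-negative number could be read.
 */
static int parse_sack_ctl(const char *text, struct sack_ctl *out)
{
    int v;

    assert(text != NULL && out != NULL);
    if (sscanf(text, "%i", &v) != 1 || v < 0)
        return -1;
    out->lifetime    = v & SACKCTL_LIFETIME_MASK;
    out->sack        = (v & SACKCTL_SACK) != 0;
    out->tsack       = (v & SACKCTL_TSACK) != 0;
    out->modified_fr = (v & SACKCTL_MODIFIED_FR) != 0;
    out->newreno     = (v & SACKCTL_NEWRENO) != 0;
    out->lossy       = (v & SACKCTL_LOSSY) != 0;
    return 0;
}
```

For example, the value 0xd2 (or, equivalently, 210 in decimal) selects SACK, modified fast retransmit and newreno with a SACK lifetime of 2, and leaves TSACK and LOSSY off.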