From: John Baldwin <jhb@freebsd.org>
To: net@freebsd.org
Date: Tue, 8 Feb 2011 12:01:54 -0500
Message-Id: <201102081201.54250.jhb@freebsd.org>
Subject: A small TCP bug: excessive duplicate ACKs

One thing I've noticed at work is that if a receiver's socket buffer fills and
the receiver then drains the buffer all at once, we send a lot of duplicate
ACKs.  I narrowed this down to the abnormally high window scaling factor we
have.  We set kern.ipc.maxsockbuf to 314572800, which results in a window
scaling factor of 8k (that is, tp->rcv_scale is 13).  This interacts poorly
with the logic that decides whether or not to force a window update in
tcp_output():

        /*
         * Compare available window to amount of window
         * known to peer (as advertised window less
         * next expected input).  If the difference is at least two
         * max size segments, or at least 50% of the maximum possible
         * window, then want to send a window update to peer.
         * Skip this if the connection is in T/TCP half-open state.
         * Don't send pure window updates when the peer has closed
         * the connection and won't ever send more data.
         */
        if (recwin > 0 && !(tp->t_flags & TF_NEEDSYN) &&
            !TCPS_HAVERCVDFIN(tp->t_state)) {
                /*
                 * "adv" is the amount we can increase the window,
                 * taking into account that we are limited by
                 * TCP_MAXWIN << tp->rcv_scale.
                 */
                long adv = min(recwin, (long)TCP_MAXWIN << tp->rcv_scale) -
                        (tp->rcv_adv - tp->rcv_nxt);

                if (adv >= (long) (2 * tp->t_maxseg))
                        goto send;
                if (2 * adv >= (long) so->so_rcv.sb_hiwat)
                        goto send;
        }

Specifically, we can send a duplicate ACK when either (2 * tp->t_maxseg) or
(so->so_rcv.sb_hiwat / 2) is less than the window scaling factor: the window
can then grow by enough to pass one of these checks without the scaled value
that actually goes on the wire changing.  I have a test app that you can run
against a TCP chargen service from inetd to reproduce it.  I also have two TCP
dumps, from before and after the patch.  The patch I'm using to fix this is
below (I could perhaps rework it to not use the extra goto, but went with a
simple hack to minimize reindenting for now):

Index: tcp_output.c
===================================================================
--- tcp_output.c        (revision 217650)
+++ tcp_output.c        (working copy)
@@ -560,11 +560,19 @@
                long adv = min(recwin, (long)TCP_MAXWIN << tp->rcv_scale) -
                        (tp->rcv_adv - tp->rcv_nxt);
 
+               /* 
+                * If the new window size ends up being the same as the old
+                * size when it is scaled, then don't force a window update.
+                */
+               if ((tp->rcv_adv - tp->rcv_nxt) >> tp->rcv_scale ==
+                   (adv + tp->rcv_adv - tp->rcv_nxt) >> tp->rcv_scale)
+                       goto dontupdate;
                if (adv >= (long) (2 * tp->t_maxseg))
                        goto send;
                if (2 * adv >= (long) so->so_rcv.sb_hiwat)
                        goto send;
        }
+dontupdate:
 
        /*
         * Send if we owe the peer an ACK, RST, SYN, or urgent data.  ACKNOW

Note that if the ACK sequence number has moved, then I think other checks in
tcp_output() will still force an ACK packet out, so I don't think this will
cause us to miss sending ACKs to the peer.
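
To make the effect of the added test easier to see in isolation, here is a
rough user-space model of the patched check.  It is only a sketch: the
function and parameter names are made up for illustration and do not exist in
the kernel, and the surrounding conditions (recwin > 0, TF_NEEDSYN, FIN
received) are left out.

```c
#include <assert.h>
#include <stdbool.h>

/*
 * Hypothetical user-space model of the window update check with the
 * patch applied.  None of these names exist in the kernel.
 *
 *   oldwin  window the peer already knows about (rcv_adv - rcv_nxt)
 *   adv     additional window we could advertise now
 *   scale   receive window shift (rcv_scale), 0..14
 *   maxseg  maximum segment size (t_maxseg)
 *   hiwat   receive buffer high water mark (so_rcv.sb_hiwat)
 */
bool
model_scaled_window_changes(long oldwin, long adv, int scale)
{
    assert(scale >= 0 && scale <= 14);
    assert(oldwin >= 0 && oldwin + adv >= 0);
    return ((oldwin >> scale) != ((oldwin + adv) >> scale));
}

bool
model_want_window_update(long oldwin, long adv, int scale, long maxseg,
    long hiwat)
{
    /* The added test: same value on the wire, so nothing to announce. */
    if (!model_scaled_window_changes(oldwin, adv, scale))
        return (false);
    /* The two existing tests, unchanged. */
    if (adv >= 2 * maxseg)
        return (true);
    if (2 * adv >= hiwat)
        return (true);
    return (false);
}
```

With a shift of 13 and an assumed MSS of 1448,
model_want_window_update(0, 2896, 13, 1448, 65536) is false because
2896 >> 13 is still 0, while the same call with adv = 8192 is true.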

You can find the test app source (tcpslow.c) and the dumps at 
http://people.freebsd.org/~jhb/tcpslow/

If you look at tcp_bad.out, the receiver stops reading data and its socket
buffer fills up around packet 72 or so.  The receiver wakes up at packet 88
and drains the buffer, causing a small storm of window updates.  However, due
to the scaling factor, it actually sends duplicate ACKs in batches of three
(3 ACKs for an 8k window, 3 ACKs for a 16k window, etc.).  This happens each
time the receiver wakes up and drains a full socket buffer.  The tcp_good.out
dump shows the stream with the patch applied.  A similar event of the receiver
draining a full buffer starts at packet 83, and it sends a single ACK for each
"real" window update.
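
For anyone who wants to play with the numbers without the dumps, the following
is a toy model of the drain, not a reproduction of the trace.  It assumes an
MSS of 1448, a reader that frees one MSS of buffer space per read, a receive
buffer large enough that the 50% rule never fires, and that the shift is the
smallest one for which TCP_MAXWIN << shift covers kern.ipc.maxsockbuf.  All
names are hypothetical.

```c
#include <assert.h>
#include <stdbool.h>

struct model_result {
    int updates;            /* window updates forced out */
    int distinct_windows;   /* different scaled values actually sent */
};

/* Assumed shift selection: smallest shift whose max window covers maxbuf. */
int
model_window_shift(long maxbuf)
{
    int shift = 0;

    while (shift < 14 && (65535L << shift) < maxbuf)
        shift++;
    return (shift);
}

/*
 * Drain "total" bytes from a full buffer, "chunk" bytes per read, and
 * count the updates forced by the 2 * MSS rule, with or without the
 * scaled-window test from the patch.
 */
struct model_result
model_drain(long total, long chunk, long mss, int scale, bool patched)
{
    struct model_result r = { 0, 0 };
    long recwin = 0;        /* free space in the receive buffer */
    long known = 0;         /* window the peer already knows about */
    long last_sent = -1;    /* last scaled window put on the wire */

    assert(chunk > 0 && scale >= 0 && scale <= 14);
    while (recwin < total) {
        recwin += chunk;
        if (recwin > total)
            recwin = total;
        if (patched && (known >> scale) == (recwin >> scale))
            continue;       /* same value on the wire, skip */
        if (recwin - known < 2 * mss)
            continue;       /* adv too small to force an update */
        r.updates++;
        if ((recwin >> scale) != last_sent) {
            r.distinct_windows++;
            last_sent = recwin >> scale;
        }
        known = recwin;
    }
    return (r);
}
```

model_window_shift(314572800) gives 13.  Draining 65536 bytes with that shift,
the unpatched model forces 22 updates carrying only 8 different window values,
roughly three per 8k step since 8192 / 2896 is about 2.8; the patched model
forces 8 updates for 8 values.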

-- 
John Baldwin