From: John Baldwin
To: net@freebsd.org
Date: Tue, 8 Feb 2011 12:01:54 -0500
Subject: A small TCP bug: excessive duplicate ACKs
Message-Id: <201102081201.54250.jhb@freebsd.org>

One thing I've noticed at work is that if a receiver's socket buffer
fills and the receiver then drains the buffer all at once, we send a lot
of duplicate ACKs.  I narrowed this down to being due to the abnormally
high window scaling factor we have.  We set kern.ipc.maxsockbuf to
314572800 which results in a window scaling factor of 8k.  This interacts
poorly with the logic that decides whether or not to force a window
update in tcp_output():

	/*
	 * Compare available window to amount of window
	 * known to peer (as advertised window less
	 * next expected input).  If the difference is at least two
	 * max size segments, or at least 50% of the maximum possible
	 * window, then want to send a window update to peer.
	 * Skip this if the connection is in T/TCP half-open state.
	 * Don't send pure window updates when the peer has closed
	 * the connection and won't ever send more data.
	 */
	if (recwin > 0 && !(tp->t_flags & TF_NEEDSYN) &&
	    !TCPS_HAVERCVDFIN(tp->t_state)) {
		/*
		 * "adv" is the amount we can increase the window,
		 * taking into account that we are limited by
		 * TCP_MAXWIN << tp->rcv_scale.
		 */
		long adv = min(recwin, (long)TCP_MAXWIN << tp->rcv_scale) -
			(tp->rcv_adv - tp->rcv_nxt);

		if (adv >= (long) (2 * tp->t_maxseg))
			goto send;
		if (2 * adv >= (long) so->so_rcv.sb_hiwat)
			goto send;
	}

Specifically, we can send a duplicate ACK when (2 * tp->t_maxseg) or
(so->so_rcv.sb_hiwat / 2) are less than the window scaling factor.

I have a test app that you can run against a TCP chargen service from
inetd to reproduce it.  I also have two TCP dumps from before and after.

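To make the interaction concrete, here is a minimal standalone sketch in
C (not the kernel code) of why a forced update can carry the same window
field as the previous ACK.  The helper names wire_window() and
forced_update_is_duplicate() are made up for illustration, and
TCP_MAXWIN and TCP_MAX_WINSHIFT are redefined locally with their usual
values so the snippet builds outside the kernel.

```c
#include <assert.h>
#include <stdint.h>

/* Local copies of the usual values, so this builds outside the kernel. */
#define TCP_MAXWIN		65535	/* largest 16-bit window field */
#define TCP_MAX_WINSHIFT	14	/* largest window scale shift */

/*
 * Hypothetical helper: the 16-bit window field that goes on the wire
 * for a receive window of "win" bytes with the given scale shift.
 */
static uint16_t
wire_window(long win, int rcv_scale)
{
	assert(rcv_scale >= 0 && rcv_scale <= TCP_MAX_WINSHIFT);
	if (win < 0)
		win = 0;
	if (win > ((long)TCP_MAXWIN << rcv_scale))
		win = (long)TCP_MAXWIN << rcv_scale;
	return ((uint16_t)(win >> rcv_scale));
}

/*
 * Hypothetical helper: returns 1 when growing the window from old_win
 * to old_win + adv bytes passes one of the two "force a window update"
 * tests quoted above, yet the peer would see an unchanged window field.
 * That is the case that shows up as a duplicate ACK.
 */
static int
forced_update_is_duplicate(long old_win, long adv, long maxseg,
    long hiwat, int rcv_scale)
{
	int forced = (adv >= 2 * maxseg) || (2 * adv >= hiwat);

	return (forced &&
	    wire_window(old_win, rcv_scale) ==
	    wire_window(old_win + adv, rcv_scale));
}
```

With assumed sample values (an MSS of 1448 and an sb_hiwat of 65536,
neither taken from the dumps), forced_update_is_duplicate(0, 2896, 1448,
65536, 13) returns 1: the window grew by two segments, but both the old
and new sizes shift down to 0 with an 8k scaling factor.  The same call
with rcv_scale set to 0 returns 0, since the peer then sees the full
2896 bytes.
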
The patch I'm using to fix this is below (I could rework it to not use
the extra goto perhaps, but went with a simple hack to minimize
reindenting for now):

Index: tcp_output.c
===================================================================
--- tcp_output.c	(revision 217650)
+++ tcp_output.c	(working copy)
@@ -560,11 +560,19 @@
 		long adv = min(recwin, (long)TCP_MAXWIN << tp->rcv_scale) -
 			(tp->rcv_adv - tp->rcv_nxt);
 
+		/*
+		 * If the new window size ends up being the same as the old
+		 * size when it is scaled, then don't force a window update.
+		 */
+		if ((tp->rcv_adv - tp->rcv_nxt) >> tp->rcv_scale ==
+		    (adv + tp->rcv_adv - tp->rcv_nxt) >> tp->rcv_scale)
+			goto dontupdate;
 		if (adv >= (long) (2 * tp->t_maxseg))
 			goto send;
 		if (2 * adv >= (long) so->so_rcv.sb_hiwat)
 			goto send;
 	}
+dontupdate:
 
 	/*
 	 * Send if we owe the peer an ACK, RST, SYN, or urgent data.  ACKNOW

Note that if the ACK sequence number has moved then I think other checks
in tcp_output() will still force an ACK packet out, so I don't think
this will cause us to miss sending ACKs to the peer.

You can find the test app source (tcpslow.c) and the dumps at
http://people.freebsd.org/~jhb/tcpslow/

If you look at tcp_bad.out, the receiver stops reading data and its
socket buffer fills up around packet 72 or so.  The receiver wakes up at
packet 88 and drains the buffer, causing a small storm of window updates.
However, due to the scaling factor, it actually sends duplicate ACKs in
batches of three (3 ACKs for 8k window, 3 ACKs for 16k window, etc.).
This happens each time the receiver wakes up and drains a full socket
buffer.

The tcp_good.out dump shows the stream with the patch applied.  A similar
event of the receiver draining a full buffer starts at packet 83, and it
sends a single ACK for each "real" window update.

-- 
John Baldwin