From owner-freebsd-net@FreeBSD.ORG Tue Feb 8 17:01:59 2011 Return-Path: Delivered-To: net@freebsd.org Received: from mx1.freebsd.org (mx1.freebsd.org [IPv6:2001:4f8:fff6::34]) by hub.freebsd.org (Postfix) with ESMTP id B1A0A106566C for ; Tue, 8 Feb 2011 17:01:59 +0000 (UTC) (envelope-from jhb@freebsd.org) Received: from cyrus.watson.org (cyrus.watson.org [65.122.17.42]) by mx1.freebsd.org (Postfix) with ESMTP id 6EABC8FC17 for ; Tue, 8 Feb 2011 17:01:59 +0000 (UTC) Received: from bigwig.baldwin.cx (66.111.2.69.static.nyinternet.net [66.111.2.69]) by cyrus.watson.org (Postfix) with ESMTPSA id 06D5E46B06 for ; Tue, 8 Feb 2011 12:01:59 -0500 (EST) Received: from jhbbsd.localnet (unknown [209.249.190.10]) by bigwig.baldwin.cx (Postfix) with ESMTPSA id 101528A009 for ; Tue, 8 Feb 2011 12:01:58 -0500 (EST) From: John Baldwin To: net@freebsd.org Date: Tue, 8 Feb 2011 11:55:33 -0500 User-Agent: KMail/1.13.5 (FreeBSD/7.4-CBSD-20110107; KDE/4.4.5; amd64; ; ) MIME-Version: 1.0 Content-Type: Text/Plain; charset="us-ascii" Content-Transfer-Encoding: 7bit Message-Id: <201102081155.33268.jhb@freebsd.org> X-Greylist: Sender succeeded SMTP AUTH, not delayed by milter-greylist-4.2.6 (bigwig.baldwin.cx); Tue, 08 Feb 2011 12:01:58 -0500 (EST) X-Virus-Scanned: clamav-milter 0.96.3 at bigwig.baldwin.cx X-Virus-Status: Clean X-Spam-Flag: YES X-Spam-Status: Yes, score=6.8 required=4.2 tests=BAYES_00,MAY_BE_FORGED, RDNS_DYNAMIC, TO_NO_BRKTS_DIRECT, TO_NO_BRKTS_DYNIP autolearn=no version=3.3.1 X-Spam-Report: * -1.9 BAYES_00 BODY: Bayes spam probability is 0 to 1% * [score: 0.0000] * 1.0 RDNS_DYNAMIC Delivered to internal network by host with * dynamic-looking rDNS * 1.4 MAY_BE_FORGED Relay IP's reverse DNS does not resolve to IP * 2.6 TO_NO_BRKTS_DIRECT To: misformatted and direct-to-MX * 3.7 TO_NO_BRKTS_DYNIP To: misformatted and dynamic rDNS X-Spam-Level: ****** X-Spam-Checker-Version: SpamAssassin 3.3.1 (2010-03-16) on bigwig.baldwin.cx Cc: Subject: TCP can advertise a really huge window X-BeenThere: freebsd-net@freebsd.org X-Mailman-Version: 2.1.5 Precedence: list List-Id: Networking and TCP/IP with FreeBSD List-Unsubscribe: , List-Archive: List-Post: List-Help: List-Subscribe: , X-List-Received-Date: Tue, 08 Feb 2011 17:01:59 -0000 This is a very bizarre edge case, so bear with me. I was debugging an edge case at work recently that occurred when the socket buffer was filled up exactly (i.e. sbspace(&so->so_rcv) == 0). In TCP terms, this would be when rcv_nxt == rcv_adv. To simulate the real workload I had a very fast writer blasting over lo0 to a slow reader but had used a small buffer and turned off window scaling (as I had to fill the entire socket buffer, I was chasing an off-by-1 bug). However, I ended up with some bizarre behavior. I think it is less confusing to describe the sequence of events that I now know happened than how I figured this out, so here goes. - Assume we have advertised a window size of N which corresponds exactly to sbspace(&so->so_rcv). - The remote peer sends a packet of length N filling our window. We respond with a zero-window ACK. This advances rcv_nxt to == rcv_adv, but it does not grow rcv_adv because sbspace() is currently 0. - The userland app very slowly drains data from the socket buffer. However, the calls to tcp_usr_recvd() do not trigger a window update because in this case the link is over lo0 which has a relatively large t_maxseg (about 14k) and this condition in tcp_output() is not met: if (adv >= (long) (2 * tp->t_maxseg)) goto send; if (2 * adv >= (long) so->so_rcv.sb_hiwat) goto send; - A timer at the remote peer expires and it sends a window probe with one byte of data. Since userland has read some data (just not 2 * MSS), we accept this packet. However, receiving this packet moves rcv_nxt += 1, so rcv_nxt is now > rcv_adv. - We call tcp_output() to ACK the window probe and as part of this calculate the receive window to advertise here: if (recwin < (long)(so->so_rcv.sb_hiwat / 4) && recwin < (long)tp->t_maxseg) recwin = 0; if (recwin < (long)(tp->rcv_adv - tp->rcv_nxt)) recwin = (long)(tp->rcv_adv - tp->rcv_nxt); if (recwin > (long)TCP_MAXWIN << tp->rcv_scale) recwin = (long)TCP_MAXWIN << tp->rcv_scale; The "surprise" kicks in on the second conditional. The problem is that rcv_adv - rcv_nxt is now equal to (uint32_t)-1. On a 32-bit machine the cast to (long) effectively just makes this value signed and thus -1. On a 64-bit machine you actually end up with a ginormous value of 2^32 - 1, or a 4GB window (minus a byte). The third conditional truncates that to the maximum window we can advertise, but this value may be larger than the actual space in the socket buffer. The remote peer now has a huge window to throw data into. At work this proved disastrous. I'm not sure if there are any practical concerns. This is the patch I'm using as a fix: Index: tcp_output.c =================================================================== --- tcp_output.c (revision 215582) +++ tcp_output.c (working copy) @@ -928,7 +928,8 @@ if (recwin < (long)(so->so_rcv.sb_hiwat / 4) && recwin < (long)tp->t_maxseg) recwin = 0; - if (recwin < (long)(tp->rcv_adv - tp->rcv_nxt)) + if (SEQ_GT(tp->rcv_adv, tp->rcv_nxt) && + recwin < (long)(tp->rcv_adv - tp->rcv_nxt)) recwin = (long)(tp->rcv_adv - tp->rcv_nxt); if (recwin > (long)TCP_MAXWIN << tp->rcv_scale) recwin = (long)TCP_MAXWIN << tp->rcv_scale; -- John Baldwin