Skip site navigation (1)Skip section navigation (2)
Date:      Tue, 8 Feb 2011 11:55:33 -0500
From:      John Baldwin <jhb@freebsd.org>
To:        net@freebsd.org
Subject:   TCP can advertise a really huge window
Message-ID:  <201102081155.33268.jhb@freebsd.org>

next in thread | raw e-mail | index | archive | help
This is a very bizarre edge case, so bear with me.  I was debugging an edge 
case at work recently that occurred when the socket buffer was filled up 
exactly (i.e. sbspace(&so->so_rcv) == 0).  In TCP terms, this would be when 
rcv_nxt == rcv_adv.  To simulate the real workload I had a very fast writer 
blasting over lo0 to a slow reader but had used a small buffer and turned off 
window scaling (as I had to fill the entire socket buffer, I was chasing an 
off-by-1 bug).  However, I ended up with some bizarre behavior.

I think it is less confusing to describe the sequence of events that I now 
know happened than how I figured this out, so here goes.

- Assume we have advertised a window size of N which corresponds exactly to
  sbspace(&so->so_rcv).
- The remote peer sends a packet of length N filling our window.  We respond
  with a zero-window ACK.  This advances rcv_nxt to == rcv_adv, but it does
  not grow rcv_adv because sbspace() is currently 0.
- The userland app very slowly drains data from the socket buffer.  However,
  the calls to tcp_usr_recvd() do not trigger a window update because in this
  case the link is over lo0 which has a relatively large t_maxseg (about 14k)
  and this condition in tcp_output() is not met:

                if (adv >= (long) (2 * tp->t_maxseg))
                        goto send;
                if (2 * adv >= (long) so->so_rcv.sb_hiwat)
                        goto send;

- A timer at the remote peer expires and it sends a window probe with one
  byte of data.  Since userland has read some data (just not 2 * MSS), we
  accept this packet.  However, receiving this packet moves rcv_nxt += 1,
  so rcv_nxt is now > rcv_adv.
- We call tcp_output() to ACK the window probe and as part of this calculate
  the receive window to advertise here:

        if (recwin < (long)(so->so_rcv.sb_hiwat / 4) &&
            recwin < (long)tp->t_maxseg)
                recwin = 0;
        if (recwin < (long)(tp->rcv_adv - tp->rcv_nxt))
                recwin = (long)(tp->rcv_adv - tp->rcv_nxt);
        if (recwin > (long)TCP_MAXWIN << tp->rcv_scale)
                recwin = (long)TCP_MAXWIN << tp->rcv_scale;

The "surprise" kicks in on the second conditional.  The problem is that 
rcv_adv - rcv_nxt is now equal to (uint32_t)-1.  On a 32-bit machine the cast 
to (long) effectively just makes this value signed and thus -1.  On a 64-bit 
machine you actually end up with a ginormous value of 2^32 - 1, or a 4GB 
window (minus a byte).  The third conditional truncates that to the maximum 
window we can advertise, but this value may be larger than the actual space in 
the socket buffer.  The remote peer now has a huge window to throw data into.

At work this proved disastrous.  I'm not sure if there are any practical 
concerns.  This is the patch I'm using as a fix:

Index: tcp_output.c
===================================================================
--- tcp_output.c        (revision 215582)
+++ tcp_output.c        (working copy)
@@ -928,7 +928,8 @@
        if (recwin < (long)(so->so_rcv.sb_hiwat / 4) &&
            recwin < (long)tp->t_maxseg)
                recwin = 0;
-       if (recwin < (long)(tp->rcv_adv - tp->rcv_nxt))
+       if (SEQ_GT(tp->rcv_adv, tp->rcv_nxt) &&
+           recwin < (long)(tp->rcv_adv - tp->rcv_nxt))
                recwin = (long)(tp->rcv_adv - tp->rcv_nxt);
        if (recwin > (long)TCP_MAXWIN << tp->rcv_scale)
                recwin = (long)TCP_MAXWIN << tp->rcv_scale;

-- 
John Baldwin



Want to link to this message? Use this URL: <https://mail-archive.FreeBSD.org/cgi/mid.cgi?201102081155.33268.jhb>