Date: Tue, 8 Feb 2011 13:26:21 -0500
From: John Baldwin <jhb@freebsd.org>
To: net@freebsd.org
Subject: TCP connections stuck in persist state
Message-ID: <201102081326.21183.jhb@freebsd.org>
I ran into a problem recently where a TCP socket seemed to never exit persist mode.  What would happen is that the sender was blasting data faster than the receiver could receive it.  When the receiver read some data, the sender would start sending again and everything would resume.  However, after 3-4 instances of this, the sender would decide not to resume sending data when the receiver opened the window.  Instead, it would slowly send a byte every few seconds via the persist timer, even though the receiver was advertising a 64k window when it ACKed each of the window probes sent by the sender.

I dug around in kgdb and found that both snd_cwnd and snd_ssthresh were set to 0 on the sender side.  I think this means that the send window is effectively permanently stuck at zero as a result.  (The tcpcb is also IN_FASTRECOVERY() on the sender side, probably from the storm of duplicate ACKs from the receiver when it sends a bunch of window updates; see my earlier e-mail to net@ for the source of the duplicate ACKs.)

Anyway, I think that this code in tcp_input() is what keeps the window at zero:

	/*
	 * If the congestion window was inflated to account
	 * for the other side's cached packets, retract it.
	 */
	if (tcp_do_newreno || (tp->t_flags & TF_SACK_PERMIT)) {
		if (IN_FASTRECOVERY(tp)) {
			if (SEQ_LT(th->th_ack, tp->snd_recover)) {
				if (tp->t_flags & TF_SACK_PERMIT)
					tcp_sack_partialack(tp, th);
				else
					tcp_newreno_partial_ack(tp, th);
			} else {
				/*
				 * Out of fast recovery.
				 * Window inflation should have left us
				 * with approximately snd_ssthresh
				 * outstanding data.
				 * But in case we would be inclined to
				 * send a burst, better to do it via
				 * the slow start mechanism.
				 */
				KASSERT(tp->snd_ssthresh != 0,
				    ("using bogus snd_ssthresh"));
				if (SEQ_GT(th->th_ack + tp->snd_ssthresh,
				    tp->snd_max))
					tp->snd_cwnd = tp->snd_max -
					    th->th_ack + tp->t_maxseg;
				else
					tp->snd_cwnd = tp->snd_ssthresh;
			}
		}
	}

Specifically, since snd_recover and snd_una seem to keep advancing in lock-step with each window update, I think it ends up falling through to the last statement each time, where snd_cwnd = snd_ssthresh, thus keeping snd_cwnd at 0.  This then causes a zero send window in tcp_output():

	sendwin = min(tp->snd_wnd, tp->snd_cwnd);
	sendwin = min(sendwin, tp->snd_bwnd);

Now, looking at the code, I can see no way that snd_ssthresh should ever be zero.  It seems to always be calculated from some number of segments times t_maxseg.  The one exception to this rule is when it is restored from snd_ssthresh_prev due to a bad retransmit (this example is from tcp_input()):

	/*
	 * If we just performed our first retransmit, and the ACK
	 * arrives within our recovery window, then it was a mistake
	 * to do the retransmit in the first place.  Recover our
	 * original cwnd and ssthresh, and proceed to transmit where
	 * we left off.
	 */
	if (tp->t_rxtshift == 1 && (int)(ticks - tp->t_badrxtwin) < 0) {
		++tcpstat.tcps_sndrexmitbad;
		tp->snd_cwnd = tp->snd_cwnd_prev;
		tp->snd_ssthresh = tp->snd_ssthresh_prev;
		tp->snd_recover = tp->snd_recover_prev;
		if (tp->t_flags & TF_WASFRECOVERY)
			ENTER_FASTRECOVERY(tp);
		tp->snd_nxt = tp->snd_max;
		tp->t_badrxtwin = 0;	/* XXX probably not required */
	}

So my working theory is that somehow snd_ssthresh_prev is being used when it hasn't been initialized.  I then checked 'ticks' on my host and found that it had wrapped to a negative integer value.  That means that when t_badrxtwin is zero (which is its default value), the second half of that expression is always true.  This is harmless, however, as long as t_rxtshift is always 0 while snd_ssthresh_prev and friends aren't initialized.
That brings me to tcp_setpersist():

	void
	tcp_setpersist(struct tcpcb *tp)
	{
		int t = ((tp->t_srtt >> 2) + tp->t_rttvar) >> 1;
		int tt;

		if (tcp_timer_active(tp, TT_REXMT))
			panic("tcp_setpersist: retransmit pending");
		/*
		 * Start/restart persistance timer.
		 */
		TCPT_RANGESET(tt, t * tcp_backoff[tp->t_rxtshift],
		    TCPTV_PERSMIN, TCPTV_PERSMAX);
		tcp_timer_activate(tp, TT_PERSIST, tt);
		if (tp->t_rxtshift < TCP_MAXRXTSHIFT)
			tp->t_rxtshift++;
	}

Suppose that a connection has never had to retransmit a packet, so snd_ssthresh_prev and friends are all set to their initial values of zero.  Suppose that the receiver's socket then fills up, eventually triggering the persist timer.  When tcp_setpersist() is called, it will set t_rxtshift to 1 with all those *_prev fields invalid.  The next ACK that arrives in tcp_input() with the 'ticks' value negative will trash snd_cwnd and friends from the bogus *_prev fields.

I believe this was exposed relatively recently when I fixed problems with 'ticks' rollover a while back: the tests used to be of the form 'if (ticks < tp->t_badrxtwin)', which had a different set of bugs, but that would explain why this hasn't been seen before.

I was never able to come up with a small test case for this.  The only way I could reproduce it and test the fix was using mysql (unfortunately), though we believe we have also seen this in different production workloads as well.

After some whiteboard discussion with gnn@ I came up with a firmer timeline of exactly how this state was triggered on the sender:

1) The receiver is "busy" for a while and does not service the socket, causing the receive buffer to fill.  Once the receiver does start reading the socket, it drains the entire socket, resulting in a small storm of ACKs all for the same sequence number.
The window is growing during the ACKs; however, due to a bug in our window update code and large scale values (we have a scale factor of 8k because we increase the maximum socket buffer size), several of the ACKs end up being duplicates (e.g. 3 ACKs with a window of 8k, then 3 ACKs with a window of 16k, then 3 with 24k, etc.).  I think these ACKs trigger fast recovery, so that IN_FASTRECOVERY() is true for the sender.

2) This cycle happens a second time on the receiver (getting busy, then a flood of ACKs).  In this case, the receiver is busy long enough to cause the sender to enter the persist state, which sets t_rxtshift to one.

3) At this point an ACK arrives after the receiver has started reading again.  Because 'ticks' is negative, t_badrxtwin is zero, and t_rxtshift is one, the code thinks this ACK should trigger 'bad retransmit recovery'.  However, snd_ssthresh_prev and snd_cwnd_prev are both zero, so this sets both snd_ssthresh and snd_cwnd to zero.

4) Because IN_FASTRECOVERY() is true, when an ACK arrives it triggers either the NewReno or SACK partial ACK handling code, which clamps snd_cwnd to be no larger than snd_ssthresh, so snd_cwnd stays zero.

It seems that once the socket is "stuck", each ACK to the window probes is treated as a partial ACK, perpetuating the zero congestion window forever.

The solution I came up with was to add a new flag to indicate when the *_prev fields are valid and to test that before triggering bad retransmit recovery.  The patch:

Index: netinet/tcp_input.c
===================================================================
--- netinet/tcp_input.c	(.../mirror/FreeBSD/stable/7/sys)	(revision 219412)
+++ netinet/tcp_input.c	(.../stable/7/sys)	(revision 219412)
@@ -1030,6 +1031,7 @@
 	 * "bad retransmit" recovery.
 	 */
 	if (tp->t_rxtshift == 1 &&
+	    tp->t_flags & TF_PREVVALID &&
 	    (int)(ticks - tp->t_badrxtwin) < 0) {
 		++tcpstat.tcps_sndrexmitbad;
 		tp->snd_cwnd = tp->snd_cwnd_prev;
@@ -1039,6 +1041,7 @@
 		if (tp->t_flags & TF_WASFRECOVERY)
 			ENTER_FASTRECOVERY(tp);
 		tp->snd_nxt = tp->snd_max;
+		tp->t_flags &= ~TF_PREVVALID;
 		tp->t_badrxtwin = 0;
 	}
@@ -1961,7 +1968,8 @@
 	 * original cwnd and ssthresh, and proceed to transmit where
 	 * we left off.
 	 */
-	if (tp->t_rxtshift == 1 && (int)(ticks - tp->t_badrxtwin) < 0) {
+	if (tp->t_rxtshift == 1 && tp->t_flags & TF_PREVVALID &&
+	    (int)(ticks - tp->t_badrxtwin) < 0) {
 		++tcpstat.tcps_sndrexmitbad;
 		tp->snd_cwnd = tp->snd_cwnd_prev;
 		tp->snd_ssthresh = tp->snd_ssthresh_prev;
@@ -1969,6 +1977,7 @@
 		if (tp->t_flags & TF_WASFRECOVERY)
 			ENTER_FASTRECOVERY(tp);
 		tp->snd_nxt = tp->snd_max;
+		tp->t_flags &= ~TF_PREVVALID;
 		tp->t_badrxtwin = 0;	/* XXX probably not required */
 	}
Index: netinet/tcp_timer.c
===================================================================
--- netinet/tcp_timer.c	(.../mirror/FreeBSD/stable/7/sys)	(revision 219412)
+++ netinet/tcp_timer.c	(.../stable/7/sys)	(revision 219412)
@@ -476,7 +476,9 @@
 		else
 			tp->t_flags &= ~TF_WASFRECOVERY;
 		tp->t_badrxtwin = ticks + (tp->t_srtt >> (TCP_RTT_SHIFT + 1));
-	}
+		tp->t_flags |= TF_PREVVALID;
+	} else
+		tp->t_flags &= ~TF_PREVVALID;
 	tcpstat.tcps_rexmttimeo++;
 	if (tp->t_state == TCPS_SYN_SENT)
 		rexmt = TCP_REXMTVAL(tp) * tcp_syn_backoff[tp->t_rxtshift];
Index: netinet/tcp_var.h
===================================================================
--- netinet/tcp_var.h	(.../mirror/FreeBSD/stable/7/sys)	(revision 219412)
+++ netinet/tcp_var.h	(.../stable/7/sys)	(revision 219412)
@@ -201,6 +221,7 @@
 #define	TF_NEEDSYN	0x000400	/* send SYN (implicit state) */
 #define	TF_NEEDFIN	0x000800	/* send FIN (implicit state) */
 #define	TF_NOPUSH	0x001000	/* don't push */
+#define	TF_PREVVALID	0x002000	/* saved values for bad rxmit valid */
 #define	TF_MORETOCOME	0x010000	/* More data to be appended to sock */
 #define	TF_LQ_OVERFLOW	0x020000	/* listen queue overflow */
 #define	TF_LASTIDLE	0x040000	/* connection was previously idle */
Index: netinet/tcp_output.c
===================================================================
--- netinet/tcp_output.c	(.../mirror/FreeBSD/stable/7/sys)	(revision 219412)
+++ netinet/tcp_output.c	(.../stable/7/sys)	(revision 219412)
@@ -1258,6 +1269,7 @@
 	int t = ((tp->t_srtt >> 2) + tp->t_rttvar) >> 1;
 	int tt;
 
+	tp->t_flags &= ~TF_PREVVALID;
 	if (tcp_timer_active(tp, TT_REXMT))
 		panic("tcp_setpersist: retransmit pending");
 	/*

-- 
John Baldwin