Skip site navigation (1)Skip section navigation (2)
Date:      Sat, 14 May 2011 13:37:19 +0300
From:      Mikolaj Golub <trociny@freebsd.org>
To:        John Baldwin <jhb@FreeBSD.org>
Cc:        svn-src-head@freebsd.org, svn-src-all@freebsd.org, src-committers@freebsd.org
Subject:   Re: svn commit: r221346 - head/sys/netinet
Message-ID:  <86pqnlbmao.fsf@kopusha.home.net>
In-Reply-To: <201105022105.p42L5q3j054498@svn.freebsd.org> (John Baldwin's message of "Mon, 2 May 2011 21:05:52 %2B0000 (UTC)")
References:  <201105022105.p42L5q3j054498@svn.freebsd.org>

next in thread | previous in thread | raw e-mail | index | archive | help
Hi,

On Mon, 2 May 2011 21:05:52 +0000 (UTC) John Baldwin wrote:

 JB> Author: jhb
 JB> Date: Mon May  2 21:05:52 2011
 JB> New Revision: 221346
 JB> URL: http://svn.freebsd.org/changeset/base/221346

 JB> Log:
 JB>   Handle a rare edge case with nearly full TCP receive buffers.  If a TCP
 JB>   buffer fills up causing the remote sender to enter into persist mode, but
 JB>   there is still room available in the receive buffer when a window probe
 JB>   arrives (either due to window scaling, or due to the local application
 JB>   very slowing draining data from the receive buffer), then the single byte
 JB>   of data in the window probe is accepted.  However, this can cause rcv_nxt
 JB>   to be greater than rcv_adv.  This condition will only last until the next
 JB>   ACK packet is pushed out via tcp_output(), and since the previous ACK
 JB>   advertised a zero window, the ACK should be pushed out while the TCP
 JB>   pcb is write-locked.
 JB>   
 JB>   During the window while rcv_nxt is greather than rcv_adv, a few places
 JB>   would compute the remaining receive window via rcv_adv - rcv_nxt.
 JB>   However, this value was then (uint32_t)-1.  On a 64 bit machine this
 JB>   could expand to a positive 2^32 - 1 when cast to a long.  In particular,
 JB>   when calculating the receive window in tcp_output(), the result would be
 JB>   that the receive window was computed as 2^32 - 1 resulting in advertising
 JB>   a far larger window to the remote peer than actually existed.
 JB>   
 JB>   Fix various places that compute the remaining receive window to either
 JB>   assert that it is not negative (i.e. rcv_nxt <= rcv_adv), or treat the
 JB>   window as full if rcv_nxt is greather than rcv_adv.
 JB>   
 JB>   Reviewed by:        bz
 JB>   MFC after:        1 month

 JB> Modified:
 JB>   head/sys/netinet/tcp_input.c
 JB>   head/sys/netinet/tcp_output.c
 JB>   head/sys/netinet/tcp_timewait.c

 JB> Modified: head/sys/netinet/tcp_input.c
 JB> ==============================================================================
 JB> --- head/sys/netinet/tcp_input.c        Mon May  2 21:04:37 2011        (r221345)
 JB> +++ head/sys/netinet/tcp_input.c        Mon May  2 21:05:52 2011        (r221346)
 JB> @@ -1831,6 +1831,9 @@ tcp_do_segment(struct mbuf *m, struct tc
 JB>          win = sbspace(&so->so_rcv);
 JB>          if (win < 0)
 JB>                  win = 0;
 JB> +        KASSERT(SEQ_GEQ(tp->rcv_adv, tp->rcv_nxt),
 JB> +            ("tcp_input negative window: tp %p rcv_nxt %u rcv_adv %u", tp,
 JB> +            tp->rcv_adv, tp->rcv_nxt));

I am getting this when running tests with HAST (both primary and secondary HAST
instances on the same host).

HAST is synchronizing data in MAXPHYS (131072 bytes) blocks. The sender splits
them on smaller chunks of MAX_SEND_SIZE (32768 bytes), while the receiver
receives the whole block calling recv() with MSG_WAITALL option.

FreeBSD kopusha.home.net 9.0-CURRENT FreeBSD 9.0-CURRENT #6 r221878: Sat May 14 11:44:42 EEST 2011     root@kopusha.home.net:/usr/obj/usr/src/sys/GENERIC  i386

panic: tcp_input negative window: tp 0xc9777ad0 rcv_nxt 1530593650 rcv_adv 1530593651

#0  doadump () at pcpu.h:244
#1  0xc04ddac9 in db_fncall (dummy1=-1063410006, dummy2=0, dummy3=-1, 
    dummy4=0xe67547f0 "\004Huæ") at /usr/src/sys/ddb/db_command.c:548
#2  0xc04ddeff in db_command (last_cmdp=0xc0fcdbfc, cmd_table=0x0, dopager=0)
    at /usr/src/sys/ddb/db_command.c:445
#3  0xc04ddfb4 in db_command_script (command=0xc0fceb04 "call doadump")
    at /usr/src/sys/ddb/db_command.c:516
#4  0xc04e2280 in db_script_exec (scriptname=0xe67548fc "kdb.enter.panic", 
    warnifnotfound=Variable "warnifnotfound" is not available.
) at /usr/src/sys/ddb/db_script.c:302
#5  0xc04e2367 in db_script_kdbenter (eventname=0xc0e7d4ee "panic")
    at /usr/src/sys/ddb/db_script.c:324
#6  0xc04dff98 in db_trap (type=3, code=0) at /usr/src/sys/ddb/db_main.c:228
#7  0xc09da822 in kdb_trap (type=3, code=0, tf=0xe6754a40)
    at /usr/src/sys/kern/subr_kdb.c:533
#8  0xc0ccc2eb in trap (frame=0xe6754a40) at /usr/src/sys/i386/i386/trap.c:719
#9  0xc0cb4fec in calltrap () at /usr/src/sys/i386/i386/exception.s:168
#10 0xc09da6aa in kdb_enter (why=0xc0e7d4ee "panic", msg=0xc0e7d4ee "panic")
    at cpufunc.h:71
#11 0xc09a5db4 in panic (
    fmt=0xc0ea2d6b "tcp_input negative window: tp %p rcv_nxt %u rcv_adv %u")
    at /usr/src/sys/kern/kern_shutdown.c:584
#12 0xc0b29cdb in tcp_do_segment (m=0xc9aac800, th=0xc9aac874, so=0xc9e0b680, 
    tp=0xc9777ad0, drop_hdrlen=52, tlen=1, iptos=0 '\0', ti_locked=3)
---Type <return> to continue, or q <return> to quit---
    at /usr/src/sys/netinet/tcp_input.c:1834
#13 0xc0b2d309 in tcp_input (m=0xc9aac800, off0=20)
    at /usr/src/sys/netinet/tcp_input.c:1369
#14 0xc0ac4676 in ip_input (m=0xc9aac800)
    at /usr/src/sys/netinet/ip_input.c:765
#15 0xc0a6948a in swi_net (arg=0xc1425880) at /usr/src/sys/net/netisr.c:653
#16 0xc097c675 in intr_event_execute_handlers (p=0xc65bf578, ie=0xc6608280)
    at /usr/src/sys/kern/kern_intr.c:1257
#17 0xc097d559 in ithread_loop (arg=0xc65836a0)
    at /usr/src/sys/kern/kern_intr.c:1270
#18 0xc0979928 in fork_exit (callout=0xc097d4b0 <ithread_loop>, 
    arg=0xc65836a0, frame=0xe6754d28) at /usr/src/sys/kern/kern_fork.c:920
#19 0xc0cb5064 in fork_trampoline () at /usr/src/sys/i386/i386/exception.s:275
(kgdb) fr 12
#12 0xc0b29cdb in tcp_do_segment (m=0xc9aac800, th=0xc9aac874, so=0xc9e0b680, 
    tp=0xc9777ad0, drop_hdrlen=52, tlen=1, iptos=0 '\0', ti_locked=3)
    at /usr/src/sys/netinet/tcp_input.c:1834
1834		KASSERT(SEQ_GEQ(tp->rcv_adv, tp->rcv_nxt),
(kgdb) p *tp
$1 = {
  t_segq = {
    lh_first = 0x0
  }, 
  t_pspare = {0x0, 0x0}, 
  t_segqlen = 0, 
  t_dupacks = 0, 
  t_timers = 0xc9777cbc, 
  t_inpcb = 0xc8c4cdc0, 
  t_state = 9, 
  t_flags = 525300, 
  t_vnet = 0x0, 
  snd_una = 1602773057, 
  snd_max = 1602773057, 
  snd_nxt = 1602773057, 
  snd_up = 1602773057, 
  snd_wl1 = 1530593650, 
  snd_wl2 = 1602773057, 
  iss = 1602772219, 
  irs = 1499210531, 
  rcv_nxt = 1530593651, 
  rcv_adv = 1530593650, 
  rcv_wnd = 671, 
---Type <return> to continue, or q <return> to quit---
  rcv_up = 1530593650, 
  snd_wnd = 70872, 
  snd_cwnd = 29510, 
  snd_spare1 = 0, 
  snd_ssthresh = 1073725440, 
  snd_spare2 = 0, 
  snd_recover = 1602773057, 
  t_maxopd = 16344, 
  t_rcvtime = 175861, 
  t_starttime = 171604, 
  t_rtttime = 0, 
  t_rtseq = 1602772220, 
  t_bw_spare1 = 0, 
  t_bw_spare2 = 0, 
  t_rxtcur = 31, 
  t_maxseg = 14336, 
  t_srtt = 56, 
  t_rttvar = 37, 
  t_rxtshift = 0, 
  t_rttmin = 3, 
  t_rttbest = 40, 
  t_rttupdated = 3, 
  max_sndwnd = 71680, 
---Type <return> to continue, or q <return> to quit---
  t_softerror = 0, 
  t_oobflags = 0 '\0', 
  t_iobc = 0 '\0', 
  snd_scale = 3 '\003', 
  rcv_scale = 3 '\003', 
  request_r_scale = 3 '\003', 
  ts_recent = 175361, 
  ts_recent_age = 175361, 
  ts_offset = 2326744134, 
  last_ack_sent = 1530593651, 
  snd_cwnd_prev = 0, 
  snd_ssthresh_prev = 0, 
  snd_recover_prev = 0, 
  t_sndzerowin = 4, 
  t_badrxtwin = 0, 
  snd_limited = 0 '\0', 
  snd_numholes = 0, 
  snd_holes = {
    tqh_first = 0x0, 
    tqh_last = 0xc9777bb8
  }, 
  snd_fack = 0, 
  rcv_numsacks = 0, 
---Type <return> to continue, or q <return> to quit---
  sackblks = {{
      start = 0, 
      end = 0
    }, {
      start = 0, 
      end = 0
    }, {
      start = 0, 
      end = 0
    }, {
      start = 0, 
      end = 0
    }, {
      start = 0, 
      end = 0
    }, {
      start = 0, 
      end = 0
    }}, 
  sack_newdata = 0, 
  sackhint = {
    nexthole = 0x0, 
    sack_bytes_rexmit = 0, 
---Type <return> to continue, or q <return> to quit---
    last_sack_ack = 0, 
    ispare = 0, 
    _pad = {0, 0}
  }, 
  t_rttlow = 0, 
  rfbuf_ts = 175361, 
  rfbuf_cnt = 0, 
  t_tu = 0x0, 
  t_sndrexmitpack = 0, 
  t_rcvoopack = 0, 
  t_toe = 0x0, 
  t_bytes_acked = 0, 
  cc_algo = 0xc0fa6960, 
  ccv = 0xc9777d5c, 
  osd = 0xc9777d74, 
  t_ispare = 0, 
  t_pspare2 = {0x0, 0x0, 0x0, 0x0}, 
  _pad = {0 <repeats 12 times>}
}
(kgdb) p *so
$2 = {
  so_count = 1, 
  so_type = 1, 
  so_options = 4, 
  so_linger = 0, 
  so_state = 2, 
  so_qstate = 0, 
  so_pcb = 0xc8c4cdc0, 
  so_vnet = 0x0, 
  so_proto = 0xc0fa56a8, 
  so_head = 0x0, 
  so_incomp = {
    tqh_first = 0x0, 
    tqh_last = 0x0
  }, 
  so_comp = {
    tqh_first = 0x0, 
    tqh_last = 0x0
  }, 
  so_list = {
    tqe_next = 0x0, 
    tqe_prev = 0xc9e0ab88
  }, 
---Type <return> to continue, or q <return> to quit---
  so_qlen = 0, 
  so_incqlen = 0, 
  so_qlimit = 0, 
  so_timeo = 0, 
  so_error = 0, 
  so_sigio = 0x0, 
  so_oobmark = 0, 
  so_aiojobq = {
    tqh_first = 0x0, 
    tqh_last = 0xc9e0b6cc
  }, 
  so_rcv = {
    sb_sel = {
      si_tdlist = {
        tqh_first = 0x0, 
        tqh_last = 0x0
      }, 
      si_note = {
        kl_list = {
          slh_first = 0x0
        }, 
        kl_lock = 0xc0970a90 <knlist_mtx_lock>, 
        kl_unlock = 0xc0970ac0 <knlist_mtx_unlock>, 
---Type <return> to continue, or q <return> to quit---
        kl_assert_locked = 0xc0970e10 <knlist_mtx_assert_locked>, 
        kl_assert_unlocked = 0xc0970de0 <knlist_mtx_assert_unlocked>, 
        kl_lockarg = 0xc9e0b6f8
      }, 
      si_mtx = 0x0
    }, 
    sb_mtx = {
      lock_object = {
        lo_name = 0xc0e84d6b "so_rcv", 
        lo_flags = 16973824, 
        lo_data = 0, 
        lo_witness = 0xc655a618
      }, 
      mtx_lock = 4
    }, 
    sb_sx = {
      lock_object = {
        lo_name = 0xc0e75c2c "so_rcv_sx", 
        lo_flags = 36896768, 
        lo_data = 0, 
        lo_witness = 0xc65634b0
      }, 
      sx_lock = 3356203168
---Type <return> to continue, or q <return> to quit---
    }, 
    sb_state = 0, 
    sb_mb = 0xc9aab800, 
    sb_mbtail = 0xc88f0000, 
    sb_lastrecord = 0xc9aab800, 
    sb_sndptr = 0x0, 
    sb_sndptroff = 0, 
    sb_cc = 71010, 
    sb_hiwat = 71680, 
    sb_mbcnt = 85760, 
    sb_mcnt = 23, 
    sb_ccnt = 20, 
    sb_mbmax = 262144, 
    sb_ctl = 0, 
    sb_lowat = 1, 
    sb_timeo = 2000, 
    sb_flags = 2052, 
    sb_upcall = 0, 
    sb_upcallarg = 0x0
  }, 
  so_snd = {
    sb_sel = {
      si_tdlist = {
---Type <return> to continue, or q <return> to quit---
        tqh_first = 0x0, 
        tqh_last = 0x0
      }, 
      si_note = {
        kl_list = {
          slh_first = 0x0
        }, 
        kl_lock = 0xc0970a90 <knlist_mtx_lock>, 
        kl_unlock = 0xc0970ac0 <knlist_mtx_unlock>, 
        kl_assert_locked = 0xc0970e10 <knlist_mtx_assert_locked>, 
        kl_assert_unlocked = 0xc0970de0 <knlist_mtx_assert_unlocked>, 
        kl_lockarg = 0xc9e0b78c
      }, 
      si_mtx = 0x0
    }, 
    sb_mtx = {
      lock_object = {
        lo_name = 0xc0e84d64 "so_snd", 
        lo_flags = 16973824, 
        lo_data = 0, 
        lo_witness = 0xc655a5b0
      }, 
      mtx_lock = 4
---Type <return> to continue, or q <return> to quit---
    }, 
    sb_sx = {
      lock_object = {
        lo_name = 0xc0e75c22 "so_snd_sx", 
        lo_flags = 36896768, 
        lo_data = 0, 
        lo_witness = 0xc6563448
      }, 
      sx_lock = 1
    }, 
    sb_state = 16, 
    sb_mb = 0x0, 
    sb_mbtail = 0x0, 
    sb_lastrecord = 0x0, 
    sb_sndptr = 0x0, 
    sb_sndptroff = 0, 
    sb_cc = 0, 
    sb_hiwat = 43008, 
    sb_mbcnt = 0, 
    sb_mcnt = 0, 
    sb_ccnt = 0, 
    sb_mbmax = 262144, 
    sb_ctl = 0, 
---Type <return> to continue, or q <return> to quit---
    sb_lowat = 2048, 
    sb_timeo = 2000, 
    sb_flags = 2048, 
    sb_upcall = 0, 
    sb_upcallarg = 0x0
  }, 
  so_cred = 0xc879b700, 
  so_label = 0x0, 
  so_peerlabel = 0x0, 
  so_gencnt = 1987, 
  so_emuldata = 0x0, 
  so_accf = 0x0, 
  so_fibnum = 0, 
  so_user_cookie = 0
}
(kgdb) 


 JB>          tp->rcv_wnd = imax(win, (int)(tp->rcv_adv - tp->rcv_nxt));
 JB>  
 JB>          /* Reset receive buffer auto scaling when not in bulk receive mode. */
 JB> @@ -2868,7 +2871,10 @@ dodata:                                                        /* XXX */
 JB>                   * buffer size.
 JB>                   * XXX: Unused.
 JB>                   */
 JB> -                len = so->so_rcv.sb_hiwat - (tp->rcv_adv - tp->rcv_nxt);
 JB> +                if (SEQ_GT(tp->rcv_adv, tp->rcv_nxt))
 JB> +                        len = so->so_rcv.sb_hiwat - (tp->rcv_adv - tp->rcv_nxt);
 JB> +                else
 JB> +                        len = so->so_rcv.sb_hiwat;
 JB>  #endif
 JB>          } else {
 JB>                  m_freem(m);

 JB> Modified: head/sys/netinet/tcp_output.c
 JB> ==============================================================================
 JB> --- head/sys/netinet/tcp_output.c        Mon May  2 21:04:37 2011        (r221345)
 JB> +++ head/sys/netinet/tcp_output.c        Mon May  2 21:05:52 2011        (r221346)
 JB> @@ -561,15 +561,21 @@ after_sack_rexmit:
 JB>                   * taking into account that we are limited by
 JB>                   * TCP_MAXWIN << tp->rcv_scale.
 JB>                   */
 JB> -                long adv = min(recwin, (long)TCP_MAXWIN << tp->rcv_scale) -
 JB> -                        (tp->rcv_adv - tp->rcv_nxt);
 JB> +                long adv;
 JB> +                int oldwin;
 JB> +
 JB> +                adv = min(recwin, (long)TCP_MAXWIN << tp->rcv_scale);
 JB> +                if (SEQ_GT(tp->rcv_adv, tp->rcv_nxt)) {
 JB> +                        oldwin = (tp->rcv_adv - tp->rcv_nxt);
 JB> +                        adv -= oldwin;
 JB> +                } else
 JB> +                        oldwin = 0;
 JB>  
 JB>                  /* 
 JB>                   * If the new window size ends up being the same as the old
 JB>                   * size when it is scaled, then don't force a window update.
 JB>                   */
 JB> -                if ((tp->rcv_adv - tp->rcv_nxt) >> tp->rcv_scale ==
 JB> -                    (adv + tp->rcv_adv - tp->rcv_nxt) >> tp->rcv_scale)
 JB> +                if (oldwin >> tp->rcv_scale == (adv + oldwin) >> tp->rcv_scale)
 JB>                          goto dontupdate;
 JB>                  if (adv >= (long) (2 * tp->t_maxseg))
 JB>                          goto send;
 JB> @@ -1008,7 +1014,8 @@ send:
 JB>          if (recwin < (long)(so->so_rcv.sb_hiwat / 4) &&
 JB>              recwin < (long)tp->t_maxseg)
 JB>                  recwin = 0;
 JB> -        if (recwin < (long)(tp->rcv_adv - tp->rcv_nxt))
 JB> +        if (SEQ_GT(tp->rcv_adv, tp->rcv_nxt) &&
 JB> +            recwin < (long)(tp->rcv_adv - tp->rcv_nxt))
 JB>                  recwin = (long)(tp->rcv_adv - tp->rcv_nxt);
 JB>          if (recwin > (long)TCP_MAXWIN << tp->rcv_scale)
 JB>                  recwin = (long)TCP_MAXWIN << tp->rcv_scale;

 JB> Modified: head/sys/netinet/tcp_timewait.c
 JB> ==============================================================================
 JB> --- head/sys/netinet/tcp_timewait.c        Mon May  2 21:04:37 2011        (r221345)
 JB> +++ head/sys/netinet/tcp_timewait.c        Mon May  2 21:05:52 2011        (r221346)
 JB> @@ -242,6 +242,9 @@ tcp_twstart(struct tcpcb *tp)
 JB>          /*
 JB>           * Recover last window size sent.
 JB>           */
 JB> +        KASSERT(SEQ_GEQ(tp->rcv_adv, tp->rcv_nxt),
 JB> +            ("tcp_twstart negative window: tp %p rcv_nxt %u rcv_adv %u", tp,
 JB> +            tp->rcv_adv, tp->rcv_nxt));
 JB>          tw->last_win = (tp->rcv_adv - tp->rcv_nxt) >> tp->rcv_scale;
 JB>  
 JB>          /*

-- 
Mikolaj Golub



Want to link to this message? Use this URL: <https://mail-archive.FreeBSD.org/cgi/mid.cgi?86pqnlbmao.fsf>