Date: Sat, 14 May 2011 13:37:19 +0300 From: Mikolaj Golub <trociny@freebsd.org> To: John Baldwin <jhb@FreeBSD.org> Cc: svn-src-head@freebsd.org, svn-src-all@freebsd.org, src-committers@freebsd.org Subject: Re: svn commit: r221346 - head/sys/netinet Message-ID: <86pqnlbmao.fsf@kopusha.home.net> In-Reply-To: <201105022105.p42L5q3j054498@svn.freebsd.org> (John Baldwin's message of "Mon, 2 May 2011 21:05:52 %2B0000 (UTC)") References: <201105022105.p42L5q3j054498@svn.freebsd.org>
next in thread | previous in thread | raw e-mail | index | archive | help
Hi, On Mon, 2 May 2011 21:05:52 +0000 (UTC) John Baldwin wrote: JB> Author: jhb JB> Date: Mon May 2 21:05:52 2011 JB> New Revision: 221346 JB> URL: http://svn.freebsd.org/changeset/base/221346 JB> Log: JB> Handle a rare edge case with nearly full TCP receive buffers. If a TCP JB> buffer fills up causing the remote sender to enter into persist mode, but JB> there is still room available in the receive buffer when a window probe JB> arrives (either due to window scaling, or due to the local application JB> very slowing draining data from the receive buffer), then the single byte JB> of data in the window probe is accepted. However, this can cause rcv_nxt JB> to be greater than rcv_adv. This condition will only last until the next JB> ACK packet is pushed out via tcp_output(), and since the previous ACK JB> advertised a zero window, the ACK should be pushed out while the TCP JB> pcb is write-locked. JB> JB> During the window while rcv_nxt is greather than rcv_adv, a few places JB> would compute the remaining receive window via rcv_adv - rcv_nxt. JB> However, this value was then (uint32_t)-1. On a 64 bit machine this JB> could expand to a positive 2^32 - 1 when cast to a long. In particular, JB> when calculating the receive window in tcp_output(), the result would be JB> that the receive window was computed as 2^32 - 1 resulting in advertising JB> a far larger window to the remote peer than actually existed. JB> JB> Fix various places that compute the remaining receive window to either JB> assert that it is not negative (i.e. rcv_nxt <= rcv_adv), or treat the JB> window as full if rcv_nxt is greather than rcv_adv. JB> JB> Reviewed by: bz JB> MFC after: 1 month JB> Modified: JB> head/sys/netinet/tcp_input.c JB> head/sys/netinet/tcp_output.c JB> head/sys/netinet/tcp_timewait.c JB> Modified: head/sys/netinet/tcp_input.c JB> ============================================================================== JB> --- head/sys/netinet/tcp_input.c Mon May 2 21:04:37 2011 (r221345) JB> +++ head/sys/netinet/tcp_input.c Mon May 2 21:05:52 2011 (r221346) JB> @@ -1831,6 +1831,9 @@ tcp_do_segment(struct mbuf *m, struct tc JB> win = sbspace(&so->so_rcv); JB> if (win < 0) JB> win = 0; JB> + KASSERT(SEQ_GEQ(tp->rcv_adv, tp->rcv_nxt), JB> + ("tcp_input negative window: tp %p rcv_nxt %u rcv_adv %u", tp, JB> + tp->rcv_adv, tp->rcv_nxt)); I am getting this when running tests with HAST (both primary and secondary HAST instances on the same host). HAST is synchronizing data in MAXPHYS (131072 bytes) blocks. The sender splits them on smaller chunks of MAX_SEND_SIZE (32768 bytes), while the receiver receives the whole block calling recv() with MSG_WAITALL option. FreeBSD kopusha.home.net 9.0-CURRENT FreeBSD 9.0-CURRENT #6 r221878: Sat May 14 11:44:42 EEST 2011 root@kopusha.home.net:/usr/obj/usr/src/sys/GENERIC i386 panic: tcp_input negative window: tp 0xc9777ad0 rcv_nxt 1530593650 rcv_adv 1530593651 #0 doadump () at pcpu.h:244 #1 0xc04ddac9 in db_fncall (dummy1=-1063410006, dummy2=0, dummy3=-1, dummy4=0xe67547f0 "\004Huæ") at /usr/src/sys/ddb/db_command.c:548 #2 0xc04ddeff in db_command (last_cmdp=0xc0fcdbfc, cmd_table=0x0, dopager=0) at /usr/src/sys/ddb/db_command.c:445 #3 0xc04ddfb4 in db_command_script (command=0xc0fceb04 "call doadump") at /usr/src/sys/ddb/db_command.c:516 #4 0xc04e2280 in db_script_exec (scriptname=0xe67548fc "kdb.enter.panic", warnifnotfound=Variable "warnifnotfound" is not available. ) at /usr/src/sys/ddb/db_script.c:302 #5 0xc04e2367 in db_script_kdbenter (eventname=0xc0e7d4ee "panic") at /usr/src/sys/ddb/db_script.c:324 #6 0xc04dff98 in db_trap (type=3, code=0) at /usr/src/sys/ddb/db_main.c:228 #7 0xc09da822 in kdb_trap (type=3, code=0, tf=0xe6754a40) at /usr/src/sys/kern/subr_kdb.c:533 #8 0xc0ccc2eb in trap (frame=0xe6754a40) at /usr/src/sys/i386/i386/trap.c:719 #9 0xc0cb4fec in calltrap () at /usr/src/sys/i386/i386/exception.s:168 #10 0xc09da6aa in kdb_enter (why=0xc0e7d4ee "panic", msg=0xc0e7d4ee "panic") at cpufunc.h:71 #11 0xc09a5db4 in panic ( fmt=0xc0ea2d6b "tcp_input negative window: tp %p rcv_nxt %u rcv_adv %u") at /usr/src/sys/kern/kern_shutdown.c:584 #12 0xc0b29cdb in tcp_do_segment (m=0xc9aac800, th=0xc9aac874, so=0xc9e0b680, tp=0xc9777ad0, drop_hdrlen=52, tlen=1, iptos=0 '\0', ti_locked=3) ---Type <return> to continue, or q <return> to quit--- at /usr/src/sys/netinet/tcp_input.c:1834 #13 0xc0b2d309 in tcp_input (m=0xc9aac800, off0=20) at /usr/src/sys/netinet/tcp_input.c:1369 #14 0xc0ac4676 in ip_input (m=0xc9aac800) at /usr/src/sys/netinet/ip_input.c:765 #15 0xc0a6948a in swi_net (arg=0xc1425880) at /usr/src/sys/net/netisr.c:653 #16 0xc097c675 in intr_event_execute_handlers (p=0xc65bf578, ie=0xc6608280) at /usr/src/sys/kern/kern_intr.c:1257 #17 0xc097d559 in ithread_loop (arg=0xc65836a0) at /usr/src/sys/kern/kern_intr.c:1270 #18 0xc0979928 in fork_exit (callout=0xc097d4b0 <ithread_loop>, arg=0xc65836a0, frame=0xe6754d28) at /usr/src/sys/kern/kern_fork.c:920 #19 0xc0cb5064 in fork_trampoline () at /usr/src/sys/i386/i386/exception.s:275 (kgdb) fr 12 #12 0xc0b29cdb in tcp_do_segment (m=0xc9aac800, th=0xc9aac874, so=0xc9e0b680, tp=0xc9777ad0, drop_hdrlen=52, tlen=1, iptos=0 '\0', ti_locked=3) at /usr/src/sys/netinet/tcp_input.c:1834 1834 KASSERT(SEQ_GEQ(tp->rcv_adv, tp->rcv_nxt), (kgdb) p *tp $1 = { t_segq = { lh_first = 0x0 }, t_pspare = {0x0, 0x0}, t_segqlen = 0, t_dupacks = 0, t_timers = 0xc9777cbc, t_inpcb = 0xc8c4cdc0, t_state = 9, t_flags = 525300, t_vnet = 0x0, snd_una = 1602773057, snd_max = 1602773057, snd_nxt = 1602773057, snd_up = 1602773057, snd_wl1 = 1530593650, snd_wl2 = 1602773057, iss = 1602772219, irs = 1499210531, rcv_nxt = 1530593651, rcv_adv = 1530593650, rcv_wnd = 671, ---Type <return> to continue, or q <return> to quit--- rcv_up = 1530593650, snd_wnd = 70872, snd_cwnd = 29510, snd_spare1 = 0, snd_ssthresh = 1073725440, snd_spare2 = 0, snd_recover = 1602773057, t_maxopd = 16344, t_rcvtime = 175861, t_starttime = 171604, t_rtttime = 0, t_rtseq = 1602772220, t_bw_spare1 = 0, t_bw_spare2 = 0, t_rxtcur = 31, t_maxseg = 14336, t_srtt = 56, t_rttvar = 37, t_rxtshift = 0, t_rttmin = 3, t_rttbest = 40, t_rttupdated = 3, max_sndwnd = 71680, ---Type <return> to continue, or q <return> to quit--- t_softerror = 0, t_oobflags = 0 '\0', t_iobc = 0 '\0', snd_scale = 3 '\003', rcv_scale = 3 '\003', request_r_scale = 3 '\003', ts_recent = 175361, ts_recent_age = 175361, ts_offset = 2326744134, last_ack_sent = 1530593651, snd_cwnd_prev = 0, snd_ssthresh_prev = 0, snd_recover_prev = 0, t_sndzerowin = 4, t_badrxtwin = 0, snd_limited = 0 '\0', snd_numholes = 0, snd_holes = { tqh_first = 0x0, tqh_last = 0xc9777bb8 }, snd_fack = 0, rcv_numsacks = 0, ---Type <return> to continue, or q <return> to quit--- sackblks = {{ start = 0, end = 0 }, { start = 0, end = 0 }, { start = 0, end = 0 }, { start = 0, end = 0 }, { start = 0, end = 0 }, { start = 0, end = 0 }}, sack_newdata = 0, sackhint = { nexthole = 0x0, sack_bytes_rexmit = 0, ---Type <return> to continue, or q <return> to quit--- last_sack_ack = 0, ispare = 0, _pad = {0, 0} }, t_rttlow = 0, rfbuf_ts = 175361, rfbuf_cnt = 0, t_tu = 0x0, t_sndrexmitpack = 0, t_rcvoopack = 0, t_toe = 0x0, t_bytes_acked = 0, cc_algo = 0xc0fa6960, ccv = 0xc9777d5c, osd = 0xc9777d74, t_ispare = 0, t_pspare2 = {0x0, 0x0, 0x0, 0x0}, _pad = {0 <repeats 12 times>} } (kgdb) p *so $2 = { so_count = 1, so_type = 1, so_options = 4, so_linger = 0, so_state = 2, so_qstate = 0, so_pcb = 0xc8c4cdc0, so_vnet = 0x0, so_proto = 0xc0fa56a8, so_head = 0x0, so_incomp = { tqh_first = 0x0, tqh_last = 0x0 }, so_comp = { tqh_first = 0x0, tqh_last = 0x0 }, so_list = { tqe_next = 0x0, tqe_prev = 0xc9e0ab88 }, ---Type <return> to continue, or q <return> to quit--- so_qlen = 0, so_incqlen = 0, so_qlimit = 0, so_timeo = 0, so_error = 0, so_sigio = 0x0, so_oobmark = 0, so_aiojobq = { tqh_first = 0x0, tqh_last = 0xc9e0b6cc }, so_rcv = { sb_sel = { si_tdlist = { tqh_first = 0x0, tqh_last = 0x0 }, si_note = { kl_list = { slh_first = 0x0 }, kl_lock = 0xc0970a90 <knlist_mtx_lock>, kl_unlock = 0xc0970ac0 <knlist_mtx_unlock>, ---Type <return> to continue, or q <return> to quit--- kl_assert_locked = 0xc0970e10 <knlist_mtx_assert_locked>, kl_assert_unlocked = 0xc0970de0 <knlist_mtx_assert_unlocked>, kl_lockarg = 0xc9e0b6f8 }, si_mtx = 0x0 }, sb_mtx = { lock_object = { lo_name = 0xc0e84d6b "so_rcv", lo_flags = 16973824, lo_data = 0, lo_witness = 0xc655a618 }, mtx_lock = 4 }, sb_sx = { lock_object = { lo_name = 0xc0e75c2c "so_rcv_sx", lo_flags = 36896768, lo_data = 0, lo_witness = 0xc65634b0 }, sx_lock = 3356203168 ---Type <return> to continue, or q <return> to quit--- }, sb_state = 0, sb_mb = 0xc9aab800, sb_mbtail = 0xc88f0000, sb_lastrecord = 0xc9aab800, sb_sndptr = 0x0, sb_sndptroff = 0, sb_cc = 71010, sb_hiwat = 71680, sb_mbcnt = 85760, sb_mcnt = 23, sb_ccnt = 20, sb_mbmax = 262144, sb_ctl = 0, sb_lowat = 1, sb_timeo = 2000, sb_flags = 2052, sb_upcall = 0, sb_upcallarg = 0x0 }, so_snd = { sb_sel = { si_tdlist = { ---Type <return> to continue, or q <return> to quit--- tqh_first = 0x0, tqh_last = 0x0 }, si_note = { kl_list = { slh_first = 0x0 }, kl_lock = 0xc0970a90 <knlist_mtx_lock>, kl_unlock = 0xc0970ac0 <knlist_mtx_unlock>, kl_assert_locked = 0xc0970e10 <knlist_mtx_assert_locked>, kl_assert_unlocked = 0xc0970de0 <knlist_mtx_assert_unlocked>, kl_lockarg = 0xc9e0b78c }, si_mtx = 0x0 }, sb_mtx = { lock_object = { lo_name = 0xc0e84d64 "so_snd", lo_flags = 16973824, lo_data = 0, lo_witness = 0xc655a5b0 }, mtx_lock = 4 ---Type <return> to continue, or q <return> to quit--- }, sb_sx = { lock_object = { lo_name = 0xc0e75c22 "so_snd_sx", lo_flags = 36896768, lo_data = 0, lo_witness = 0xc6563448 }, sx_lock = 1 }, sb_state = 16, sb_mb = 0x0, sb_mbtail = 0x0, sb_lastrecord = 0x0, sb_sndptr = 0x0, sb_sndptroff = 0, sb_cc = 0, sb_hiwat = 43008, sb_mbcnt = 0, sb_mcnt = 0, sb_ccnt = 0, sb_mbmax = 262144, sb_ctl = 0, ---Type <return> to continue, or q <return> to quit--- sb_lowat = 2048, sb_timeo = 2000, sb_flags = 2048, sb_upcall = 0, sb_upcallarg = 0x0 }, so_cred = 0xc879b700, so_label = 0x0, so_peerlabel = 0x0, so_gencnt = 1987, so_emuldata = 0x0, so_accf = 0x0, so_fibnum = 0, so_user_cookie = 0 } (kgdb) JB> tp->rcv_wnd = imax(win, (int)(tp->rcv_adv - tp->rcv_nxt)); JB> JB> /* Reset receive buffer auto scaling when not in bulk receive mode. */ JB> @@ -2868,7 +2871,10 @@ dodata: /* XXX */ JB> * buffer size. JB> * XXX: Unused. JB> */ JB> - len = so->so_rcv.sb_hiwat - (tp->rcv_adv - tp->rcv_nxt); JB> + if (SEQ_GT(tp->rcv_adv, tp->rcv_nxt)) JB> + len = so->so_rcv.sb_hiwat - (tp->rcv_adv - tp->rcv_nxt); JB> + else JB> + len = so->so_rcv.sb_hiwat; JB> #endif JB> } else { JB> m_freem(m); JB> Modified: head/sys/netinet/tcp_output.c JB> ============================================================================== JB> --- head/sys/netinet/tcp_output.c Mon May 2 21:04:37 2011 (r221345) JB> +++ head/sys/netinet/tcp_output.c Mon May 2 21:05:52 2011 (r221346) JB> @@ -561,15 +561,21 @@ after_sack_rexmit: JB> * taking into account that we are limited by JB> * TCP_MAXWIN << tp->rcv_scale. JB> */ JB> - long adv = min(recwin, (long)TCP_MAXWIN << tp->rcv_scale) - JB> - (tp->rcv_adv - tp->rcv_nxt); JB> + long adv; JB> + int oldwin; JB> + JB> + adv = min(recwin, (long)TCP_MAXWIN << tp->rcv_scale); JB> + if (SEQ_GT(tp->rcv_adv, tp->rcv_nxt)) { JB> + oldwin = (tp->rcv_adv - tp->rcv_nxt); JB> + adv -= oldwin; JB> + } else JB> + oldwin = 0; JB> JB> /* JB> * If the new window size ends up being the same as the old JB> * size when it is scaled, then don't force a window update. JB> */ JB> - if ((tp->rcv_adv - tp->rcv_nxt) >> tp->rcv_scale == JB> - (adv + tp->rcv_adv - tp->rcv_nxt) >> tp->rcv_scale) JB> + if (oldwin >> tp->rcv_scale == (adv + oldwin) >> tp->rcv_scale) JB> goto dontupdate; JB> if (adv >= (long) (2 * tp->t_maxseg)) JB> goto send; JB> @@ -1008,7 +1014,8 @@ send: JB> if (recwin < (long)(so->so_rcv.sb_hiwat / 4) && JB> recwin < (long)tp->t_maxseg) JB> recwin = 0; JB> - if (recwin < (long)(tp->rcv_adv - tp->rcv_nxt)) JB> + if (SEQ_GT(tp->rcv_adv, tp->rcv_nxt) && JB> + recwin < (long)(tp->rcv_adv - tp->rcv_nxt)) JB> recwin = (long)(tp->rcv_adv - tp->rcv_nxt); JB> if (recwin > (long)TCP_MAXWIN << tp->rcv_scale) JB> recwin = (long)TCP_MAXWIN << tp->rcv_scale; JB> Modified: head/sys/netinet/tcp_timewait.c JB> ============================================================================== JB> --- head/sys/netinet/tcp_timewait.c Mon May 2 21:04:37 2011 (r221345) JB> +++ head/sys/netinet/tcp_timewait.c Mon May 2 21:05:52 2011 (r221346) JB> @@ -242,6 +242,9 @@ tcp_twstart(struct tcpcb *tp) JB> /* JB> * Recover last window size sent. JB> */ JB> + KASSERT(SEQ_GEQ(tp->rcv_adv, tp->rcv_nxt), JB> + ("tcp_twstart negative window: tp %p rcv_nxt %u rcv_adv %u", tp, JB> + tp->rcv_adv, tp->rcv_nxt)); JB> tw->last_win = (tp->rcv_adv - tp->rcv_nxt) >> tp->rcv_scale; JB> JB> /* -- Mikolaj Golub
Want to link to this message? Use this URL: <https://mail-archive.FreeBSD.org/cgi/mid.cgi?86pqnlbmao.fsf>