From owner-dev-commits-src-all@freebsd.org Wed Jun 9 00:01:17 2021 Return-Path: Delivered-To: dev-commits-src-all@mailman.nyi.freebsd.org Received: from mx1.freebsd.org (mx1.freebsd.org [IPv6:2610:1c1:1:606c::19:1]) by mailman.nyi.freebsd.org (Postfix) with ESMTP id 7B6F9640CB4; Wed, 9 Jun 2021 00:01:17 +0000 (UTC) (envelope-from git@FreeBSD.org) Received: from mxrelay.nyi.freebsd.org (mxrelay.nyi.freebsd.org [IPv6:2610:1c1:1:606c::19:3]) (using TLSv1.3 with cipher TLS_AES_256_GCM_SHA384 (256/256 bits) key-exchange X25519 server-signature RSA-PSS (4096 bits) server-digest SHA256 client-signature RSA-PSS (4096 bits) client-digest SHA256) (Client CN "mxrelay.nyi.freebsd.org", Issuer "R3" (verified OK)) by mx1.freebsd.org (Postfix) with ESMTPS id 4G06hY35wNz3lWs; Wed, 9 Jun 2021 00:01:17 +0000 (UTC) (envelope-from git@FreeBSD.org) Received: from gitrepo.freebsd.org (gitrepo.freebsd.org [IPv6:2610:1c1:1:6068::e6a:5]) (using TLSv1.3 with cipher TLS_AES_256_GCM_SHA384 (256/256 bits) key-exchange X25519 server-signature RSA-PSS (4096 bits) server-digest SHA256) (Client did not present a certificate) by mxrelay.nyi.freebsd.org (Postfix) with ESMTPS id 510C112361; Wed, 9 Jun 2021 00:01:17 +0000 (UTC) (envelope-from git@FreeBSD.org) Received: from gitrepo.freebsd.org ([127.0.1.44]) by gitrepo.freebsd.org (8.16.1/8.16.1) with ESMTP id 15901Hgh053521; Wed, 9 Jun 2021 00:01:17 GMT (envelope-from git@gitrepo.freebsd.org) Received: (from git@localhost) by gitrepo.freebsd.org (8.16.1/8.16.1/Submit) id 15901HgD053520; Wed, 9 Jun 2021 00:01:17 GMT (envelope-from git) Date: Wed, 9 Jun 2021 00:01:17 GMT Message-Id: <202106090001.15901HgD053520@gitrepo.freebsd.org> To: src-committers@FreeBSD.org, dev-commits-src-all@FreeBSD.org, dev-commits-src-branches@FreeBSD.org From: Michael Tuexen Subject: git: 87cf5dcc3335 - stable/13 - tcp:Host cache and rack ending up with incorrect values. MIME-Version: 1.0 Content-Type: text/plain; charset=utf-8 Content-Transfer-Encoding: 8bit X-Git-Committer: tuexen X-Git-Repository: src X-Git-Refname: refs/heads/stable/13 X-Git-Reftype: branch X-Git-Commit: 87cf5dcc3335de32661ae75471f29bc387de194e Auto-Submitted: auto-generated X-BeenThere: dev-commits-src-all@freebsd.org X-Mailman-Version: 2.1.34 Precedence: list List-Id: Commit messages for all branches of the src repository List-Unsubscribe: , List-Archive: List-Post: List-Help: List-Subscribe: , X-List-Received-Date: Wed, 09 Jun 2021 00:01:17 -0000 The branch stable/13 has been updated by tuexen: URL: https://cgit.FreeBSD.org/src/commit/?id=87cf5dcc3335de32661ae75471f29bc387de194e commit 87cf5dcc3335de32661ae75471f29bc387de194e Author: Randall Stewart AuthorDate: 2021-05-10 15:25:51 +0000 Commit: Michael Tuexen CommitDate: 2021-06-09 00:00:27 +0000 tcp:Host cache and rack ending up with incorrect values. The hostcache up to now as been updated in the discard callback but without checking if we are all done (the race where there are more than one calls and the counter has not yet reached zero). This means that when the race occurs, we end up calling the hc_upate more than once. Also alternate stacks can keep there srtt/rttvar in different formats (example rack keeps its values in microseconds). Since we call the hc_update *before* the stack fini() then the values will be in the wrong format. Rack on the other hand, needs to convert items pulled from the hostcache into its internal format else it may end up with very much incorrect values from the hostcache. In the process lets commonize the update mechanism for srtt/rttvar since we now have more than one place that needs to call it. Reviewed by: Michael Tuexen Sponsored by: Netflix Inc Differential Revision: https://reviews.freebsd.org/D30172 (cherry picked from commit 9867224bab3f247ac875d89c2472aa4bc855fe3b) --- sys/netinet/tcp_stacks/rack.c | 102 ++++++++++++++++++++---------------- sys/netinet/tcp_subr.c | 118 ++++++++++++++++++++++-------------------- 2 files changed, 119 insertions(+), 101 deletions(-) diff --git a/sys/netinet/tcp_stacks/rack.c b/sys/netinet/tcp_stacks/rack.c index 110304d29cc0..7bb77d8158af 100644 --- a/sys/netinet/tcp_stacks/rack.c +++ b/sys/netinet/tcp_stacks/rack.c @@ -6563,13 +6563,65 @@ rack_remxt_tmr(struct tcpcb *tp) rack->r_ctl.rc_snd_max_at_rto = tp->snd_max; } +static void +rack_convert_rtts(struct tcpcb *tp) +{ + if (tp->t_srtt > 1) { + uint32_t val, frac; + + val = tp->t_srtt >> TCP_RTT_SHIFT; + frac = tp->t_srtt & 0x1f; + tp->t_srtt = TICKS_2_USEC(val); + /* + * frac is the fractional part of the srtt (if any) + * but its in ticks and every bit represents + * 1/32nd of a hz. + */ + if (frac) { + if (hz == 1000) { + frac = (((uint64_t)frac * (uint64_t)HPTS_USEC_IN_MSEC) / (uint64_t)TCP_RTT_SCALE); + } else { + frac = (((uint64_t)frac * (uint64_t)HPTS_USEC_IN_SEC) / ((uint64_t)(hz) * (uint64_t)TCP_RTT_SCALE)); + } + tp->t_srtt += frac; + } + } + if (tp->t_rttvar) { + uint32_t val, frac; + + val = tp->t_rttvar >> TCP_RTTVAR_SHIFT; + frac = tp->t_rttvar & 0x1f; + tp->t_rttvar = TICKS_2_USEC(val); + /* + * frac is the fractional part of the srtt (if any) + * but its in ticks and every bit represents + * 1/32nd of a hz. + */ + if (frac) { + if (hz == 1000) { + frac = (((uint64_t)frac * (uint64_t)HPTS_USEC_IN_MSEC) / (uint64_t)TCP_RTT_SCALE); + } else { + frac = (((uint64_t)frac * (uint64_t)HPTS_USEC_IN_SEC) / ((uint64_t)(hz) * (uint64_t)TCP_RTT_SCALE)); + } + tp->t_rttvar += frac; + } + } + RACK_TCPT_RANGESET(tp->t_rxtcur, RACK_REXMTVAL(tp), + rack_rto_min, rack_rto_max); +} + static void rack_cc_conn_init(struct tcpcb *tp) { struct tcp_rack *rack; rack = (struct tcp_rack *)tp->t_fb_ptr; + cc_conn_init(tp); + /* + * Now convert to rack's internal format. + */ + rack_convert_rtts(tp); /* * We want a chance to stay in slowstart as * we create a connection. TCP spec says that @@ -11916,9 +11968,9 @@ rack_init_fsb_block(struct tcpcb *tp, struct tcp_rack *rack) static int rack_init_fsb(struct tcpcb *tp, struct tcp_rack *rack) { - /* - * Allocate the larger of spaces V6 if available else just - * V4 and include udphdr (overbook) + /* + * Allocate the larger of spaces V6 if available else just + * V4 and include udphdr (overbook) */ #ifdef INET6 rack->r_ctl.fsb.tcp_ip_hdr_len = sizeof(struct ip6_hdr) + sizeof(struct tcphdr) + sizeof(struct udphdr); @@ -12147,47 +12199,7 @@ rack_init(struct tcpcb *tp) * bit decimal so we have to carefully convert * these to get the full precision. */ - if (tp->t_srtt > 1) { - uint32_t val, frac; - - val = tp->t_srtt >> TCP_RTT_SHIFT; - frac = tp->t_srtt & 0x1f; - tp->t_srtt = TICKS_2_USEC(val); - /* - * frac is the fractional part of the srtt (if any) - * but its in ticks and every bit represents - * 1/32nd of a hz. - */ - if (frac) { - if (hz == 1000) { - frac = (((uint64_t)frac * (uint64_t)HPTS_USEC_IN_MSEC) / (uint64_t)TCP_RTT_SCALE); - } else { - frac = (((uint64_t)frac * (uint64_t)HPTS_USEC_IN_SEC) / ((uint64_t)(hz) * (uint64_t)TCP_RTT_SCALE)); - } - tp->t_srtt += frac; - } - } - if (tp->t_rttvar) { - uint32_t val, frac; - - val = tp->t_rttvar >> TCP_RTTVAR_SHIFT; - frac = tp->t_rttvar & 0x1f; - tp->t_rttvar = TICKS_2_USEC(val); - /* - * frac is the fractional part of the srtt (if any) - * but its in ticks and every bit represents - * 1/32nd of a hz. - */ - if (frac) { - if (hz == 1000) { - frac = (((uint64_t)frac * (uint64_t)HPTS_USEC_IN_MSEC) / (uint64_t)TCP_RTT_SCALE); - } else { - frac = (((uint64_t)frac * (uint64_t)HPTS_USEC_IN_SEC) / ((uint64_t)(hz) * (uint64_t)TCP_RTT_SCALE)); - } - tp->t_rttvar += frac; - } - } - tp->t_rxtcur = TICKS_2_USEC(tp->t_rxtcur); + rack_convert_rtts(tp); tp->t_rttlow = TICKS_2_USEC(tp->t_rttlow); if (rack_def_profile) rack_set_profile(rack, rack_def_profile); @@ -12230,7 +12242,7 @@ rack_init(struct tcpcb *tp) rack_stop_all_timers(tp); /* Lets setup the fsb block */ rack_start_hpts_timer(rack, tp, tcp_get_usecs(NULL), 0, 0, 0); - rack_log_rtt_shrinks(rack, us_cts, 0, + rack_log_rtt_shrinks(rack, us_cts, tp->t_rxtcur, __LINE__, RACK_RTTS_INIT); return (0); } diff --git a/sys/netinet/tcp_subr.c b/sys/netinet/tcp_subr.c index 93445c636d41..b9da908d2a15 100644 --- a/sys/netinet/tcp_subr.c +++ b/sys/netinet/tcp_subr.c @@ -2305,62 +2305,6 @@ tcp_discardcb(struct tcpcb *tp) tp->t_fb->tfb_tcp_timer_stop_all(tp); } - /* - * If we got enough samples through the srtt filter, - * save the rtt and rttvar in the routing entry. - * 'Enough' is arbitrarily defined as 4 rtt samples. - * 4 samples is enough for the srtt filter to converge - * to within enough % of the correct value; fewer samples - * and we could save a bogus rtt. The danger is not high - * as tcp quickly recovers from everything. - * XXX: Works very well but needs some more statistics! - */ - if (tp->t_rttupdated >= 4) { - struct hc_metrics_lite metrics; - uint32_t ssthresh; - - bzero(&metrics, sizeof(metrics)); - /* - * Update the ssthresh always when the conditions below - * are satisfied. This gives us better new start value - * for the congestion avoidance for new connections. - * ssthresh is only set if packet loss occurred on a session. - * - * XXXRW: 'so' may be NULL here, and/or socket buffer may be - * being torn down. Ideally this code would not use 'so'. - */ - ssthresh = tp->snd_ssthresh; - if (ssthresh != 0 && ssthresh < so->so_snd.sb_hiwat / 2) { - /* - * convert the limit from user data bytes to - * packets then to packet data bytes. - */ - ssthresh = (ssthresh + tp->t_maxseg / 2) / tp->t_maxseg; - if (ssthresh < 2) - ssthresh = 2; - ssthresh *= (tp->t_maxseg + -#ifdef INET6 - (isipv6 ? sizeof (struct ip6_hdr) + - sizeof (struct tcphdr) : -#endif - sizeof (struct tcpiphdr) -#ifdef INET6 - ) -#endif - ); - } else - ssthresh = 0; - metrics.rmx_ssthresh = ssthresh; - - metrics.rmx_rtt = tp->t_srtt; - metrics.rmx_rttvar = tp->t_rttvar; - metrics.rmx_cwnd = tp->snd_cwnd; - metrics.rmx_sendpipe = 0; - metrics.rmx_recvpipe = 0; - - tcp_hc_update(&inp->inp_inc, &metrics); - } - /* free the reassembly queue, if any */ tcp_reass_flush(tp); @@ -2400,6 +2344,68 @@ tcp_discardcb(struct tcpcb *tp) TCPSTATES_DEC(tp->t_state); if (tp->t_fb->tfb_tcp_fb_fini) (*tp->t_fb->tfb_tcp_fb_fini)(tp, 1); + + /* + * If we got enough samples through the srtt filter, + * save the rtt and rttvar in the routing entry. + * 'Enough' is arbitrarily defined as 4 rtt samples. + * 4 samples is enough for the srtt filter to converge + * to within enough % of the correct value; fewer samples + * and we could save a bogus rtt. The danger is not high + * as tcp quickly recovers from everything. + * XXX: Works very well but needs some more statistics! + * + * XXXRRS: Updating must be after the stack fini() since + * that may be converting some internal representation of + * say srtt etc into the general one used by other stacks. + * Lets also at least protect against the so being NULL + * as RW stated below. + */ + if ((tp->t_rttupdated >= 4) && (so != NULL)) { + struct hc_metrics_lite metrics; + uint32_t ssthresh; + + bzero(&metrics, sizeof(metrics)); + /* + * Update the ssthresh always when the conditions below + * are satisfied. This gives us better new start value + * for the congestion avoidance for new connections. + * ssthresh is only set if packet loss occurred on a session. + * + * XXXRW: 'so' may be NULL here, and/or socket buffer may be + * being torn down. Ideally this code would not use 'so'. + */ + ssthresh = tp->snd_ssthresh; + if (ssthresh != 0 && ssthresh < so->so_snd.sb_hiwat / 2) { + /* + * convert the limit from user data bytes to + * packets then to packet data bytes. + */ + ssthresh = (ssthresh + tp->t_maxseg / 2) / tp->t_maxseg; + if (ssthresh < 2) + ssthresh = 2; + ssthresh *= (tp->t_maxseg + +#ifdef INET6 + (isipv6 ? sizeof (struct ip6_hdr) + + sizeof (struct tcphdr) : +#endif + sizeof (struct tcpiphdr) +#ifdef INET6 + ) +#endif + ); + } else + ssthresh = 0; + metrics.rmx_ssthresh = ssthresh; + + metrics.rmx_rtt = tp->t_srtt; + metrics.rmx_rttvar = tp->t_rttvar; + metrics.rmx_cwnd = tp->snd_cwnd; + metrics.rmx_sendpipe = 0; + metrics.rmx_recvpipe = 0; + + tcp_hc_update(&inp->inp_inc, &metrics); + } refcount_release(&tp->t_fb->tfb_refcnt); tp->t_inpcb = NULL; uma_zfree(V_tcpcb_zone, tp);