Date:      Tue, 2 May 2023 23:49:45 -0700
From:      Chen Shuo <chenshuo@chenshuo.com>
To:        Hans Petter Selasky <hps@selasky.org>
Cc:        freebsd-net@freebsd.org
Subject:   Re: Cwnd grows slowly during slow-start due to LRO of the receiver side.
Message-ID:  <CAKZ7KuLdXpTU2%2BoMHJipXo6Frx=AocMk7oCjymLMbYy=FfZP7g@mail.gmail.com>
In-Reply-To: <656f2daa-53a2-40d2-5fdc-b570473d56bc@selasky.org>
References:  <CAKZ7KuJdwaUZJrb4=XdQrVo_Nq2T%2B%2BzN_B3PT9W5Kh7ORx26HQ@mail.gmail.com> <d83ef6de-f3c8-3e1e-93a8-6a81799acac6@selasky.org> <656f2daa-53a2-40d2-5fdc-b570473d56bc@selasky.org>

Hi Hans,

Thanks for the reply and the suggestions.

> Have you tested using FreeBSD main / 14 ?

I tested 14.0-CURRENT built on 2023-04-27, and it is indeed much improved.
Now the TCP sender reaches 100Mbps in 4 seconds on a link with 100ms delay.

% uname -a
FreeBSD 14.0-CURRENT #0 main-n262599-60167184abd5: Thu Apr 27 08:09:50 UTC 2023

schen@freebsd14:~/recipes/tpc % bin/tcpperf -c 192.168.0.1 -t 6
Connected 192.168.0.100:59302 -> 192.168.0.1:2009, congestion control: cubic
Time (s)  Throughput   Bitrate    Cwnd    Rwnd  sndbuf  ssthresh  rtt/var
  0.000s   0.00kB/s   0.00kbps  14.1Ki  63.6Ki  32.8Ki    1024Mi  97.8ms/2500
  1.014s    776kB/s   6205kbps   166Ki   992Ki   313Ki    1024Mi  100.0ms/1875
  2.021s   3643kB/s   29.1Mbps   495Ki  1491Ki  1017Ki    1024Mi  100.0ms/1875
  3.029s   7544kB/s   60.3Mbps   932Ki  2096Ki  1817Ki    1024Mi  100.0ms/1875
  4.036s   12.9MB/s    103Mbps  1729Ki  3064Ki  1817Ki    1024Mi  100.0ms/1875
  5.046s   18.2MB/s    145Mbps  2606Ki  3056Ki  1817Ki    1024Mi  96.9ms/6875
  6.090s   17.8MB/s    143Mbps  3074Ki  2974Ki  1817Ki    1024Mi  113.4ms/11250
Sender   transferred 62.0MBytes in 6.090s, throughput: 10.2MBytes/s, 81.4Mbits/s
Receiver transferred 62.0MBytes in 6.191s, throughput: 10.0MBytes/s, 80.1Mbits/s

Cwnd increased much faster than on 13.2-RELEASE.
From the 5th second on, the throughput is limited by sndbuf: 1817Ki / 100ms
is roughly 18MB/s, in line with the measured 18.2MB/s.
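
As a rough cross-check of that sndbuf limit, here is a minimal sketch of the
window-over-RTT bound. The helper name window_limited_rate is made up for
illustration (it is not part of tcpperf or the kernel), and it assumes that
"Ki" in the table means 1024 bytes:

```c
#include <assert.h>

/*
 * Upper bound on throughput when at most window_bytes can be in flight
 * per round trip. Returns bytes per second.
 * Hypothetical helper, for illustration only.
 */
static double window_limited_rate(double window_bytes, double rtt_seconds)
{
    assert(window_bytes >= 0.0);
    assert(rtt_seconds > 0.0);
    return window_bytes / rtt_seconds;
}
```

Calling window_limited_rate(1817.0 * 1024.0, 0.100) gives about 18.6e6
bytes/s, which is in the same ballpark as the 18.2MB/s reported above.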

Interestingly, it's not due to lro_nsegs, but a side effect of
https://reviews.freebsd.org/D32693.
Namely, the one-line change below fixed (or vastly improved) slow-start in 13.x:

--- a/usr/src/sys/conf/files 2023-04-06 17:34:41.000000000 -0700
+++ b/usr/src/sys/conf/files 2023-05-02 23:00:38.000000000 -0700
@@ -4412,6 +4412,7 @@
 netinet/raw_ip.c optional inet | inet6
 netinet/cc/cc.c optional inet | inet6
 netinet/cc/cc_newreno.c optional inet | inet6
+netinet/khelp/h_ertt.c optional inet | inet6
 netinet/sctp_asconf.c optional inet sctp | inet6 sctp
 netinet/sctp_auth.c optional inet sctp | inet6 sctp
 netinet/sctp_bsd_addr.c optional inet sctp | inet6 sctp

Here's the tcpdump after compiling netinet/khelp/h_ertt.c into the 13.x
kernel by default:

 0.000 IP src > sink: Flags [S], seq 392582262, win 65535, options
[mss 1460,nop,wscale 6,sackOK,TS val 840935345 ecr 0], length 0
 0.100 IP sink > src: Flags [S.], seq 3065702766, ack 392582263, win
65160, options [mss 1460,sackOK,TS val 408756323 ecr
840935345,nop,wscale 7], length 0
 0.100 IP src > sink: Flags [.], ack 1, win 1027, options [nop,nop,TS
val 840935450 ecr 408756323], length 0

 // First round-trip: cwnd = 10 * MSS
 0.101 IP src > sink: [.], seq 1:14481, ack 1, win 1027, length 14480
 0.201 IP sink > src: [.], ack 14481, win 445, length 0

 // cwnd += 2 * MSS, but sent two segments, for better RTT calculation
 0.201 IP src > sink: [.], seq 14481:15929, ack 1, win 1027, length 1448
 0.202 IP src > sink: [.], seq 15929:31857, ack 1, win 1027, length 15928
 // cwnd == 12 here

 // Got ACK for the 1448 segment, cwnd += 1 * MSS, sent two more segs.
 0.302 IP sink > src: [.], ack 15929, win 501, length 0
 0.302 IP src > sink: [.], seq 31857:33305, ack 1, win 1027, length 1448
 0.302 IP src > sink: [.], seq 33305:34753, ack 1, win 1027, length 1448
 // cwnd == 13 here

 // Got ACK for the 15928 segment, cwnd += 2 * MSS, sent 13-MSS segment
 0.302 IP sink > src: [.], ack 31857, win 440, length 0
 0.302 IP src > sink: [.], seq 34753:53577, ack 1, win 1027, length 18824
 // cwnd == 15 here, bytes in flight = 15 * MSS

 // ACK of 1448 bytes, sent two more segments, typical slow-start
 0.403 IP sink > src: [.], ack 33305, win 501, length 0
 0.403 IP src > sink: [.], seq 53577:55025, ack 1, win 1027, length 1448
 0.403 IP src > sink: [.], seq 55025:56473, ack 1, win 1027, length 1448
 // ACK of 1448 bytes, sent 2-MSS segment, typical slow-start with TSO
 0.403 IP sink > src: [.], ack 34753, win 496, length 0
 0.403 IP src > sink: [.], seq 56473:59369, ack 1, win 1027, length 2896
 // cwnd == 17 here

 // ACK of 18824, cwnd += 2 * MSS, sent 15-MSS segment
 0.403 IP sink > src: [.], ack 53577, win 795, length 0
 0.403 IP src > sink: [.], seq 59369:81089, ack 1, win 1027, length 21720
 // cwnd == 19 here, bytes in flight = 19 * MSS
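
The cwnd annotations in the trace above can be reproduced with a small sketch
of the slow-start rule cwnd += min(bytes_acked, 2 * MSS). This is a user-space
illustration, not kernel code: the function names are hypothetical, and
MSS = 1448 is taken from the segment sizes in the trace.

```c
#include <assert.h>
#include <stddef.h>

#define MSS   1448u  /* payload bytes per segment, as seen in the trace */
#define ABC_L 2u     /* abc_l_var, assumed to be at its default of 2 */

/* Slow-start increase for one ACK: cwnd += min(bytes_acked, L * MSS). */
static unsigned slow_start_on_ack(unsigned cwnd, unsigned bytes_acked)
{
    unsigned limit = ABC_L * MSS;

    assert(bytes_acked > 0);
    return cwnd + (bytes_acked < limit ? bytes_acked : limit);
}

/* Apply a sequence of ACKs in order and return the resulting cwnd (bytes). */
static unsigned replay_acks(unsigned cwnd, const unsigned *acked, size_t n)
{
    for (size_t i = 0; i < n; i++)
        cwnd = slow_start_on_ack(cwnd, acked[i]);
    return cwnd;
}
```

Feeding it the six ACKs from the trace (14480, 1448, 15928, 1448, 1448 and
18824 bytes), starting from 10 * MSS, ends at 19 * MSS, which matches the
last comment.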

marked_packet_rtt() in h_ertt.c sometimes turns off TSO to get a better RTT
measurement. As a result, more segments are sent and more ACKs are received,
so cwnd can increase faster.
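
To illustrate why more ACKs help, here is a deliberately simplified model. It
rests on assumptions that are not literally what the kernel does: the whole
cwnd is sent once per RTT, the receiver returns one ACK per segs_per_ack
segments, and each ACK is credited min(segments covered, 2). The names
rounds_to_reach and segs_per_ack are hypothetical.

```c
#include <assert.h>

#define ABC_L 2u  /* per-ACK credit cap in segments (abc_l_var = 2 assumed) */

/*
 * Number of round trips until cwnd (counted in segments) reaches target,
 * when every ACK covers up to segs_per_ack segments.
 */
static unsigned rounds_to_reach(unsigned cwnd, unsigned target,
                                unsigned segs_per_ack)
{
    unsigned rounds = 0;

    assert(cwnd > 0 && segs_per_ack > 0);
    while (cwnd < target) {
        unsigned remaining = cwnd;  /* segments sent in this round */
        unsigned credit = 0;        /* cwnd growth earned in this round */

        while (remaining > 0) {
            unsigned covered =
                remaining < segs_per_ack ? remaining : segs_per_ack;

            credit += covered < ABC_L ? covered : ABC_L;
            remaining -= covered;
        }
        cwnd += credit;
        rounds++;
    }
    return rounds;
}
```

In this model, growing from 10 to 40 segments takes 2 round trips with one ACK
per 2 segments, but 15 round trips when a single ACK can cover up to 45
segments.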

It really sounds like a butterfly effect to me.

Regards,
Shuo

On Tue, May 2, 2023 at 3:04 AM Hans Petter Selasky <hps@selasky.org> wrote:
>
> On 5/2/23 11:14, Hans Petter Selasky wrote:
> > Hi Chen!
> >
> > The FreeBSD mbufs carry the number of ACKs that have been joined
> > together into the following field:
> >
> > m->m_pkthdr.lro_nsegs
> >
> > Can this value be of any use to cc_newreno ?
> >
> > --HPS
>
> Hi Chen,
>
> Have you tested using FreeBSD main / 14 ?
>
> The "nsegs" are passed along like this:
>
> nsegs = max(1, m->m_pkthdr.lro_nsegs);
>
> ...
>
> cc_ack_received(tp, th, nsegs, CC_ACK);
>
> ...
>
> (Newreno - FreeBSD-14)
>
>                                  incr = min(ccv->bytes_this_ack,
>                                      ccv->nsegs * abc_val *
>                                      CCV(ccv, t_maxseg));
>
> And in FreeBSD-10 being mentioned in your article:
>
> (Newreno - FreeBSD-10)
>
>                                  incr = min(ccv->bytes_this_ack,
>                                      V_tcp_abc_l_var * CCV(ccv, t_maxseg));
>
>
> There is no such thing.
>
> This issue may already have been fixed!
>
> --HPS
> >
> > On 5/2/23 09:46, Chen Shuo wrote:
> >> As per newreno_ack_received() in sys/netinet/cc/cc_newreno.c, the
> >> FreeBSD TCP sender strictly follows RFC 5681 with the RFC 3465 extension.
> >> That is, during slow-start, when receiving an ACK of 'bytes_acked':
> >>
> >>      cwnd += min(bytes_acked, abc_l_var * SMSS);  // abc_l_var = 2 dflt
> >>
> >> As discussed in sec3.2 of RFC 3465, L=2*SMSS bytes exactly balances
> >> the negative impact of the delayed ACK algorithm.  RFC 5681 also
> >> requires that a receiver SHOULD generate an ACK for at least every
> >> second full-sized segment, so bytes_acked per ACK is at most 2 * SMSS.
> >> If both sender and receiver follow it, cwnd should grow exponentially
> >> during slow-start:
> >>
> >>      cwnd *= 2    (per RTT)
> >>
> >> However, LRO and TSO are widely used today, so a receiver may generate
> >> far fewer ACKs than it used to.  As I observed, both FreeBSD and
> >> Linux generate at most one ACK per segment assembled by LRO/GRO.
> >> The worst case is one ACK per 45 MSS, as 45 * 1448 = 65160 < 65535.
> >>
> >> Sending 1MB over a link of 100ms delay from FreeBSD 13.2:
> >>
> >>   0.000 IP sender > sink: Flags [S], seq 205083268, win 65535, options
> >> [mss 1460,nop,wscale 10,sackOK,TS val 495212525 ecr 0], length 0
> >>   0.100 IP sink > sender: Flags [S.], seq 708257395, ack 205083269, win
> >> 65160, options [mss 1460,sackOK,TS val 563185696 ecr
> >> 495212525,nop,wscale 7], length 0
> >>   0.100 IP sender > sink: Flags [.], ack 1, win 65, options [nop,nop,TS
> >> val 495212626 ecr 563185696], length 0
> >>   // TSopt omitted below for brevity.
> >>
> >>   // cwnd = 10 * MSS, sent 10 * MSS
> >>   0.101 IP sender > sink: Flags [.], seq 1:14481, ack 1, win 65,
> >> length 14480
> >>
> >>   // got one ACK for 10 * MSS, cwnd += 2 * MSS, sent 12 * MSS
> >>   0.201 IP sink > sender: Flags [.], ack 14481, win 427, length 0
> >>   0.201 IP sender > sink: Flags [.], seq 14481:31857, ack 1, win 65,
> >> length 17376
> >>
> >>   // got ACK of 12*MSS above, cwnd += 2 * MSS, sent 14 * MSS
> >>   0.301 IP sink > sender: Flags [.], ack 31857, win 411, length 0
> >>   0.301 IP sender > sink: Flags [.], seq 31857:52129, ack 1, win 65,
> >> length 20272
> >>
> >>   // got ACK of 14*MSS above, cwnd += 2 * MSS, sent 16 * MSS
> >>   0.402 IP sink > sender: Flags [.], ack 52129, win 395, length 0
> >>   0.402 IP sender > sink: Flags [P.], seq 52129:73629, ack 1, win 65,
> >> length 21500
> >>   0.402 IP sender > sink: Flags [.], seq 73629:75077, ack 1, win 65,
> >> length 1448
> >>
> >> As a consequence, instead of growing exponentially, cwnd grows
> >> more-or-less quadratically during slow-start, unless abc_l_var is
> >> set to a sufficiently large value.
> >>
> >> NewReno took more than 20 seconds to ramp up throughput to 100Mbps
> >> over an emulated 100ms delay link, while Linux took ~2 seconds.
> >> I can provide the pcap file if anyone is interested.
> >>
> >> Switching to CUBIC won't help, because it uses the logic in NewReno
> >> ack_received() for slow start.
> >>
> >> Is this a well-known issue and abc_l_var is the only cure for it?
> >> https://calomel.org/freebsd_network_tuning.html
> >>
> >> Thank you!
> >>
> >> Best,
> >> Shuo Chen
> >>
> >
> >
>


