Date:      Sun, 9 Apr 2023 10:58:43 -0400
From:      Cheng Cui <cc@freebsd.org>
To:        "Scheffenegger, Richard" <Richard.Scheffenegger@netapp.com>
Cc:        "Rodney W. Grimes" <freebsd-rwg@gndrsh.dnsmgr.net>, Richard Perini <rpp@ci.com.au>,  "freebsd-hackers@FreeBSD.org" <freebsd-hackers@freebsd.org>, "rscheff@FreeBSD.org" <rscheff@freebsd.org>,  "tuexen@freebsd.org" <tuexen@freebsd.org>,  "<freebsd-transport@freebsd.org>" <freebsd-transport@freebsd.org>
Subject:   Re: low TCP speed, wrong rtt measurement
Message-ID:  <CAGaXuiJThYFMfw4+jFM-pxkHvyfg4XPLp=2zf-fT_u33eHP+wg@mail.gmail.com>
In-Reply-To: <PH0PR06MB7639318063FF5D640D16105486949@PH0PR06MB7639.namprd06.prod.outlook.com>
References:  <ZDHuz+/p3EemMnK7@jodi.ci.com.au> <202304090058.3390wrE1020757@gndrsh.dnsmgr.net> <PH0PR06MB7639318063FF5D640D16105486949@PH0PR06MB7639.namprd06.prod.outlook.com>

First of all, we need to make sure there are TCP retransmissions that are
caused by packet loss.
Otherwise, TCP congestion control or cwnd is irrelevant.

Tests like the ones below, using iperf3 or "netstat -s", can report TCP
retransmissions.

For example, over a 20ms, 10Mb/s link, the theoretical max cwnd size is
determined by the Bandwidth Delay Product (BDP):
20ms x 10Mb/s = 25000 Bytes (around 25KB)
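
A tiny sketch of the same arithmetic (the helper name is mine, just to keep
the units straight: bits per second times seconds, divided by 8, gives the
bytes a sender must keep in flight to fill the pipe):

    #include <stdio.h>

    /* Bandwidth-delay product in bytes: the data that must be kept in
     * flight to fill the path, and hence the cwnd needed to do so. */
    static double
    bdp_bytes(double bandwidth_bps, double rtt_seconds)
    {
            return (bandwidth_bps * rtt_seconds / 8.0);
    }

    int
    main(void)
    {
            /* 10 Mb/s with a 20 ms RTT -> 25000 bytes (about 25 KB). */
            printf("BDP = %.0f bytes\n", bdp_bytes(10e6, 0.020));
            return (0);
    }

The same helper gives the 575 byte and 15 KB figures for the 2 Mb/s links
discussed further down.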

cc@s1:~ % ping -c 3 r1
PING r1-link1 (10.1.1.3): 56 data bytes
64 bytes from 10.1.1.3: icmp_seq=0 ttl=64 time=19.807 ms
64 bytes from 10.1.1.3: icmp_seq=1 ttl=64 time=19.387 ms
64 bytes from 10.1.1.3: icmp_seq=2 ttl=64 time=19.488 ms

--- r1-link1 ping statistics ---
3 packets transmitted, 3 packets received, 0.0% packet loss
round-trip min/avg/max/stddev = 19.387/19.561/19.807/0.179 ms

before test:
cc@s1:~ % netstat -sp tcp | egrep "tcp:|retrans|SACK"
tcp:
0 data packets (0 bytes) retransmitted
0 data packets unnecessarily retransmitted
0 retransmit timeouts
0 retransmitted
0 SACK recovery episodes
0 segment rexmits in SACK recovery episodes
0 byte rexmits in SACK recovery episodes
0 SACK options (SACK blocks) received
0 SACK options (SACK blocks) sent
0 SACK retransmissions lost
0 SACK scoreboard overflow

cc@s1:~ % iperf3 -c r1 -t 5 -i 1
Connecting to host r1, port 5201
[  5] local 10.1.1.2 port 49487 connected to 10.1.1.3 port 5201
[ ID] Interval           Transfer     Bitrate         Retr  Cwnd
[  5]   0.00-1.00   sec  2.58 MBytes  21.7 Mbits/sec    7   11.3 KBytes
[  5]   1.00-2.00   sec  1.39 MBytes  11.7 Mbits/sec    2   31.0 KBytes
[  5]   2.00-3.00   sec  1.14 MBytes  9.59 Mbits/sec    4   24.1 KBytes
[  5]   3.00-4.00   sec  1.01 MBytes  8.48 Mbits/sec    3   30.4 KBytes
[  5]   4.00-5.00   sec  1.33 MBytes  11.2 Mbits/sec    4   23.0 KBytes
- - - - - - - - - - - - - - - - - - - - - - - - -
[ ID] Interval           Transfer     Bitrate         Retr
[  5]   0.00-5.00   sec  7.46 MBytes  12.5 Mbits/sec   20             sender
[  5]   0.00-5.02   sec  7.23 MBytes  12.1 Mbits/sec                  receiver

iperf Done.

after test:
cc@s1:~ % netstat -sp tcp | egrep "tcp:|retrans|SACK"
tcp:
20 data packets (28960 bytes) retransmitted            <<
0 data packets unnecessarily retransmitted
0 retransmit timeouts
0 retransmitted
18 SACK recovery episodes
20 segment rexmits in SACK recovery episodes                   <<
28960 byte rexmits in SACK recovery episodes
598 SACK options (SACK blocks) received
0 SACK options (SACK blocks) sent
0 SACK retransmissions lost
0 SACK scoreboard overflow

> I've tried various transfer protocols: ftp, scp, rcp, http: results
> are similar for all.  Ping times for the closest WAN link is 2.3ms,
> furthest is 60ms.  On the furthest link, we get around 15%
> utilisation. Transfer between
> 2 Windows hosts on the furthest link yields ~80% utilisation.

Thus, the theoretical max cwnd the sender can grow to is:
2.3ms x 2Mb/s = 575 Bytes
60ms  x 2Mb/s = 15000 Bytes (around 15KB)

Best Regards,
Cheng Cui


On Sun, Apr 9, 2023 at 5:31 AM Scheffenegger, Richard <
Richard.Scheffenegger@netapp.com> wrote:

> Hi,
>
> Adding fbsd-transport too.
>
> For stable-12, I believe all relevant (algorithm) improvements went in.
>
> However, 12.2 is missing D26807 and D26808 - improvements to Cubic's
> response to retransmission timeouts (but these are not material).
>
> While 12.1 has none of the improvements done in 2020 to the Cubic module
> - D18954, D18982, D19118, D23353, D23655, D25065, D25133, D25744, D24657,
> D25746, D25976, D26060, D26807, D26808.
>
> These should fix numerous issues in cubic which would very likely make it
> perform poorly, particularly on longer-duration sessions.
>
> However, Cubic is heavily reliant on a valid measurement of RTT and the
> epoch since the last congestion response (measured in units of RTT). An
> issue in getting RTT measured properly would derail cubic for sure (most
> likely cubic would inflate cwnd much faster, then run into significant
> packet loss, very likely loss of retransmissions, followed by
> retransmission timeouts, and shrinking of the ssthresh to small values).
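
For reference, the shape of that dependency as defined in RFC 8312 (a rough
sketch with the RFC's default constants, not the cc_cubic module's code):
cwnd is grown back as a cubic function of the time since the last congestion
event, so a bogus RTT or epoch measurement translates directly into how
aggressively the window is inflated.

    #include <math.h>
    #include <stdio.h>

    /* RFC 8312 CUBIC window, in segments.  w_max is cwnd at the last
     * congestion event, t is seconds since that event. */
    #define CUBIC_C         0.4
    #define CUBIC_BETA      0.7

    static double
    cubic_window(double w_max, double t)
    {
            double k = cbrt(w_max * (1.0 - CUBIC_BETA) / CUBIC_C);

            return (CUBIC_C * pow(t - k, 3.0) + w_max);
    }

    int
    main(void)
    {
            /* e.g. w_max of 20 segments, one second after the loss. */
            printf("W_cubic(1s) = %.1f segments\n", cubic_window(20.0, 1.0));
            return (0);
    }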
>
>
> I haven't looked into cc_vegas or the ertt module though.
>
> One more initial question: Are you using timestamps on that long, thin
> pipe - or is net.inet.tcp.rfc1323 disabled? (More recent versions allow the
> selective enabling/disabling of window scaling and timestamps independent
> of each other, but I don't think this is in any 12 release; see D36863.)
>
> Finally, you could be using SIFTR to track the evolution of the minrtt
> value over the course of the session.
>
> Although I suspect ultimately a tcpdump including the TCP header (-s 80),
> and the SIFTR internal state evolution, would be optimal for understanding
> when and why the RTT values go off the rails.
>
>
> At first glance, the ertt module may be prone to miscalculations when
> retransmissions are in play - no special precautions appear to be present
> to distinguish between the originally sent packet and any retransmission,
> nor any filtering of ACKs which come in as duplicates. Thus there could be
> a scenario where an ACK for a spurious retransmission, e.g. due to
> reordering, could lead to a wrong baseline RTT measurement, which is
> physically impossible on such a long distance connection...
>
> But again, I haven't looked into the ertt module so far at all.
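
A toy illustration of that ambiguity (hypothetical names, not the h_ertt.c
logic): timestamps are in ticks as with tcp_ts_getticks(); a segment is sent
at tick 100 and retransmitted at tick 160, and the ACK that arrives at tick
160 was really triggered by the original copy (a spurious retransmit after
reordering).  Measured against the retransmission's send time, the sample
collapses to one tick, and that bogus value then poisons minrtt:

    #include <stdint.h>
    #include <stdio.h>

    struct tx_sample {
            uint32_t tx_ts;         /* ticks when this copy was sent */
    };

    /* Same "+1" as in the h_ertt.c line quoted further down. */
    static uint32_t
    rtt_sample(const struct tx_sample *txsi, uint32_t now)
    {
            return (now - txsi->tx_ts + 1);
    }

    int
    main(void)
    {
            struct tx_sample orig = { .tx_ts = 100 };
            struct tx_sample rexmit = { .tx_ts = 160 };
            uint32_t ack_arrival = 160;

            printf("true RTT sample:  %u ticks\n", rtt_sample(&orig, ack_arrival));
            printf("bogus RTT sample: %u ticks\n", rtt_sample(&rexmit, ack_arrival));
            return (0);
    }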
>
> How do the base stack RTT-related values look on these misbehaving
> sessions?
> Tcpcb-> t_rttmin, t_srtt, t_rttvar, t_rxtcur, t_rtttime, t_rtseq,
> t_rttlow, t_rttupdated
>
> Best regards,
>   Richard
>
>
>
>
> -----Original Message-----
> From: Rodney W. Grimes <freebsd-rwg@gndrsh.dnsmgr.net>
> Sent: Sunday, 9 April 2023 02:59
> To: Richard Perini <rpp@ci.com.au>
> Cc: freebsd-hackers@FreeBSD.org; rscheff@FreeBSD.org
> Subject: Re: low TCP speed, wrong rtt measurement
>
> NetApp Security WARNING: This is an external email. Do not click links or
> open attachments unless you recognize the sender and know the content is
> safe.
>
>
>
>
> > On Tue, Apr 04, 2023 at 02:46:34PM -0000, Peter 'PMc' Much wrote:
> > > ** maybe this should rather go the -net list, but then
> > > ** there are only bug messages
> > >
> > > Hi,
> > >   I'm trying to transfer backup data via WAN; the link bandwidth is
> > > only ~2 Mbit, but this can well run for days and just saturate the
> > > spare bandwidth.
> > >
> > > The problem is, it doesn't saturate the bandwidth.
> > >
> > > I found that the backup application opens the socket in this way:
> > >       if ((fd = socket(ipaddr->GetFamily(), SOCK_STREAM, 0)) < 0) {
> > >
> > > Apparently that doesn't work well. So I patched the application to
> > > do it this way:
> > > -      if ((fd = socket(ipaddr->GetFamily(), SOCK_STREAM, 0)) < 0) {
> > > +      if ((fd = socket(ipaddr->GetFamily(), SOCK_STREAM,
> > > + IPPROTO_TCP)) < 0) {
> > >
> > > The result, observed with tcpdump, was now noticeably different, but
> > > rather worse than better.
> > >
> > > I tried various cc algorithms; all behaved very badly with the
> > > exception of cc_vegas. Vegas, after tuning the alpha and beta, gave
> > > satisfying results with less than 1% tradeoff.
> > >
> > > But only for a time. After transferring for a couple of hours the
> > > throughput went bad again:
> > >
> > > # netstat -aC
> > > Proto Recv-Q Send-Q Local Address     Foreign Address    (state)      CC      cwin  ssthresh   MSS ECN
> > > tcp6       0  57351 edge-jo.26996     pole-n.22          ESTABLISHED  vegas  22203     10392  1311 off
> > > tcp4       0 106305 edge-e.62275      pole-n.bacula-sd   ESTABLISHED  vegas  11943      5276  1331 off
> > >
> > > The first connection is freshly created. The second one has been running
> > > for a day already, and it is obviously hosed - it doesn't recover.
> > >
> > > # sysctl net.inet.tcp.cc.vegas
> > > net.inet.tcp.cc.vegas.beta: 14
> > > net.inet.tcp.cc.vegas.alpha: 8
> > >
> > > 8 (alpha) x 1331 (mss) = 10648
> > >
> > > The cwin is adjusted to precisely one tick above the alpha, and
> > > doesn't rise further. (Increasing the alpha further does solve the
> > > issue for this connection - but that is not how things are supposed
> > > to
> > > work.)
> > >
> > > Now I tried to look into the data that vegas would use for its
> > > decisions, and found this:
> > >
> > > # dtrace -n 'fbt:kernel:vegas_ack_received:entry { printf("%s %u %d
> > > %d %d %d", execname,\ (*((struct tcpcb **)(arg0+24)))->snd_cwnd,\
> > > ((struct ertt *)((*((struct tcpcb
> > > **)(arg0+24)))->osd->osd_slots[0]))->minrtt,\
> > > ((struct ertt *)((*((struct tcpcb
> > > **)(arg0+24)))->osd->osd_slots[0]))->marked_snd_cwnd,\
> > > ((struct ertt *)((*((struct tcpcb
> > > **)(arg0+24)))->osd->osd_slots[0]))->bytes_tx_in_marked_rtt,\
> > > ((struct ertt *)((*((struct tcpcb
> > > **)(arg0+24)))->osd->osd_slots[0]))->markedpkt_rtt);\
> > > }'
> > > CPU     ID                    FUNCTION:NAME
> > >   6  17478         vegas_ack_received:entry ng_queue 11943 1 11943 10552 131
> > >  17  17478         vegas_ack_received:entry ng_queue 22203 56 22203 20784 261
> > >  17  17478         vegas_ack_received:entry ng_queue 22203 56 22203 20784 261
> > >   3  17478         vegas_ack_received:entry ng_queue 11943 1 11943 10552 131
> > >   5  17478         vegas_ack_received:entry ng_queue 22203 56 22203 20784 261
> > >  17  17478         vegas_ack_received:entry ng_queue 11943 1 11943 10552 131
> > >  11  17478         vegas_ack_received:entry ng_queue 11943 1 11943 10552 106
> > >  15  17478         vegas_ack_received:entry ng_queue 22203 56 22203 20784 261
> > >  13  17478         vegas_ack_received:entry ng_queue 22203 56 22203 20784 261
> > >  16  17478         vegas_ack_received:entry ng_queue 11943 1 11943 10552 106
> > >   3  17478         vegas_ack_received:entry ng_queue 22203 56 22203 20784 261
> > >
> > > One can see that the "minrtt" value for the freshly created
> > > connection is 56 (which is very plausible).
> > > But the old and hosed connection shows minrtt = 1, which explains
> > > the observed cwin.
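
Plugging those numbers into the Vegas decision makes the lock-up visible.  A
rough sketch of the expected-vs-actual rate comparison (paraphrased from the
algorithm, not the actual cc_vegas.c code; rates are bytes per tick, ndiff is
the estimated number of extra segments queued in the path):

    #include <stdio.h>

    static long
    vegas_ndiff(long marked_cwnd, long minrtt, long bytes_tx_in_marked_rtt,
        long markedpkt_rtt, long mss)
    {
            long expected = marked_cwnd / minrtt;
            long actual = bytes_tx_in_marked_rtt / markedpkt_rtt;

            return ((expected - actual) * minrtt / mss);
    }

    int
    main(void)
    {
            /* Values from the dtrace output of the hosed connection. */
            long ndiff = vegas_ndiff(11943, 1, 10552, 131, 1331);

            /* With alpha = 8 and beta = 14, cwnd is only increased while
             * ndiff < alpha, so a minrtt of 1 tick pins cwnd near
             * alpha * mss. */
            printf("ndiff = %ld segments\n", ndiff);
            return (0);
    }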
> > >
> > > The minrtt gets calculated in sys/netinet/khelp/h_ertt.c:
> > >               e_t->rtt = tcp_ts_getticks() - txsi->tx_ts + 1;
> > > There is a "+1", so this was apparently zero.
> > >
> > > But source and destination are at least 1000 km apart. So either we
> > > have had one of the rare occasions of hyperspace tunnelling, or
> > > something is going wrong in the ertt measurement code.
> > >
> > > For now this is a one-time observation, but it might also explain
> > > why the other cc algorithms behaved badly. These algorithms are
> > > widely in use and should work - the ertt measurement however is the
> > > same for all of them.
> >
> > I can confirm I am seeing similar problems transferring files to our
> > various production sites around Australia. Various types/sizes of links
> > and bandwidths.
> > I can saturate the nearby links, but the link utilisation/saturation
> > decreases with distance.
> >
> > I've tried various transfer protocols: ftp, scp, rcp, http: results
> > are similar for all.  Ping times for the closest WAN link is 2.3ms,
> > furthest is 60ms.  On the furthest link, we get around 15%
> > utilisation. Transfer between
> > 2 Windows hosts on the furthest link yields ~80% utilisation.
>
> Windows should be using cc_cubic; you say above you had tried all the
> congestion algorithms, and only cc_vegas after tuning gave good results.
>
> >
> > FreeBSD versions involved are 12.1 and 12.2.
>
> I wonder if cc_cubic is broken in 12.X; it should give similar results to
> Windows if things are working correctly.
>
> I am adding Richard Scheffenegger as he is the most recent expert on the
> congestion control code in FreeBSD.
>
> > --
> > Richard Perini
> > Ramico Australia Pty Ltd   Sydney, Australia   rpp@ci.com.au  +61 2 9552 5500
> > -----------------------------------------------------------------------------
> > "The difference between theory and practice is that in theory there is no
> > difference, but in practice there is"
>
> --
> Rod Grimes
> rgrimes@freebsd.org
>
