Date: Sun, 9 Apr 2023 08:46:39 +1000
From: Richard Perini <rpp@ci.com.au>
To: freebsd-hackers@freebsd.org
Subject: Re: low TCP speed, wrong rtt measurement
Message-ID: <ZDHuz%2B/p3EemMnK7@jodi.ci.com.au>
In-Reply-To: <slrnu2oe2a.1uri.pmc@disp.intra.daemon.contact>
References: <slrnu2oe2a.1uri.pmc@disp.intra.daemon.contact>
On Tue, Apr 04, 2023 at 02:46:34PM -0000, Peter 'PMc' Much wrote:
> ** maybe this should rather go to the -net list, but then
> ** there are only bug messages
> 
> Hi,
>   I'm trying to transfer backup data via WAN; the link bandwidth is
> only ~2 Mbit, but this can well run for days and just saturate the spare
> bandwidth. 
> 
> The problem is, it doesn't saturate the bandwidth.
> 
> I found that the backup application opens the socket in this way:
>       if ((fd = socket(ipaddr->GetFamily(), SOCK_STREAM, 0)) < 0) {
> 
> Apparently that doesn't work well. So I patched the application to do
> it this way:
> -      if ((fd = socket(ipaddr->GetFamily(), SOCK_STREAM, 0)) < 0) {
> +      if ((fd = socket(ipaddr->GetFamily(), SOCK_STREAM, IPPROTO_TCP)) < 0) {
> 
> The result, observed with tcpdump, was now noticeably different, but
> rather worse than better.
> 
> I tried various cc algorithms; all behaved very badly, with the exception
> of cc_vegas. Vegas, after tuning the alpha and beta, gave satisfying
> results with less than 1% tradeoff.
> 
> But only for a time. After transferring for a couple of hours the
> throughput went bad again:
> 
> # netstat -aC
> Proto Recv-Q Send-Q Local Address          Foreign Address        (state)     CC          cwin   ssthresh   MSS ECN
> tcp6       0  57351 edge-jo.26996          pole-n.22              ESTABLISHED vegas      22203      10392  1311 off
> tcp4       0 106305 edge-e.62275           pole-n.bacula-sd       ESTABLISHED vegas      11943       5276  1331 off
> 
> The first connection is freshly created. The second one has been running
> for a day already, and it is obviously hosed - it doesn't recover.
> 
> # sysctl net.inet.tcp.cc.vegas
> net.inet.tcp.cc.vegas.beta: 14
> net.inet.tcp.cc.vegas.alpha: 8
> 
> 8 (alpha) x 1331 (mss) = 10648
> 
> The cwin is adjusted to precisely one tick above the alpha, and
> doesn't rise further. (Increasing the alpha further does solve the
> issue for this connection - but that is not how things are supposed to
> work.)
> 
> Now I tried to look into the data that vegas would use for its
> decisions, and found this:
> 
> # dtrace -n 'fbt:kernel:vegas_ack_received:entry { printf("%s %u %d %d %d %d", execname,\
> (*((struct tcpcb **)(arg0+24)))->snd_cwnd,\
> ((struct ertt *)((*((struct tcpcb **)(arg0+24)))->osd->osd_slots[0]))->minrtt,\
> ((struct ertt *)((*((struct tcpcb **)(arg0+24)))->osd->osd_slots[0]))->marked_snd_cwnd,\
> ((struct ertt *)((*((struct tcpcb **)(arg0+24)))->osd->osd_slots[0]))->bytes_tx_in_marked_rtt,\
> ((struct ertt *)((*((struct tcpcb **)(arg0+24)))->osd->osd_slots[0]))->markedpkt_rtt);\
> }'
> CPU     ID                    FUNCTION:NAME
>   6  17478         vegas_ack_received:entry ng_queue 11943 1 11943 10552 131
>  17  17478         vegas_ack_received:entry ng_queue 22203 56 22203 20784 261
>  17  17478         vegas_ack_received:entry ng_queue 22203 56 22203 20784 261
>   3  17478         vegas_ack_received:entry ng_queue 11943 1 11943 10552 131
>   5  17478         vegas_ack_received:entry ng_queue 22203 56 22203 20784 261
>  17  17478         vegas_ack_received:entry ng_queue 11943 1 11943 10552 131
>  11  17478         vegas_ack_received:entry ng_queue 11943 1 11943 10552 106
>  15  17478         vegas_ack_received:entry ng_queue 22203 56 22203 20784 261
>  13  17478         vegas_ack_received:entry ng_queue 22203 56 22203 20784 261
>  16  17478         vegas_ack_received:entry ng_queue 11943 1 11943 10552 106
>   3  17478         vegas_ack_received:entry ng_queue 22203 56 22203 20784 261
> 
> One can see that the "minrtt" value for the freshly created connection
> is 56 (which is very plausible).
> But the old and hosed connection shows minrtt = 1, which explains the
> observed cwin.
> 
> The minrtt gets calculated in sys/netinet/khelp/h_ertt.c:
>               e_t->rtt = tcp_ts_getticks() - txsi->tx_ts + 1;
> There is a "+1", so this was apparently zero.
> 
> But source and destination are at least 1000 km apart. So either we
> have had one of the rare occasions of hyperspace tunnelling, or
> something is going wrong in the ertt measurement code.
> 
> For now this is a one-time observation, but it might also explain why
> the other cc algorithms behaved badly. These algorithms are widely in
> use and should work - the ertt measurement however is the same for all of
> them.
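
The link between minrtt = 1 and the stuck cwin quoted above can be
checked with a small sketch of the textbook Vegas decision. This is an
illustration only: the function and enum names are hypothetical, and it
assumes that cc_vegas compares an integer "ndiff" (expected rate minus
actual rate, scaled back to segments) against alpha and beta. The real
logic lives in the kernel's cc_vegas code and should be consulted
before drawing conclusions.

```c
#include <assert.h>
#include <stdint.h>

/* Possible outcomes of one Vegas window decision (hypothetical names). */
enum vegas_action { VEGAS_INCREASE, VEGAS_HOLD, VEGAS_DECREASE };

/*
 * Estimate how many segments Vegas believes are queued in the network.
 * Inputs correspond to the ertt fields printed by the dtrace script:
 * marked_snd_cwnd, minrtt, bytes_tx_in_marked_rtt, markedpkt_rtt, plus
 * the MSS. Integer division is used throughout, as kernel code would.
 * ASSUMPTION: this mirrors the classic Vegas formula, not necessarily
 * the exact FreeBSD implementation.
 */
static int64_t
vegas_ndiff(int64_t marked_cwnd, int64_t min_rtt, int64_t bytes_tx,
    int64_t marked_rtt, int64_t mss)
{
	assert(min_rtt > 0 && marked_rtt > 0 && mss > 0);

	int64_t expected = marked_cwnd / min_rtt;  /* rate at base RTT */
	int64_t actual = bytes_tx / marked_rtt;    /* rate actually seen */

	return (expected - actual) * min_rtt / mss;
}

/* Grow below alpha, shrink above beta, otherwise leave cwnd alone. */
static enum vegas_action
vegas_decide(int64_t ndiff, int64_t alpha, int64_t beta)
{
	if (ndiff < alpha)
		return VEGAS_INCREASE;
	if (ndiff > beta)
		return VEGAS_DECREASE;
	return VEGAS_HOLD;
}
```

Under these assumptions, the first dtrace sample of the old connection,
vegas_ndiff(11943, 1, 10552, 131, 1331), evaluates to 8, and
vegas_decide(8, 8, 14) holds the window. With minrtt = 1 the "actual"
term becomes negligible, so ndiff is roughly cwnd / MSS and the window
stops growing as soon as it reaches about alpha segments, which would
match the observation that raising alpha unsticks the connection.
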
I can confirm I am seeing similar problems transferring files to our various
production sites around Australia, over links of various types and bandwidths.
I can saturate the nearby links, but the link utilisation decreases with
distance.
I've tried various transfer protocols (ftp, scp, rcp, http); results are
similar for all.  The ping time for the closest WAN link is 2.3 ms, for the
furthest 60 ms.  On the furthest link, we get around 15% utilisation. A
transfer between two Windows hosts on the same link yields ~80% utilisation.
FreeBSD versions involved are 12.1 and 12.2.
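
Utilisation that falls as the round-trip time rises is what one would
expect if the sender is limited by its window rather than by the link,
since at most one window of data can be in flight per round trip. The
sketch below shows that bound; it is a rough aid for reasoning, not a
diagnosis of the links above. The function name is hypothetical, and
the 65535-byte window used in the note afterwards is an illustrative
value (the largest window without window scaling), not a measurement.

```c
#include <assert.h>

/*
 * Upper bound on throughput, in bits per second, for a sender that can
 * keep at most window_bytes in flight per round trip of rtt_seconds.
 * ASSUMPTION: ignores loss, header overhead and slow start.
 */
static double
window_limited_bps(double window_bytes, double rtt_seconds)
{
	assert(window_bytes >= 0.0 && rtt_seconds > 0.0);

	return window_bytes * 8.0 / rtt_seconds;
}
```

For example, window_limited_bps(65535, 0.060) is about 8.7 Mbit/s,
while window_limited_bps(65535, 0.0023) is about 228 Mbit/s, so the
same window that fills a nearby link can leave a distant one mostly
idle.
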
--
Richard Perini  
Ramico Australia Pty Ltd   Sydney, Australia   rpp@ci.com.au  +61 2 9552 5500
-----------------------------------------------------------------------------
"The difference between theory and practice is that in theory there is no
 difference, but in practice there is"
