From: "Rodney W. Grimes" <freebsd-rwg@gndrsh.dnsmgr.net>
Message-Id: <202304090058.3390wrE1020757@gndrsh.dnsmgr.net>
Subject: Re: low TCP speed, wrong rtt measurement
To: Richard Perini
CC: freebsd-hackers@FreeBSD.org, rscheff@FreeBSD.org
Date: Sat, 8 Apr 2023 17:58:53 -0700 (PDT)
List-Id: Technical discussions relating to FreeBSD
List-Archive: https://lists.freebsd.org/archives/freebsd-hackers

> On Tue, Apr 04, 2023 at 02:46:34PM -0000, Peter 'PMc' Much wrote:
> > ** maybe this should rather go to the -net list, but then
> > ** there are only bug messages
> >
> > Hi,
> >
> > I'm trying to transfer backup data via WAN; the link bandwidth is
> > only ~2 Mbit, but this can well run for days and just saturate the
> > spare bandwidth.
> >
> > The problem is, it doesn't saturate the bandwidth.
> >
> > I found that the backup application opens the socket in this way:
> >
> >     if ((fd = socket(ipaddr->GetFamily(), SOCK_STREAM, 0)) < 0) {
> >
> > Apparently that doesn't work well. So I patched the application to do
> > it this way:
> >
> >     - if ((fd = socket(ipaddr->GetFamily(), SOCK_STREAM, 0)) < 0) {
> >     + if ((fd = socket(ipaddr->GetFamily(), SOCK_STREAM, IPPROTO_TCP)) < 0) {
> >
> > The result, observed with tcpdump, was now noticeably different, but
> > rather worse than better.
> >
> > I tried various cc algorithms; all behaved very badly with the
> > exception of cc_vegas. Vegas, after tuning the alpha and beta, gave
> > satisfying results with less than 1% tradeoff.
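As a side note, a minimal standalone sketch of the patched call (with a literal AF_INET standing in for the application's ipaddr->GetFamily(), which is not shown in the post). For SOCK_STREAM over IPv4/IPv6, a protocol argument of 0 should already resolve to TCP, so in principle both variants ought to end up with the same protocol:

```c
#include <sys/socket.h>
#include <netinet/in.h>

/* Open a TCP stream socket with the protocol named explicitly, as in
 * the patch quoted above.  For SOCK_STREAM a third argument of 0
 * should select TCP as well, so both variants ought to reach the
 * same protocol.  The family argument here stands in for the
 * application's ipaddr->GetFamily(). */
int create_tcp_socket(int family)
{
    return socket(family, SOCK_STREAM, IPPROTO_TCP);
}
```

Any behavioral difference between the 0 and IPPROTO_TCP variants, as reported above, would be surprising and worth a closer look with tcpdump.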
> >
> > But only for a time. After transferring for a couple of hours the
> > throughput went bad again:
> >
> > # netstat -aC
> > Proto Recv-Q Send-Q Local Address  Foreign Address   (state)     CC    cwin  ssthresh MSS  ECN
> > tcp6       0  57351 edge-jo.26996  pole-n.22         ESTABLISHED vegas 22203 10392    1311 off
> > tcp4       0 106305 edge-e.62275   pole-n.bacula-sd  ESTABLISHED vegas 11943 5276     1331 off
> >
> > The first connection is freshly created. The second one has been
> > running for a day already, and it is obviously hosed - it doesn't
> > recover.
> >
> > # sysctl net.inet.tcp.cc.vegas
> > net.inet.tcp.cc.vegas.beta: 14
> > net.inet.tcp.cc.vegas.alpha: 8
> >
> > 8 (alpha) x 1331 (mss) = 10648
> >
> > The cwin is adjusted to precisely one tick above the alpha, and
> > doesn't rise further. (Increasing the alpha further does solve the
> > issue for this connection - but that is not how things are supposed
> > to work.)
> >
> > Now I tried to look into the data that vegas would use for its
> > decisions, and found this:
> >
> > # dtrace -n 'fbt:kernel:vegas_ack_received:entry { printf("%s %u %d %d %d %d", execname,\
> >   (*((struct tcpcb **)(arg0+24)))->snd_cwnd,\
> >   ((struct ertt *)((*((struct tcpcb **)(arg0+24)))->osd->osd_slots[0]))->minrtt,\
> >   ((struct ertt *)((*((struct tcpcb **)(arg0+24)))->osd->osd_slots[0]))->marked_snd_cwnd,\
> >   ((struct ertt *)((*((struct tcpcb **)(arg0+24)))->osd->osd_slots[0]))->bytes_tx_in_marked_rtt,\
> >   ((struct ertt *)((*((struct tcpcb **)(arg0+24)))->osd->osd_slots[0]))->markedpkt_rtt);\
> > }'
> > CPU     ID                   FUNCTION:NAME
> >   6  17478  vegas_ack_received:entry ng_queue 11943  1 11943 10552 131
> >  17  17478  vegas_ack_received:entry ng_queue 22203 56 22203 20784 261
> >  17  17478  vegas_ack_received:entry ng_queue 22203 56 22203 20784 261
> >   3  17478  vegas_ack_received:entry ng_queue 11943  1 11943 10552 131
> >   5  17478  vegas_ack_received:entry ng_queue 22203 56 22203 20784 261
> >  17  17478  vegas_ack_received:entry ng_queue 11943  1 11943 10552 131
> >  11  17478  vegas_ack_received:entry ng_queue 11943  1 11943 10552 106
> >  15  17478  vegas_ack_received:entry ng_queue 22203 56 22203 20784 261
> >  13  17478  vegas_ack_received:entry ng_queue 22203 56 22203 20784 261
> >  16  17478  vegas_ack_received:entry ng_queue 11943  1 11943 10552 106
> >   3  17478  vegas_ack_received:entry ng_queue 22203 56 22203 20784 261
> >
> > One can see that the "minrtt" value for the freshly created connection
> > is 56 (which is very plausible).
> > But the old and hosed connection shows minrtt = 1, which explains the
> > observed cwin.
> >
> > The minrtt gets calculated in sys/netinet/khelp/h_ertt.c:
> >
> >     e_t->rtt = tcp_ts_getticks() - txsi->tx_ts + 1;
> >
> > There is a "+1", so this was apparently zero.
> >
> > But source and destination are at least 1000 km apart. So either we
> > have had one of the rare occasions of hyperspace tunnelling, or
> > something is going wrong in the ertt measurement code.
> >
> > For now this is a one-time observation, but it might also explain why
> > the other cc algorithms behaved badly. These algorithms are widely in
> > use and should work - the ertt measurement, however, is the same for
> > all of them.
>
> I can confirm I am seeing similar problems transferring files to our
> various production sites around Australia. Various types/sizes of links
> and bandwidths. I can saturate the nearby links, but the link
> utilisation/saturation decreases with distance.
>
> I've tried various transfer protocols: ftp, scp, rcp, http; results are
> similar for all. Ping time for the closest WAN link is 2.3ms, for the
> furthest it is 60ms. On the furthest link, we get around 15% utilisation.
> Transfer between 2 Windows hosts on the furthest link yields ~80%
> utilisation.

Windows should be using cc_cubic; you say above you had tried all the
congestion algorithms, and only cc_vegas after tuning gave good results.

> FreeBSD versions involved are 12.1 and 12.2.
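Since minrtt feeds directly into the Vegas window decision, a toy model (my own simplification, not the cc_vegas.c code; alpha, beta, mss, and the RTT values are taken from the numbers quoted above) shows how a single near-zero RTT sample pins cwnd just above alpha segments:

```c
/* Simplified model -- NOT the kernel implementation -- of the Vegas
 * cwnd decision.  Vegas estimates how many segments are queued in the
 * network from the gap between expected throughput (cwnd/minrtt) and
 * actual throughput (cwnd/rtt), grows the window while that estimate
 * is below alpha, and shrinks it above beta.  RTTs are in ticks,
 * cwnd and mss in bytes. */
long vegas_next_cwnd(long cwnd, long mss, long minrtt, long rtt,
                     long alpha, long beta)
{
    /* segments estimated queued: cwnd * (1 - minrtt/rtt) / mss */
    long ndiff = (cwnd * (rtt - minrtt)) / (rtt * mss);

    if (ndiff < alpha)
        return cwnd + mss;   /* path under-used: grow linearly */
    if (ndiff > beta)
        return cwnd - mss;   /* queueing too much: shrink */
    return cwnd;             /* inside [alpha, beta]: hold */
}
```

Iterating this from one segment with minrtt poisoned to 1 tick and a real RTT of 56 ticks stalls the window at 9 x 1331 = 11979 bytes, essentially where the netstat output shows the hosed connection stuck (cwin 11943, "one tick above the alpha"), while a healthy minrtt of 56 against RTTs of ~60 lets cwnd keep climbing. Because minrtt is a running minimum, one bogus zero-tick sample caps the connection permanently.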
I wonder if cc_cubic is broken in 12.X; it should give similar results to
Windows if things are working correctly. I am adding Richard Scheffenegger,
as he is the most recent expert on the congestion control code in FreeBSD.

> --
> Richard Perini
> Ramico Australia Pty Ltd   Sydney, Australia   rpp@ci.com.au  +61 2 9552 5500
> -----------------------------------------------------------------------------
> "The difference between theory and practice is that in theory there is no
>  difference, but in practice there is"

-- 
Rod Grimes                                                 rgrimes@freebsd.org