From: Cheng Cui
Date: Mon, 3 Jul 2023 16:24:00 -0400
Subject: Re: FreeBSD TCP (with iperf3) comparison with Linux
To: Murali Krishnamurthy
Cc: "Scheffenegger, Richard", FreeBSD Transport
List-Archive: https://lists.freebsd.org/archives/freebsd-transport

I see. Sorry for the rather blunt description in my previous email.

If the iperf3 report shows poor throughput and an increasing count in the "Retr" field, and "netstat -sp tcp" shows retransmitted packets but no SACK recovery episodes (SACK is enabled by default), then you are likely hitting the problem I described, and the root cause is TX queue drops. The tcpdump trace file won't show any packet retransmissions and the peer won't be aware of the packet loss, as this is a local problem.

cc@s1:~ % netstat -sp tcp | egrep "tcp:|retrans|SACK"
tcp:
        139 data packets (300416 bytes) retransmitted                  <<
        0 data packets unnecessarily retransmitted
        3 retransmit timeouts
        0 retransmitted
        0 SACK recovery episodes                                       <<
        0 segment rexmits in SACK recovery episodes
        0 byte rexmits in SACK recovery episodes
        0 SACK options (SACK blocks) received
        0 SACK options (SACK blocks) sent
        0 SACK retransmissions lost
        0 SACK scoreboard overflow

Local packet drops due to the TX queue being full can be found with this command, for example:

cc@s1:~ % netstat -i -I bce4 -nd
Name    Mtu Network       Address              Ipkts Ierrs Idrop  Opkts Oerrs Coll Drop
bce4   1500 <Link#5>      00:10:18:56:94:d4   286184     0     0 148079     0    0   54  <<
bce4      - 10.1.1.0/24   10.1.1.2            286183     -     - 582111     -    -    -
cc@s1:~ %

Hope the above stats help you with the root cause analysis. Also, note that increasing the TX queue size is a workaround and the tunables are specific to a particular NIC, but you get the idea.
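
(As a rough sketch of how to correlate the two: the loop below samples the interface error/drop counters once per second while an iperf3 run is in progress. The interface name bce4 and the column positions are assumptions taken from the output above; adjust them for your NIC.)

#!/bin/sh
# Sample the per-interface output error/drop counters once per second.
# "bce4" is just the example NIC from above; use your own name (e.g. vmx0).
IF=bce4
while true; do
    printf '%s ' "$(date '+%T')"
    netstat -i -I "$IF" -nd | awk 'NR == 2 { print "Oerrs=" $9, "Drop=" $NF }'
    sleep 1
done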

Best Regards,
Cheng Cui


On Mon, Jul 3, 2023 at 11:34 AM Murali Krishnamurthy <muralik1@vmware.com> wrote:

Cheng,

Thanks for your inputs.

Sorry, I am not familiar with this area.

A few queries:

"I believe the default values for bce tx/rx pages are 2. And I happened to find
this problem before: when the tx queue was full, it would not enqueue packets
and started returning errors.
And this error was misunderstood by the TCP layer as retransmission."

Could you please elaborate on what is misunderstood by TCP here? Loss of packets should anyway lead to retransmissions.

Could you point to some stats where I can see such drops due to the queue getting full?

I have a vmx interface in my VM and I have attached a screenshot of the ifconfig command for that.

Anything we can understand from that?

Will your suggestion of increasing tx_pages=4 and rx_pages=4 work for this? If so, I assume the names would be hw.vmx.tx_pages=4 and hw.vmx.rx_pages=4?
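
(As a sketch, before guessing names: the tx_pages/rx_pages knobs quoted above are bce(4)-specific, so checking what the vmx(4) driver actually exposes on the running system is safer than assuming equivalent tunables exist.)

# List vmx(4) tunables/sysctls with their descriptions; whatever the driver
# prints is authoritative, the names below are only where to look.
sysctl -d hw.vmx 2>/dev/null
sysctl -d dev.vmx.0
man 4 vmx        # the man page documents any loader.conf tunables the driver has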

Regards

Murali

From: Cheng Cui <cc@freebsd.org>
Date: Friday, 30 June 2023 at 10:02 PM
To: Murali Krishnamurthy <muralik1@vmware.com>
Cc: Scheffenegger, Richard <rscheff@freebsd.org>, FreeBSD Transport <freebsd-transport@freebsd.org>
Subject: Re: FreeBSD TCP (with iperf3) comparison with Linux


I used an emulation testbed from Emulab.net, with a Dummynet traffic shaper adding 100 ms of RTT
between the two nodes. The link capacity is 1 Gbps and both nodes are running FreeBSD 13.2.

cc@s1:~ % ping -c 3 r1
PING r1-link1 (10.1.1.3): 56 data bytes
64 bytes from 10.1.1.3: icmp_seq=0 ttl=64 time=100.091 ms
64 bytes from 10.1.1.3: icmp_seq=1 ttl=64 time=99.995 ms
64 bytes from 10.1.1.3: icmp_seq=2 ttl=64 time=99.979 ms

--- r1-link1 ping statistics ---
3 packets transmitted, 3 packets received, 0.0% packet loss
round-trip min/avg/max/stddev = 99.979/100.022/100.091/0.049 ms
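
(For anyone reproducing the setup outside Emulab, a minimal Dummynet sketch; the pipe numbers and the em0 interface name are assumptions, and Emulab normally configures this for you.)

# On the node acting as the delay/bandwidth shaper:
kldload dummynet
ipfw pipe 1 config bw 1Gbit/s delay 50ms    # 50 ms each direction ~= 100 ms RTT
ipfw pipe 2 config bw 1Gbit/s delay 50ms
ipfw add 100 pipe 1 ip from any to any out via em0
ipfw add 200 pipe 2 ip from any to any in via em0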


cc@s1:~ % iperf3 -c r1 -t 10 -i 1 -C cubic
Connecting to host r1, port 5201
[  5] local 10.1.1.2 port 56089 connected to 10.1.1.3 port 5201
[ ID] Interval           Transfer     Bitrate         Retr  Cwnd
[  5]   0.00-1.00   sec  4.19 MBytes  35.2 Mbits/sec    0   1.24 MBytes
[  5]   1.00-2.00   sec  56.5 MBytes   474 Mbits/sec    6   2.41 MBytes
[  5]   2.00-3.00   sec  58.6 MBytes   492 Mbits/sec   18   7.17 MBytes
[  5]   3.00-4.00   sec  65.6 MBytes   550 Mbits/sec   14    606 KBytes
[  5]   4.00-5.00   sec  60.8 MBytes   510 Mbits/sec   18   7.22 MBytes
[  5]   5.00-6.00   sec  62.1 MBytes   521 Mbits/sec   12   7.86 MBytes
[  5]   6.00-7.00   sec  60.9 MBytes   512 Mbits/sec   14   3.43 MBytes
[  5]   7.00-8.00   sec  62.8 MBytes   527 Mbits/sec   16    372 KBytes
[  5]   8.00-9.00   sec  59.3 MBytes   497 Mbits/sec   14   1.77 MBytes
[  5]   9.00-10.00  sec  57.0 MBytes   477 Mbits/sec   18   7.13 MBytes
- - - - - - - - - - - - - - - - - - - - - - - - -
[ ID] Interval           Transfer     Bitrate         Retr
[  5]   0.00-10.00  sec   548 MBytes   459 Mbits/sec  130             sender
[  5]   0.00-10.10  sec   540 MBytes   449 Mbits/sec                  receiver

iperf Done.
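
(As a sketch of one way to confirm whether such "Retr" counts involve SACK recovery at all: snapshot the TCP stats around a run and diff them afterwards.)

netstat -sp tcp > /tmp/tcp.before
iperf3 -c r1 -t 10 -i 1 -C cubic
netstat -sp tcp > /tmp/tcp.after
diff /tmp/tcp.before /tmp/tcp.after | egrep "retrans|SACK"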

cc@s1:~ % ifconfig bce4
bce4: flags=8843<UP,BROADCAST,RUNNING,SIMPLEX,MULTICAST> metric 0 mtu 1500
        options=c01bb<RXCSUM,TXCSUM,VLAN_MTU,VLAN_HWTAGGING,JUMBO_MTU,VLAN_HWCSUM,TSO4,VLAN_HWTSO,LINKSTATE>
        ether 00:10:18:56:94:d4
        inet 10.1.1.2 netmask 0xffffff00 broadcast 10.1.1.255
        media: Ethernet 1000baseT <full-duplex>
        status: active
        nd6 options=29<PERFORMNUD,IFDISABLED,AUTO_LINKLOCAL>

I believe the default values for bce tx/rx pages are 2. And I happened to find
this problem before: when the tx queue was full, the driver would not enqueue packets
and started returning errors.
And this error was misunderstood by the TCP layer as retransmission.

After adding hw.bce.tx_pages=4 and hw.bce.rx_pages=4 to /boot/loader.conf and rebooting:

cc@s1:~ % iperf3 -c r1 -t 10 -i 1 -C cubic
Connecting to host r1, port 5201
[  5] local 10.1.1.2 port 20478 connected to 10.1.1.3 port 5201
[ ID] Interval           Transfer     Bitrate         Retr  Cwnd
[  5]   0.00-1.00   sec  4.15 MBytes  34.8 Mbits/sec    0   1.17 MBytes
[  5]   1.00-2.00   sec  83.1 MBytes   697 Mbits/sec    0   12.2 MBytes
[  5]   2.00-3.00   sec   112 MBytes   939 Mbits/sec    0   12.2 MBytes
[  5]   3.00-4.00   sec   113 MBytes   944 Mbits/sec    0   12.2 MBytes
[  5]   4.00-5.00   sec   112 MBytes   940 Mbits/sec    0   12.2 MBytes
[  5]   5.00-6.00   sec   112 MBytes   942 Mbits/sec    0   12.2 MBytes
[  5]   6.00-7.00   sec   112 MBytes   938 Mbits/sec    0   12.2 MBytes
[  5]   7.00-8.00   sec   113 MBytes   944 Mbits/sec    0   12.2 MBytes
[  5]   8.00-9.00   sec   112 MBytes   938 Mbits/sec    0   12.2 MBytes
[  5]   9.00-10.00  sec   113 MBytes   947 Mbits/sec    0   12.2 MBytes
- - - - - - - - - - - - - - - - - - - - - - - - -
[ ID] Interval           Transfer     Bitrate         Retr
[  5]   0.00-10.00  sec   985 MBytes   826 Mbits/sec    0             sender
[  5]   0.00-10.11  sec   982 MBytes   815 Mbits/sec                  receiver

iperf Done.
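
(For completeness, the loader.conf lines used above, plus a sketch of how one might verify them after the reboot; whether the driver exports these tunables as readable sysctls is an assumption.)

# /boot/loader.conf
hw.bce.tx_pages=4
hw.bce.rx_pages=4

# after reboot, assuming the tunables are exported read-only:
sysctl hw.bce.tx_pages hw.bce.rx_pages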

Best Regards,

Cheng Cui

On Fri, Jun 30, 2023 at 12:26 PM Murali Krishnamurthy <muralik1@vmware.com> wrote:

Richard,

Appreciate the useful inputs you have shared so far. Will try to figure out where the packet drops are happening.

Regarding HyStart, I see the FreeBSD code base also has support for this. May I know when we can expect it in a release, if it is not already available?

Regarding this point: "Switching to other cc modules may give some more insights. But again, I suspect that momentary (microsecond) burstiness of BSD may be causing this significantly higher loss rate."

Is there some info somewhere where I can understand this in more detail?

Regards

Murali

On 30/06/23, 9:35 PM, "owner-freebsd-transport@freebsd.org" <owner-freebsd-transport@freebsd.org> wrote:

Hi Murali,

> Q. Since you mention two hypervisors - what is the physical network topology in between these two servers? What theoretical link rates would be attainable?

>
> Here is the topology
>
> Iperf endpoints are on 2 different hypervisors.
>

>    ___________      _______________              ___________      _______________
>   | Linux VM1 |    |  BSD 13 VM 1  |            | Linux VM2 |    |  BSD 13 VM 2  |
>   |___________|    |_______________|            |___________|    |_______________|
>  |                                  |          |                                  |
>  |         ESX Hypervisor 1         |          |         ESX Hypervisor 2         |
>  |__________________________________|          |__________________________________|
>                   |     10G link connected via L2 Switch      |
>                   |___________________________________________|

>
> The NIC is of 10G capacity on both ESX servers and it has the below config.

So, when both VMs run on the same hypervisor, maybe with another VM to simulate the 100 ms delay, can you attain a lossless baseline scenario?


> BDP for a 16 MB socket buffer: 16 MB * (1000 ms / 100 ms latency) * 8 bits / 1024 = 1.25 Gbps

>
> So theoretically we should see close to 1.25 Gbps of bitrate, and we see Linux reaching close to this number.


Under no loss, yes.
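
(A quick back-of-the-envelope check of that ceiling, as a sketch; the window-limited throughput is simply socket buffer divided by RTT.)

echo "16 * 8 / 0.100 / 1024" | bc -l                 # = 1.25, in "Gbit/s" counting 1 Gbit = 1024 Mbit
echo "16 * 1024 * 1024 * 8 / 0.100 / 10^9" | bc -l   # ~= 1.34 Gbit/s in SI units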


> But BSD is not able to do that.
>
>
> Q. Did you run iperf3? Did the transmitting endpoint report any retransmissions between Linux or FBSD hosts?
>

> Yes, we used iperf3. I see Linux doing fewer retransmissions compared to BSD.
> On BSD, the best performance was around 600 Mbps of bitrate and the number of retransmissions seen for this was around 32K.
> On Linux, the best performance was around 1.15 Gbps of bitrate and the number of retransmissions seen for this was only 2K.
> So as you pointed out, the number of retransmissions in BSD could be the real issue here.

There are other cc modules available; but I believe one major deviation is that Linux can perform mechanisms like HyStart, ACKing every packet when the client detects slow start, and pacing to achieve more uniform packet transmissions.

I think the next step would be to find out which queue those packet discards are coming from (external switch? delay generator? vSwitch? Ethernet stack inside the VM?)

Or alternatively, provide your ESX hypervisors with vastly more link speed, to rule out any L2-induced packet drops - provided your delay generator is not the source when momentarily overloaded.

> Is there a way to reduce this packet loss by fine-tuning some parameters w.r.t. the ring buffer or any other areas?

Finding where these drops arise (looking at queue and port counters) would be the next step. But this is not really my specific area of expertise beyond the high-level, vendor-independent observations.
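
(From inside the FreeBSD guest, a few generic counters worth watching, as a sketch; the hypervisor/vSwitch side needs ESX tooling.)

netstat -i -nd                            # per-interface Ierrs/Oerrs/Drop
netstat -sp tcp | egrep "retrans|SACK"    # TCP retransmit vs. SACK recovery counters
netstat -sp ip | grep -i drop             # IP-layer drops
vmstat -i                                 # interrupt distribution across queues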

Switching to other cc modules may give some more insights. But again, I suspect that momentary (microsecond) burstiness of BSD may be causing this significantly higher loss rate.
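
(A sketch of how to experiment with other cc modules:)

sysctl net.inet.tcp.cc.available          # list the congestion-control modules currently available
kldload cc_htcp                           # load another one, e.g. H-TCP
sysctl net.inet.tcp.cc.algorithm=htcp     # make it the system default
# or pick it per run: iperf3 -c r1 -t 10 -C htcp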

TCP RACK would be another option. That stack has pacing, more fine-grained timing, the RACK loss recovery mechanisms, etc. Maybe that helps reduce the packet drops observed by iperf and, consequently, yields a higher overall throughput.
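
(A sketch of enabling the RACK stack on 13.x; this assumes a kernel built with the TCPHPTS option, otherwise the module will not load.)

kldload tcp_rack
sysctl net.inet.tcp.functions_available       # should now list "rack"
sysctl net.inet.tcp.functions_default=rack    # new TCP connections use the RACK stack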

