Date: Fri, 30 Jun 2023 16:26:39 +0000
From: Murali Krishnamurthy <muralik1@vmware.com>
To: "Scheffenegger, Richard" <rscheff@freebsd.org>, FreeBSD Transport <freebsd-transport@freebsd.org>
Subject: Re: FreeBSD TCP (with iperf3) comparison with Linux
Message-ID: <PH0PR05MB100642BD041192E6B7EBDBFE1FB2AA@PH0PR05MB10064.namprd05.prod.outlook.com>
In-Reply-To: <53aff274-b1a8-0730-6971-2755c7e7b688@freebsd.org>
References: <53aff274-b1a8-0730-6971-2755c7e7b688@freebsd.org>
Richard,

Appreciate the useful inputs you have shared so far. I will try to figure out where the packet drops are coming from.

Regarding HyStart: I see that the FreeBSD code base also has support for this. May I know in which release we can expect it, if it is not already available?

Regarding this point: "Switching to other cc modules may give some more insights. But again, I suspect that momentary (microsecond) burstiness of BSD may be causing this significantly higher loss rate."

Is there somewhere I can read up on this in more detail?

Regards,
Murali

On 30/06/23, 9:35 PM, "owner-freebsd-transport@freebsd.org" <owner-freebsd-transport@freebsd.org> wrote:

Hi Murali,

> Q. Since you mention two hypervisors - what is the physical network topology in between these two servers? What theoretical link rates would be attainable?
>
> Here is the topology
>
> Iperf endpoints are on 2 different hypervisors.
>
>  ___________   ________________           ___________   ________________
> | Linux VM1 | | BSD 13 VM 1    |         | Linux VM2 | | BSD 13 VM 2    |
> |___________| |________________|         |___________| |________________|
>       |              |                         |              |
>  _____|______________|______              _____|______________|______
> |    ESX Hypervisor 1       |            |    ESX Hypervisor 2       |
> |                           |------------|                           |
> |___________________________|            |___________________________|
>                 10G link connected via L2 Switch
>
> The NIC is of 10G capacity on both ESX servers, and it has the below config.

So, when both VMs run on the same hypervisor, maybe with another VM to simulate the 100 ms delay, can you attain a lossless baseline scenario?
> BDP for 16 MB socket buffer: (16 MB * 8 bits/byte) / 100 ms latency ≈ 1.25 Gbps
>
> So theoretically we should see close to 1.25 Gbps of bitrate, and we see Linux reaching close to this number.

Under no loss, yes.

> But BSD is not able to do that.
>
> Q. Did you run iperf3? Did the transmitting endpoint report any retransmissions between Linux or FBSD hosts?
>
> Yes, we used iperf3. I see Linux doing fewer retransmissions compared to BSD.
> On BSD, the best performance was around 600 Mbps bitrate, and the number of retransmissions seen at that rate was around 32K.
> On Linux, the best performance was around 1.15 Gbps bitrate, and the number of retransmissions seen at that rate was only 2K.
> So, as you pointed out, the number of retransmissions in BSD could be the real issue here.

There are other cc modules available; but I believe one major deviation is that Linux can perform mechanisms like HyStart, ACKing every packet when the client detects slow start, and pacing to achieve more uniform packet transmissions.

I think the next step would be to find out which queue those packet discards are coming from (external switch? delay generator? vSwitch? Ethernet stack inside the VM?).

Or, alternatively, provide your ESX hypervisors with vastly more link speed, to rule out any L2-induced packet drops - provided your delay generator is not the source when momentarily overloaded.

> Is there a way to reduce this packet loss by fine-tuning some parameters w.r.t. the ring buffer or any other areas?

Finding where these arise (looking at queue and port counters) would be the next step. But this is not really my specific area of expertise, beyond the high-level, vendor-independent observations.

Switching to other cc modules may give some more insights. But again, I suspect that momentary (microsecond) burstiness of BSD may be causing this significantly higher loss rate.

TCP RACK would be another option.
That stack has pacing, more fine-grained timing, the RACK loss recovery mechanisms, etc. Maybe that helps reduce the packet drops observed by iperf and, consequently, yields a higher overall throughput.
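Since the discussion hinges on the 1.25 Gbps ceiling and on the burstiness hypothesis, both numbers can be sanity-checked with back-of-the-envelope arithmetic. This is only a sketch using the values quoted in the thread (16 MB socket buffer, 100 ms RTT, MTU-sized 1500-byte segments, 10G NIC line rate), not a measurement:

```python
# 1) BDP-limited throughput: a 16 MB socket buffer over a 100 ms RTT
#    caps achievable throughput at buffer / RTT.
buffer_bytes = 16 * 1024 * 1024            # 16 MB socket buffer
rtt_s = 0.100                              # 100 ms round-trip time
bdp_gbps = buffer_bytes * 8 / rtt_s / 1024**3   # "binary" Gbit/s
print(f"BDP-limited rate: {bdp_gbps:.2f} Gbit/s")       # 1.25 Gbit/s

# 2) Pacing vs. bursts: at the BDP-limited rate a paced sender spaces
#    1500-byte segments evenly, while an unpaced sender can emit a whole
#    window back-to-back at 10G line rate - which is the microsecond-scale
#    burstiness that can overflow shallow switch/vswitch queues.
seg_bits = 1500 * 8
paced_gap_us = seg_bits / 1.25e9 * 1e6     # gap between paced segments
burst_gap_us = seg_bits / 10e9 * 1e6       # gap inside a line-rate burst
print(f"paced gap: {paced_gap_us:.1f} us")  # 9.6 us per segment
print(f"burst gap: {burst_gap_us:.1f} us")  # 1.2 us per segment
```

The factor of eight between the two inter-segment gaps is one way to picture why an unpaced sender at the same average rate sees far more queue drops.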
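For readers wanting to try the two suggestions from the reply (other cc modules, the RACK stack), a rough sketch of the stock FreeBSD 13 knobs follows. This is an illustration, not part of the original thread; verify the module and sysctl names against your release before relying on them:

```shell
# Switch the default congestion control module (e.g. to CUBIC):
kldload cc_cubic                          # load the module if not built in
sysctl net.inet.tcp.cc.available          # list loaded cc modules
sysctl net.inet.tcp.cc.algorithm=cubic    # default for new connections

# Enable the RACK TCP stack (needs a kernel built with TCPHPTS):
kldload tcp_rack
sysctl net.inet.tcp.functions_available   # list available TCP stacks
sysctl net.inet.tcp.functions_default=rack  # new connections use RACK
```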