Date: Fri, 19 Jul 2024 12:07:15 +0900
From: Junho Choi <junho.choi@gmail.com>
To: tuexen@freebsd.org
Cc: Alan Somers <asomers@freebsd.org>, FreeBSD Net <freebsd-net@freebsd.org>
Subject: Re: TCP Success Story (was Re: TCP_RACK, TCP_BBR, and firewalls)
Message-ID: <CAJ5e%2BHAV%2BH_j5g9MEyTU_2A=_xi04JL5x7jPt1UEYd8Q-EyGnA@mail.gmail.com>
In-Reply-To: <400A46A2-E75F-4BE3-BFFF-340CF4557322@freebsd.org>
References: <CAOtMX2iLv5OW4jQiBOHqMvcqkQSznTyO-eWMrOcHWbpeyaeRsg@mail.gmail.com> <C7467BCD-7232-4C6C-873E-EEC2482214A7@freebsd.org> <CAOtMX2hGYfm0U0L25-vHSX0iOyKCbZydaAzye6Y6U59mQeF7rA@mail.gmail.com> <B86DCBA6-542F-4951-A726-3A66D3D640D6@freebsd.org> <CAJ5e%2BHAvNbazCkd_G_E=QojqknQe23khCimyKWk=TTyzHr2j0Q@mail.gmail.com> <B2A62C1B-9BD4-4F82-A296-07A3B41CA402@freebsd.org> <CAOtMX2i5-7=qvPyb-tbJjkKwSKv6mawxZ-jeHG9UaPi2AY6CRg@mail.gmail.com> <400A46A2-E75F-4BE3-BFFF-340CF4557322@freebsd.org>
RACK is a loss detection algorithm and BBR is a congestion control algorithm,
so they sit on different layers; e.g., Linux can configure them independently.

However, in FreeBSD it looks like both are selected through the same sysctl
(net.inet.tcp.functions_default=tcp_rack|tcp_bbr), so it is not possible to
set both.

Is there any plan to improve this? Or does tcp_bbr include tcp_rack's loss
probe behavior?

I'm a little confused.

Best,

On Fri, Jul 19, 2024 at 4:23 AM <tuexen@freebsd.org> wrote:

> > On 18. Jul 2024, at 20:37, Alan Somers <asomers@freebsd.org> wrote:
> >
> > Coexist how? Do you mean that one socket can use one and a different
> > socket uses the other? That makes sense.
> Correct.
>
> Best regards
> Michael
> >
> > On Thu, Jul 18, 2024 at 10:34 AM <tuexen@freebsd.org> wrote:
> >>
> >>> On 18. Jul 2024, at 15:00, Junho Choi <junho.choi@gmail.com> wrote:
> >>>
> >>> Alan - this is a great result to see. Thanks for experimenting.
> >>>
> >>> Just curious why bbr and rack don't co-exist? Those are two separate things.
> >>> Is it a current bug or by design?
> >> Technically RACK and BBR can coexist. The problem was with pf and/or LRO.
> >>
> >> But this is all fixed now in 14.1 and head.
> >>
> >> Best regards
> >> Michael
> >>>
> >>> BR,
> >>>
> >>> On Thu, Jul 18, 2024 at 5:27 AM <tuexen@freebsd.org> wrote:
> >>>> On 17. Jul 2024, at 22:00, Alan Somers <asomers@freebsd.org> wrote:
> >>>>
> >>>> On Sat, Jul 13, 2024 at 1:50 AM <tuexen@freebsd.org> wrote:
> >>>>>
> >>>>>> On 13. Jul 2024, at 01:43, Alan Somers <asomers@FreeBSD.org> wrote:
> >>>>>>
> >>>>>> I've been experimenting with RACK and BBR. In my environment, they
> >>>>>> can dramatically improve single-stream TCP performance, which is
> >>>>>> awesome. But pf interferes. I have to disable pf in order for them
> >>>>>> to work at all.
> >>>>>>
> >>>>>> Is this a known limitation? If not, I will experiment some more to
> >>>>>> determine exactly what aspect of my pf configuration is responsible.
> >>>>>> If so, can anybody suggest what changes would have to happen to make
> >>>>>> the two compatible?
> >>>>> A problem with same symptoms was already reported and fixed in
> >>>>> https://reviews.freebsd.org/D43769
> >>>>>
> >>>>> Which version are you using?
> >>>>>
> >>>>> Best regards
> >>>>> Michael
> >>>>>>
> >>>>>> -Alan
> >>>>
> >>>> TLDR; tcp_rack is good, cc_chd is better, and tcp_bbr is best
> >>>>
> >>>> I want to follow up with the list to post my conclusions. Firstly
> >>>> tuexen@ helped me solve my problem: in FreeBSD 14.0 there is a 3-way
> >>>> incompatibility between (tcp_bbr || tcp_rack) && lro && pf. I can
> >>>> confirm that tcp_bbr works for me if I either disable LRO, disable PF,
> >>>> or switch to a 14.1 server.
> >>>>
> >>>> Here's the real problem: on multiple production servers, downloading
> >>>> large files (or ZFS send/recv streams) was slow. After ruling out
> >>>> many possible causes, wireshark revealed that the connection was
> >>>> suffering about 0.05% packet loss. I don't know the source of that
> >>>> packet loss, but I don't believe it to be congestion-related. Along
> >>>> with a 54ms RTT, that's a fatal combination for the throughput of
> >>>> loss-based congestion control algorithms. According to the Mathis
> >>>> Formula [1], I could only expect 1.1 MBps over such a connection.
> >>>> That's actually worse than what I saw. With default settings
> >>>> (cc_cubic), I averaged 5.6 MBps. Probably Mathis's assumptions are
> >>>> outdated, but that's still pretty close for such a simple formula
> >>>> that's 27 years old.
> >>>>
> >>>> So I benchmarked all available congestion control algorithms for
> >>>> single download streams. The results are summarized in the table
> >>>> below.
> >>>>
> >>>> Algo     Packet Loss Rate    Average Throughput
> >>>> vegas    0.05%               2.0 MBps
> >>>> newreno  0.05%               3.2 MBps
> >>>> cubic    0.05%               5.6 MBps
> >>>> hd       0.05%               8.6 MBps
> >>>> cdg      0.05%               13.5 MBps
> >>>> rack     0.04%               14 MBps
> >>>> htcp     0.05%               15 MBps
> >>>> dctcp    0.05%               15 MBps
> >>>> chd      0.05%               17.3 MBps
> >>>> bbr      0.05%               29.2 MBps
> >>>> cubic    10%                 159 kBps
> >>>> chd      10%                 208 kBps
> >>>> bbr      10%                 5.7 MBps
> >>>>
> >>>> RACK seemed to achieve about the same maximum bandwidth as BBR, though
> >>>> it took a lot longer to get there. Also, with RACK, wireshark
> >>>> reported about 10x as many retransmissions as dropped packets, which
> >>>> is suspicious.
> >>>>
> >>>> At one point, something went haywire and packet loss briefly spiked to
> >>>> the neighborhood of 10%. I took advantage of the chaos to repeat my
> >>>> measurements. As the table shows, all algorithms sucked under those
> >>>> conditions, but BBR sucked impressively less than the others.
> >>>>
> >>>> Disclaimer: there was significant run-to-run variation; the presented
> >>>> results are averages. And I did not attempt to measure packet loss
> >>>> exactly for most runs; 0.05% is merely an average of a few selected
> >>>> runs. These measurements were taken on a production server running a
> >>>> real workload, which introduces noise. Soon I hope to have the
> >>>> opportunity to repeat the experiment on an idle server in the same
> >>>> environment.
> >>>>
> >>>> In conclusion, while we'd like to use BBR, we really can't until we
> >>>> upgrade to 14.1, which hopefully will be soon. So in the meantime
> >>>> we've switched all relevant servers from cubic to chd, and we'll
> >>>> reevaluate BBR after the upgrade.
> >>> Hi Alan,
> >>>
> >>> just to be clear: the version of BBR currently implemented is
> >>> BBR version 1, which is known to be unfair in certain scenarios.
> >>> Google is still working on BBR to address this problem and improve
> >>> it in other aspects. But there is no RFC yet and the updates haven't
> >>> been implemented yet in FreeBSD.
> >>>
> >>> Best regards
> >>> Michael
> >>>>
> >>>> [1]: https://www.slac.stanford.edu/comp/net/wan-mon/thru-vs-loss.html
> >>>>
> >>>> -Alan
> >>>
> >>> --
> >>> Junho Choi <junho dot choi at gmail.com> | https://saturnsoft.net
> >>
> >
>

-- 
Junho Choi <junho dot choi at gmail.com> | https://saturnsoft.net
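For reference, the Mathis estimate quoted in the thread (54 ms RTT, about 0.05% loss) can be roughly reproduced with the short Python sketch below. It uses the simplified form of the bound, rate <= (MSS / RTT) * (C / sqrt(p)). The thread does not state which MSS or constant was used, so the MSS of 1460 bytes and C = 1 are assumptions, and the function name mathis_throughput is made up for illustration.

```python
import math


def mathis_throughput(mss_bytes: float, rtt_s: float, loss_rate: float,
                      c: float = 1.0) -> float:
    """Simplified Mathis bound on TCP throughput, in bytes per second.

    rate <= (MSS / RTT) * (C / sqrt(p))

    mss_bytes and c are assumptions here; the thread only gives RTT and loss.
    """
    if mss_bytes <= 0 or rtt_s <= 0:
        raise ValueError("MSS and RTT must be positive")
    if not 0.0 < loss_rate <= 1.0:
        raise ValueError("loss_rate must be in (0, 1]")
    return (mss_bytes / rtt_s) * (c / math.sqrt(loss_rate))


if __name__ == "__main__":
    MSS_BYTES = 1460   # assumed: typical Ethernet MSS without TCP options
    RTT_S = 0.054      # 54 ms, from the thread
    for loss in (0.0005, 0.10):  # the two loss rates in the table
        rate = mathis_throughput(MSS_BYTES, RTT_S, loss)
        print(f"loss={loss:.2%}  bound ~ {rate / 1e6:.2f} MB/s")
```

With these assumptions the 0.05% case comes out at roughly 1.2 MB/s, in the same ballpark as the 1.1 MBps mentioned above; a slightly different MSS, constant, or unit convention would account for the gap. The bound only models loss-based congestion control, so it says nothing about what BBR should achieve.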
Want to link to this message? Use this URL: <https://mail-archive.FreeBSD.org/cgi/mid.cgi?CAJ5e%2BHAV%2BH_j5g9MEyTU_2A=_xi04JL5x7jPt1UEYd8Q-EyGnA>