Date: Mon, 24 May 2021 23:34:16 -0700
From: Kevin Bowling <kevin.bowling@kev009.com>
To: Vincenzo Maffione <vmaffione@freebsd.org>
Cc: Francois ten Krooden <ftk@nanoteq.com>, Jacques Fourie <jacques.fourie@gmail.com>, Marko Zec <zec@fer.hr>, "freebsd-net@freebsd.org" <freebsd-net@freebsd.org>
Subject: Re: Vector Packet Processing (VPP) portability on FreeBSD
Message-ID: <CAK7dMtB37iN0HQMuwX-Fk=SR+nc4fZLa-N863+NOZe9d1ebG_g@mail.gmail.com>
In-Reply-To: <CAK7dMtDWor3KqdEshfaqUH2mgagU+vT2M6jgwAwKiNt9J1ec+w@mail.gmail.com>

The one other thing I want to mention: what this means in effect is that
every queue ends up limited by EITR on ixgbe (around 30k interrupts/s with
the default settings) whether it's a TX or RX workload.  This ends up
working ok if you have sufficient CPU, but seems awkward.  For the TX
workload we should need an order of magnitude fewer interrupts to do 10G.
There was some work to adapt AIM to this new combined handler, but it is
not properly tuned and I'm not sure it should consider TX at all.
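
To put a rough number on that (back-of-the-envelope only, using the ~30k/s
figure above and the 14.88 Mpps line rate Marko quotes below; the exact
default cap may differ on your hardware):

/* Quick sanity check: packets each interrupt must cover at 10G line rate. */
#include <stdio.h>

int
main(void)
{
        const double intr_per_sec  = 30e3;      /* approx. ixgbe EITR default cap */
        const double line_rate_pps = 14.88e6;   /* 10G line rate, 64-byte frames */

        printf("packets per interrupt at line rate: %.0f\n",
            line_rate_pps / intr_per_sec);      /* prints roughly 500 */
        return (0);
}

So each interrupt has to cover on the order of 500 packets just to keep up
at 64-byte line rate, which also suggests a pure TX workload should get by
with far fewer wakeups, as noted above.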

Regards,
Kevin

On Mon, May 24, 2021 at 11:16 PM Kevin Bowling <kevin.bowling@kev009.com> wrote:
> I don't fully understand the issue, but in iflib_fast_intr_rxtx
> https://cgit.freebsd.org/src/tree/sys/net/iflib.c#n1581 it seems like
> we end up re-enabling interrupts as a matter of course instead of only
> handling spurious cases or some low-water threshold (which seems like
> it would be tricky to do here).  The idea is that we want to pace
> interrupts by disabling them in the msix_que handler, and then wait to
> re-enable them only when we have more work to do in the ift_task
> grouptask.
>
> It was a lot easier to reason about this with separate TX and RX
> interrupts.  Doing the combined TXRX is definitely a win in terms of
> reducing MSI-X vector usage (which is important in a lot of FreeBSD
> use cases), but it's tricky to understand.
>
> My time has been sucked away by work, so I haven't been looking at
> this problem in the depth I want to.  I'd be interested in discussing
> it further with anyone who is interested in it.
>
> Regards,
> Kevin
>
> On Tue, May 18, 2021 at 2:11 PM Vincenzo Maffione <vmaffione@freebsd.org> wrote:
> >
> > Il giorno mar 18 mag 2021 alle ore 09:32 Kevin Bowling <kevin.bowling@kev009.com> ha scritto:
> >>
> >> On Mon, May 17, 2021 at 10:20 AM Marko Zec <zec@fer.hr> wrote:
> >>>
> >>> On Mon, 17 May 2021 09:53:25 +0000
> >>> Francois ten Krooden <ftk@Nanoteq.com> wrote:
> >>>
> >>> > On 2021/05/16 09:22, Vincenzo Maffione wrote:
> >>> >
> >>> > > Hi,
> >>> > >   Yes, you are not using emulated netmap mode.
> >>> > >
> >>> > >   In the test setup depicted here
> >>> > > https://github.com/ftk-ntq/vpp/wiki/VPP-throughput-using-netmap-interfaces#test-setup
> >>> > > I think you should really try to replace VPP with the netmap
> >>> > > "bridge" application (tools/tools/netmap/bridge.c), and see what
> >>> > > numbers you get.
> >>> > >
> >>> > > You would run the application this way
> >>> > > # bridge -i ix0 -i ix1
> >>> > > and this will forward any traffic between ix0 and ix1 (in both
> >>> > > directions).
> >>> > >
> >>> > > These numbers would give you a better idea of where to look next
> >>> > > (e.g. VPP code improvements or system tuning such as NIC
> >>> > > interrupts, CPU binding, etc.).
> >>> >
> >>> > Thank you for the suggestion.
> >>> > I did run a test with the bridge this morning, and updated the
> >>> > results as well.
> >>> >
> >>> > +-------------+------------------+
> >>> > | Packet Size | Throughput (pps) |
> >>> > +-------------+------------------+
> >>> > |   64 bytes  |    7.197 Mpps    |
> >>> > |  128 bytes  |    7.638 Mpps    |
> >>> > |  512 bytes  |    2.358 Mpps    |
> >>> > | 1280 bytes  |  964.915 kpps    |
> >>> > | 1518 bytes  |  815.239 kpps    |
> >>> > +-------------+------------------+
> >>>
> >>> I assume you're on 13.0, where netmap throughput is lower compared to
> >>> 11.x due to the migration of most drivers to iflib (apparently
> >>> increased overhead) and different driver defaults.  On 11.x I could
> >>> move 10G line rate from one ix to another at low CPU freqs, whereas
> >>> on 13.x the CPU must be set to max speed, and still can't do
> >>> 14.88 Mpps.
> >>
> >> I believe this issue is in the combined TXRX interrupt filter.  It is
> >> causing a bunch of unnecessary TX re-arms.
> >
> > Could you please elaborate on that?
> >
> > TX completion is indeed the one thing that changed considerably with
> > the porting to iflib, and this could be a major contributor to the
> > performance drop.
> > My understanding is that TX interrupts are not really used anymore on
> > multi-gigabit NICs such as ix or ixl.  Instead, "softirqs" are used,
> > meaning that a timer is used to perform TX completion.  I don't know
> > what the motivations were for this design decision.
> > I had to decrease the timer period to 90us to ensure timely completion
> > (see https://bugs.freebsd.org/bugzilla/show_bug.cgi?id=248652).
> > However, the timer period is currently not adaptive.
> >
> >>>
> >>> #1 thing which changed: the default number of packets per ring
> >>> dropped from 2048 (11.x) to 1024 (13.x).  Try changing this in
> >>> /boot/loader.conf:
> >>>
> >>> dev.ixl.0.iflib.override_nrxds=2048
> >>> dev.ixl.0.iflib.override_ntxds=2048
> >>> dev.ixl.1.iflib.override_nrxds=2048
> >>> dev.ixl.1.iflib.override_ntxds=2048
> >>> etc.
> >>>
> >>> For me this increases the throughput of
> >>> bridge -i netmap:ixl0 -i netmap:ixl1
> >>> from 9.3 Mpps to 11.4 Mpps.
> >>>
> >>> #2: default interrupt moderation delays seem to be too long.
> >>> Combined with increasing the ring sizes, reducing dev.ixl.0.rx_itr
> >>> from 62 (default) to 40 increases the throughput further from 11.4
> >>> to 14.5 Mpps.
> >>>
> >>> Hope this helps,
> >>>
> >>> Marko
> >>>
> >>> > Apart from the 64-byte and 128-byte packets, the other sizes were
> >>> > matching the maximum rates possible on 10 Gbps.  This was when the
> >>> > bridge application was running on a single core, and the CPU core
> >>> > was maxing out at 100%.
> >>> >
> >>> > I think there might be a bit of system tuning needed, but I suspect
> >>> > most of the improvement would be needed in VPP.
> >>> >
> >>> > Regards
> >>> > Francois
> >>> _______________________________________________
> >>> freebsd-net@freebsd.org mailing list
> >>> https://lists.freebsd.org/mailman/listinfo/freebsd-net
> >>> To unsubscribe, send any mail to "freebsd-net-unsubscribe@freebsd.org"
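
To make the hold-off scheme I described in the quoted mail above a bit more
concrete, here is roughly the shape of it.  This is NOT the actual iflib
code (which uses grouptasks and its own filter/task split); struct my_queue
and the que_disable_intr()/que_enable_intr()/que_process() helpers are
placeholders, so treat it as a sketch only:

/* MSI-X filter routine: runs in interrupt context, so keep it minimal. */
static int
que_msix_filter(void *arg)
{
        struct my_queue *que = arg;     /* placeholder per-queue state */

        que_disable_intr(que);          /* mask the vector: hold off further interrupts */
        taskqueue_enqueue(que->tq, &que->task);
        return (FILTER_HANDLED);
}

/* Deferred task: do the real RX/TX work with the vector still masked. */
static void
que_task_fn(void *arg, int pending)
{
        struct my_queue *que = arg;

        while (que_process(que) != 0)   /* drain until there is no more work */
                ;
        que_enable_intr(que);           /* only now re-arm the vector */
}

As I read iflib_fast_intr_rxtx today, it re-arms much more eagerly than
that last step, which is how every queue ends up being paced by EITR
instead.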
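
On the netmap side, for anyone who wants to see what the bridge test quoted
above exercises without reading tools/tools/netmap/bridge.c: below is a
minimal copy-based sketch of the same idea, written against the standard
netmap user API (net/netmap_user.h).  It is not the real bridge.c, which
zero-copy swaps buffers and services all TX rings, so expect the real tool
to be faster:

#define NETMAP_WITH_LIBS
#include <net/netmap_user.h>
#include <sys/ioctl.h>
#include <poll.h>
#include <string.h>
#include <err.h>

/* Copy packets from src's RX rings into dst's first TX ring (one direction). */
static void
forward(struct nm_desc *src, struct nm_desc *dst)
{
        struct netmap_ring *tx = NETMAP_TXRING(dst->nifp, dst->first_tx_ring);
        u_int i;

        for (i = src->first_rx_ring; i <= src->last_rx_ring; i++) {
                struct netmap_ring *rx = NETMAP_RXRING(src->nifp, i);

                while (!nm_ring_empty(rx) && !nm_ring_empty(tx)) {
                        struct netmap_slot *rs = &rx->slot[rx->head];
                        struct netmap_slot *ts = &tx->slot[tx->head];

                        /* bridge.c swaps buffer indices instead of copying */
                        memcpy(NETMAP_BUF(tx, ts->buf_idx),
                            NETMAP_BUF(rx, rs->buf_idx), rs->len);
                        ts->len = rs->len;
                        rx->head = rx->cur = nm_ring_next(rx, rx->head);
                        tx->head = tx->cur = nm_ring_next(tx, tx->head);
                }
        }
}

int
main(void)
{
        /* "ix0"/"ix1" as in the example quoted above; adjust to your NICs. */
        struct nm_desc *d0 = nm_open("netmap:ix0", NULL, 0, NULL);
        struct nm_desc *d1 = nm_open("netmap:ix1", NULL, 0, NULL);
        struct pollfd pfd[2];

        if (d0 == NULL || d1 == NULL)
                err(1, "nm_open");
        pfd[0].fd = d0->fd; pfd[0].events = POLLIN;
        pfd[1].fd = d1->fd; pfd[1].events = POLLIN;

        for (;;) {
                poll(pfd, 2, 1000);                     /* sync RX rings */
                forward(d0, d1);                        /* ix0 -> ix1 */
                forward(d1, d0);                        /* ix1 -> ix0 */
                ioctl(d0->fd, NIOCTXSYNC, NULL);        /* flush TX rings */
                ioctl(d1->fd, NIOCTXSYNC, NULL);
        }
}

The point is mostly how little work the forwarding path itself does: the
numbers in the table above are dominated by the driver, ring sizes, and
interrupt moderation rather than by the application.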