Date:      Mon, 8 Dec 2025 08:33:25 -0800
From:      Adrian Chadd <adrian@freebsd.org>
To:        Kajetan Staszkiewicz <vegeta@tuxpowered.net>
Cc:        Konstantin Belousov <kostikbel@gmail.com>, net@freebsd.org
Subject:   Re: RSS causing bad forwarding performance?
Message-ID:  <CAJ-VmokOkfAprkPcuomXF4cMM7uRKx-=t-cP1T5+vKNfk-nSaA@mail.gmail.com>
In-Reply-To: <f39ea972-da4b-4310-b5f0-7a6133ca0f0c@tuxpowered.net>
References:  <db616c5b-55bb-40dc-aa70-0b132b06bb75@tuxpowered.net> <aTYT1zK__eQfBZ0M@kib.kiev.ua> <f39ea972-da4b-4310-b5f0-7a6133ca0f0c@tuxpowered.net>

On Mon, 8 Dec 2025 at 06:24, Kajetan Staszkiewicz <vegeta@tuxpowered.net> wrote:
>
> On 2025-12-08 00:55, Konstantin Belousov wrote:
>
> > It is somewhat strange that the results with and without RSS differ
> > for UDP. The mlx5en driver always enables hashing packets into rx
> > queues, and with a single UDP stream I would expect all packets to
> > hit the same queue.
> With a single UDP stream and RSS disabled, the DUT gets 2 CPU cores
> loaded. One sits at 100%; I understand this is where the interrupts
> for incoming packets land, and that core handles receiving, forwarding
> and sending the packet (with direct ISR dispatch). The other sits at
> around 15-20%; my best guess is that it's handling the interrupts that
> confirm completion of packets sent out through the other NIC.
>
> With a single UDP stream and RSS enabled, the DUT gets only 1 CPU core
> loaded. I understand that, thanks to RSS, the outbound queue on mce1
> is the same as the inbound queue on mce0, and thus the same CPU core
> handles the interrupts for both queues.
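
For reference, what RSS does here can be sketched in a few lines of
Python. This is a minimal illustration of the Toeplitz hash and a
simplified queue mapping, not the mlx5en implementation: the key is the
well-known verification key from Microsoft's RSS spec (the key actually
programmed into the NIC may differ), the addresses are placeholders,
and the modulo stands in for the NIC's indirection table.

    import socket
    import struct

    # The 40-byte RSS verification key from Microsoft's RSS spec.
    KEY = bytes([
        0x6d, 0x5a, 0x56, 0xda, 0x25, 0x5b, 0x0e, 0xc2,
        0x41, 0x67, 0x25, 0x3d, 0x43, 0xa3, 0x8f, 0xb0,
        0xd0, 0xca, 0x2b, 0xcb, 0xae, 0x7b, 0x30, 0xb4,
        0x77, 0xcb, 0x2d, 0xa3, 0x80, 0x30, 0xf2, 0x0c,
        0x6a, 0x42, 0xb7, 0x3b, 0xbe, 0xac, 0x01, 0xfa,
    ])

    def toeplitz(key: bytes, data: bytes) -> int:
        """For every set bit i of the input (MSB first), XOR in the
        32 key bits that start at bit position i."""
        keyval = int.from_bytes(key, "big")
        keybits = 8 * len(key)
        h = 0
        for i in range(8 * len(data)):
            if data[i // 8] & (0x80 >> (i % 8)):
                h ^= (keyval >> (keybits - 32 - i)) & 0xFFFFFFFF
        return h

    def flow_hash(sport: int) -> int:
        # IPv4/UDP hash input: src addr, dst addr, src port, dst port,
        # in network byte order (assuming the NIC hashes UDP ports at
        # all; some hash UDP traffic on the addresses only).
        data = (socket.inet_aton("16.0.0.1")       # placeholder src
                + socket.inet_aton("48.0.0.1")     # placeholder dst
                + struct.pack(">HH", sport, 12))
        return toeplitz(KEY, data)

    # One constant flow -> one constant hash -> one queue, one core:
    print(hex(flow_hash(1025)))
    # Vary any tuple field and the flows spread across queues (the
    # modulo is a stand-in for the indirection table lookup):
    for sport in range(1025, 1029):
        print(sport, "-> queue", flow_hash(sport) % 8)

That constant mapping is exactly what pins the whole benchmark stream,
RX and TX alike, onto one core.
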
>
> > As a consequence, the results with and without RSS should be the
> > same (low).
>
> It is low without RSS, but with RSS it's not just low, it's terrible.
>
> > Were it UDP encapsulating some other traffic, e.g. a tunnel that
> > can be further classified by its inner headers, like the inner
> > headers of VXLAN, then more than one receive queue could be used.
>
> The script stl/udp_1pkt_simple.py (provided with TRex) creates UDP
> packets from port 1025 to port 12, with a 10-byte payload filled with
> 0x78. My goal is to test packets-per-second performance, so I've
> chosen this test because it creates very short packets.
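
For readers without TRex at hand, that stream boils down to roughly the
following Scapy construction (the addresses are placeholders for
whatever the script's defaults are; the point is that every header
field is constant):

    from scapy.all import Ether, IP, UDP, Raw

    # One fixed 4-tuple and 10 bytes of 0x78 ('x') payload: every
    # packet of the stream is identical on the wire.
    pkt = (Ether()
           / IP(src="16.0.0.1", dst="48.0.0.1")  # placeholder addresses
           / UDP(sport=1025, dport=12)
           / Raw(load=b"\x78" * 10))

Since the 4-tuple never changes, every packet hashes to the same value,
and the whole stream is pinned to a single receive queue no matter how
many queues the NIC exposes.
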
>
> > BTW, mce cards support a huge number of offloads, but all of them
> > are host-oriented; they would not help with forwarding.
>
> > Again, that is because a single iperf stream would hit a single
> > send/receive queue. Parallel iperfs between the same machines scale.
>
> It seems that parallel streams forwarded through the machine scale
> too. It's a single stream that kills it, and only with the RSS option
> enabled.

RSS was never really designed to optimise a single flow by having it
consume two CPU cores.

It was designed to optimise a /whole lot of flows/ by directing them to
a consistent CPU mapping and, when used in conjunction with CPU
selection on the transmit side, to avoid cross-CPU
locking/synchronisation entirely.

It doesn't help that the RSS defaults (i.e. only one netisr thread and,
IIRC, not hybrid dispatch mode, etc.) are not the best for lots of
flows.
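
For what it's worth, the knobs in question look roughly like this on a
kernel built with "options RSS". This is a sketch of what to experiment
with, not a recommendation; the right values depend on the workload:

    # /boot/loader.conf
    net.isr.maxthreads=-1   # one netisr thread per core (default is 1)
    net.isr.bindthreads=1   # pin each netisr thread to its CPU

    # and at runtime, hybrid instead of pure direct dispatch:
    # sysctl net.isr.dispatch=hybrid
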

So in short, I think you're testing the wrong thing.



-adrian

