Date:      Fri, 9 Jul 2021 09:16:59 +0000
From:      Wei Hu via freebsd-net <freebsd-net@freebsd.org>
To:        "freebsd-net@FreeBSD.org" <freebsd-net@freebsd.org>
Cc:        "imp@bsdimp.com" <imp@bsdimp.com>, Li-Wen Hsu <lwhsu@freebsd.org>
Subject:   Send path scaling problem
Message-ID:  <SI2P153MB044159D903A78C18B6181544BB189@SI2P153MB0441.APCP153.PROD.OUTLOOK.COM>

Hello,

I am working on a driver for a new SR-IOV NIC for FreeBSD VMs running on Hyper-V. The driver coding is largely complete. Performance testing shows some scaling problems on the send path, on which I am seeking some advice.

The NIC is 100 Gbps. I am running iperf2 as the client, generating TCP traffic from a 15-vCPU FreeBSD guest, so it has 15 tx and 15 rx queues. With just one iperf send stream (-P1) it hits over 30 Gbps, which is quite good. With two send streams (-P2) it reaches 43 Gbps. The more streams I use, the less scaling I can observe. The best performance is around 65 Gbps with six send streams (-P6). Beyond that, I see no further scaling with more send streams, even though the VM still has more vCPUs and tx queues available.

I noticed a few things while running the tests, and I would appreciate any insight:

1. With a higher number of send streams (>6), send streams are more likely to terminate with a Broken Pipe error before the full test time ends. For example, in a test with 10 send streams for 30 seconds, one to four streams may terminate within just a few seconds with Broken Pipe errors. Running the same test from a Linux guest against the same test server, I have never seen this problem.

2. The driver selects the tx queue based on the mbuf's m_pkthdr.flowid field. I can see that each stream gets a different 4-byte flowid value. However, it is very likely that multiple flowids still collide onto the same tx queue if we just use an algorithm like "flowid % number_of_tx_queues" to pick the tx queue (see the sketch below). Any suggestions on how to avoid this?
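
For reference, the tx queue selection is roughly like the minimal sketch below; the my_* names and the softc layout are placeholders, not the actual driver structures:

#include <sys/param.h>
#include <sys/systm.h>
#include <sys/mbuf.h>
#include <sys/pcpu.h>

/* Placeholder softc/queue layout, trimmed to what is relevant here. */
struct my_txq;
struct my_softc {
	int		  num_tx_queues;
	struct my_txq	**tx_queues;
};

static struct my_txq *
my_select_txq(struct my_softc *sc, struct mbuf *m)
{
	uint32_t qidx;

	if (M_HASHTYPE_GET(m) != M_HASHTYPE_NONE) {
		/*
		 * Plain modulo: any two flowids that are congruent
		 * mod num_tx_queues land on the same tx queue.
		 */
		qidx = m->m_pkthdr.flowid % sc->num_tx_queues;
	} else {
		/* No RSS hash available: fall back to the current CPU. */
		qidx = curcpu % sc->num_tx_queues;
	}
	return (sc->tx_queues[qidx]);
}

With 15 queues and only a handful of active flowids, two or more streams landing on the same queue is fairly likely, which could be part of why the scaling flattens.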

3. The tx ring size is 256. I allocate a 1024-entry buf ring for each tx queue to queue up the send requests (see the sketch below). Under heavy tx load I have seen the tx queue have to be stopped until more completions come back, but I have never seen any drbr queue errors. Do these numbers look reasonable, or do they need further tuning?
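
The enqueue/drain path follows the usual drbr pattern, roughly as in the sketch below; the my_* helpers and fields are placeholders standing in for the real descriptor-space check and encap:

#include <sys/param.h>
#include <sys/mbuf.h>
#include <sys/buf_ring.h>
#include <net/if.h>
#include <net/if_var.h>

/* Placeholder per-queue state; br was allocated as buf_ring_alloc(1024, ...). */
struct my_txq {
	struct buf_ring	*br;
	/* ... hardware ring state ... */
};

/* Hypothetical helpers for the real ring-space check and encapsulation. */
static bool	my_hw_ring_full(struct my_txq *txq);
static int	my_encap(struct my_txq *txq, struct mbuf **mp);

static int
my_transmit_locked(struct ifnet *ifp, struct my_txq *txq, struct mbuf *m)
{
	struct mbuf *next;
	int error;

	/*
	 * drbr_enqueue() returns ENOBUFS when the 1024-entry buf ring is
	 * full; this is where a drbr queue error would show up.
	 */
	error = drbr_enqueue(ifp, txq->br, m);
	if (error != 0)
		return (error);

	/* Drain into the 256-descriptor hardware ring while it has room. */
	while ((next = drbr_peek(ifp, txq->br)) != NULL) {
		if (my_hw_ring_full(txq)) {
			/* Stop the queue until completions free descriptors. */
			drbr_putback(ifp, txq->br, next);
			break;
		}
		if (my_encap(txq, &next) != 0) {
			if (next == NULL)
				drbr_advance(ifp, txq->br);
			else
				drbr_putback(ifp, txq->br, next);
			break;
		}
		drbr_advance(ifp, txq->br);
	}
	return (0);
}

Since the buf ring (1024) is four times the hardware ring (256), bursts are absorbed there, which matches never seeing ENOBUFS from drbr_enqueue().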

4. On the tx completion path, a task thread is scheduled for each tx queue when a completion interrupt is received. This thread is not bound to any CPU, so it can run on any CPU. Is it useful to bind it to a specific CPU? I did try this (see the sketch below) but saw little difference.
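
When I tried the binding, it was along the lines of the sketch below, using taskqueue_start_threads_cpuset(); again the my_* names are placeholders:

#include <sys/param.h>
#include <sys/systm.h>
#include <sys/cpuset.h>
#include <sys/kernel.h>
#include <sys/malloc.h>
#include <sys/priority.h>
#include <sys/smp.h>
#include <sys/taskqueue.h>

/* Placeholder per-queue state and completion handler. */
struct my_txq {
	struct task		 cleanup_task;
	struct taskqueue	*tq;
};
static void	my_txq_cleanup(void *arg, int pending);

static int
my_txq_taskqueue_init(struct my_txq *txq, int qidx)
{
	cpuset_t mask;

	TASK_INIT(&txq->cleanup_task, 0, my_txq_cleanup, txq);
	txq->tq = taskqueue_create_fast("my_txq", M_NOWAIT,
	    taskqueue_thread_enqueue, &txq->tq);
	if (txq->tq == NULL)
		return (ENOMEM);

	/* Pin the completion thread to the CPU matching the queue index. */
	CPU_ZERO(&mask);
	CPU_SET(qidx % mp_ncpus, &mask);
	return (taskqueue_start_threads_cpuset(&txq->tq, 1, PI_NET,
	    &mask, "my txq%d cleanup", qidx));
}

The interrupt handler then just calls taskqueue_enqueue(txq->tq, &txq->cleanup_task). As mentioned, pinning this way made little measurable difference in my tests.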

Any other ideas are also very welcome.

Thanks,
Wei


