Date: Fri, 9 Jul 2021 09:16:59 +0000
From: Wei Hu via freebsd-net <freebsd-net@freebsd.org>
To: "freebsd-net@FreeBSD.org" <freebsd-net@freebsd.org>
Cc: "imp@bsdimp.com" <imp@bsdimp.com>, Li-Wen Hsu <lwhsu@freebsd.org>
Subject: Send path scaling problem
Message-ID: <SI2P153MB044159D903A78C18B6181544BB189@SI2P153MB0441.APCP153.PROD.OUTLOOK.COM>
Hello,

I am working on a driver for a new SR-IOV NIC for FreeBSD VMs running on Hyper-V. The driver coding is largely complete, but performance testing shows some scaling problems on the send path, on which I am seeking advice.

The NIC is 100Gbps. I am running iperf2 as the client, generating TCP traffic from a 15-vcpu FreeBSD guest, so the driver has 15 tx and 15 rx queues. With a single iperf send stream (-P1) it hits over 30Gbps, which is quite good. With two send streams (-P2) it reaches 43Gbps. The more streams I use, the less scaling I observe: the best result is around 65Gbps with six send streams (-P6). Beyond that there is not much additional scaling with more send streams, even though the VM still has idle vcpus and tx queues available.

I see a few things during the tests, and I would appreciate any insight:

1. With a higher number of send streams (>6), send streams are more likely to terminate with a Broken Pipe error before the full test time ends. For example, in a 30-second test with 10 send streams, one to four streams may terminate within just a few seconds with Broken Pipe errors. Running the same test from a Linux guest against the same test server, I have never seen this problem.

2. The driver selects the tx queue based on the mbuf's m_pkthdr.flowid field. Each stream gets a different 4-byte flowid value, but it is still quite likely that multiple flowids collide onto the same tx queue if we simply use an algorithm like "flowid % number_of_tx_queues". Any suggestions on how to avoid this?

3. The tx ring size is 256. I allocate a 1024-entry buf_ring for each tx queue to queue up send requests. Under heavy tx load I have seen the tx queue stopped until more completions arrive, but I have never seen any drbr queue errors. Does this sizing look reasonable, or does it need further tuning?
4. On the tx completion path, a task thread is scheduled for each tx queue when a completion interrupt is received. This thread is not bound to any cpu, so it can run on any cpu. Would it be useful to bind it to a specific cpu? I tried this but saw little difference.

Any other ideas are also very welcome.

Thanks,
Wei