Date: Thu, 30 Apr 2020 18:18:41 +0300
From: Pavel Vazharov <pavel@x3me.net>
To: freebsd-questions@freebsd.org
Subject: Performance troubleshooting of FreeBSD networking stack and/or kevent functionality
Message-ID: <CAJEV1ijXbyCNxzzVyjofQikBCVP%2B19WfGBwEvtH30L4fGvX7=Q@mail.gmail.com>
Hi there,

First, I want to say that I understand the following questions are very broad and possibly only indirectly related to FreeBSD networking (I'm not sure). It's just that, after more than a week spent on the issue below, the only option I can see is to ask for help or some advice.

There is a project called F-Stack <https://github.com/F-Stack/f-stack>. It glues the networking stack from FreeBSD 11.01 on top of DPDK <https://www.dpdk.org/>. It uses DPDK to get the packets from the network card in user space and then uses the FreeBSD stack to handle the packets, again in user space. It also provides a socket API and an epoll API, which internally uses kqueue/kevent from FreeBSD.

We made a setup to compare the performance of a transparent TCP proxy based on F-Stack with another one running on the standard Linux kernel. We did the tests on KVM with 2 cores (Intel(R) Xeon(R) Gold 6139 CPU @ 2.30GHz) and 32GB RAM. A 10Gbps NIC was attached in passthrough mode. The application-level code, the part which handles epoll notifications and copies (memcpy) data between the sockets, is 100% the same in both proxy applications. Both proxy applications are single-threaded, and in all tests we pinned them to core 1. For the test with the standard Linux application, the interrupts from the network card were pinned to the same core 1.

Here are the test results:

1. The Linux-based proxy was able to handle about 1.7-1.8 Gbps before it started to throttle the traffic. No visible CPU usage was observed on core 0 during the tests; only core 1, where the application and the IRQs were pinned, took the load.

2. The DPDK+FreeBSD proxy was able to handle 700-800 Mbps before it started to throttle the traffic. No visible CPU usage was observed on core 0 during the tests; only core 1, where the application was pinned, took the load.

3. We did another test with the DPDK+FreeBSD proxy just to give us some more information about the problem.
We disabled the TCP proxy functionality and let the packets simply be IP-forwarded by the FreeBSD stack. In this test we reached up to 5 Gbps without reaching the point where the traffic gets throttled. We just don't have more traffic to redirect there at the moment.

4. We profiled the DPDK+FreeBSD proxy with Linux perf at 200 Mbps of traffic, just to check whether some functionality is a visible bottleneck. If I understand the results correctly, the application spends most of its time reading packets from the network card, and after that the time is spent in kevent-related functionality.

# Children      Self       Samples  Command         Shared Object   Symbol
# ........  ........  ............  ..............  ..............  ......................
#
    43.46%    39.67%          9071  xproxy.release  xproxy.release  [.] main_loop
            |
            |--35.31%--main_loop
            |          |
            |           --3.71%--_recv_raw_pkts_vec_avx2
            |
            |--5.44%--0x305f6e695f676e69
            |          main_loop
            |
             --2.68%--0
                       main_loop

    25.51%     0.00%             0  xproxy.release  xproxy.release  [.] 0x0000000000cdbc40
            |
            ---0xcdbc40
               |
               |--5.03%--__cap_rights_set
               |
               |--4.65%--kern_kevent
               |
               |--3.85%--kqueue_kevent
               |
               |--3.62%--__cap_rights_init
               |
               |--3.45%--kern_kevent_fp
               |
               |--1.90%--fget
               |
               |--1.61%--uma_zalloc_arg
               |
                --1.40%--fget_unlocked

    10.01%     0.00%             0  xproxy.release  [unknown]       [k] 0x00007fa0761d8010
            |
            ---0x7fa0761d8010
               |
               |--4.23%--ff_kevent_do_each
               |
               |--2.33%--net::ff_epoll_reactor_impl::process_events  <-- Only this function is ours
               |
               |--1.96%--kern_kevent
               |
                --1.48%--ff_epoll_wait

     7.13%     7.12%          1627  xproxy.release  xproxy.release  [.] kqueue_kevent
            |
            |--3.84%--0xcdbc40
            |          kqueue_kevent
            |
            |--2.41%--0
            |          kqueue_kevent
            |
             --0.88%--kqueue_kevent

     6.82%     0.00%             0  xproxy.release  [unknown]       [.] 0x0000000001010010
            |
            ---0x1010010
               |
               |--2.40%--uma_zalloc_arg
               |
                --1.22%--uma_zero_item

5.
We did another round of profiling, this time with intrusive timing of some blocks of code, again at around 200 Mbps of traffic. It showed again that about 30% of the application time is spent in the epoll_wait function, which just sets up the parameters for kern_kevent and calls it.

The whole application can be very roughly represented in the following way:
- Read incoming packets from the network card
- Write pending outgoing packets to the network card
- Push the incoming packets to the FreeBSD stack
- Call epoll_wait/kevent without waiting <- About 25-30% of the application time seems to be spent here
- Handle the events
- Loop from the beginning

Here is the configuration for FreeBSD which was used for the tests:

[freebsd.boot]
hz=100
fd_reserve=1024
kern.ncallout=524288
kern.sched.slice=1
kern.maxvnodes=524288
kern.ipc.nmbclusters=262144
kern.ipc.maxsockets=524000
net.inet.ip.fastforwarding=1
net.inet.tcp.syncache.hashsize=32768
net.inet.tcp.syncache.bucketlimit=32
net.inet.tcp.syncache.cachelimit=1048576
net.inet.tcp.tcbhashsize=524288
net.inet.tcp.syncache.rst_on_sock_fail=0
net.link.ifqmaxlen=4096
kern.features.inet6=0
net.inet6.ip6.auto_linklocal=0
net.inet6.ip6.accept_rtadv=2
net.inet6.icmp6.rediraccept=1
net.inet6.ip6.forwarding=0

[freebsd.sysctl]
kern.maxfiles=524288
kern.maxfilesperproc=524288
kern.ipc.soacceptqueue=4096
kern.ipc.somaxconn=4096
kern.ipc.maxsockbuf=16777216
kern.ipc.nmbclusters=262144
kern.ipc.maxsockets=524288
net.link.ether.inet.maxhold=5
net.inet.ip.redirect=0
net.inet.ip.forwarding=1
net.inet.ip.portrange.first=1025
net.inet.ip.portrange.last=65535
net.inet.ip.intr_queue_maxlen=4096
net.inet.tcp.syncache.rst_on_sock_fail=0
net.inet.tcp.rfc1323=1
net.inet.tcp.fast_finwait2_recycle=1
net.inet.tcp.sendspace=16384
net.inet.tcp.recvspace=16384
net.inet.tcp.cc.algorithm=cubic
net.inet.tcp.sendbuf_max=16777216
net.inet.tcp.recvbuf_max=16777216
net.inet.tcp.sendbuf_auto=1
net.inet.tcp.recvbuf_auto=1
net.inet.tcp.sendbuf_inc=16384
net.inet.tcp.recvbuf_inc=524288
net.inet.tcp.sack.enable=1
net.inet.tcp.msl=2000
net.inet.tcp.delayed_ack=1
net.inet.tcp.blackhole=2
net.inet.udp.blackhole=1

Something important!!!
We've added functionality to the FreeBSD networking stack which allows us to open transparent TCP sockets when the first data packet after the 3-way handshake is received. I can explain why we need this functionality, and I can also show you the code/patch, if needed. I have checked this functionality multiple times. I can't see how it could lead to throttling the traffic, causing packets to be dropped and the whole thing to stop responding to regular pings and arpings due to the packet drops. This functionality was only applied to TCP traffic on port 80 during the tests. Of course I could be missing something, but bugs in this functionality usually lead to completely broken TCP connections or tapped connections due to the wrong TCP window. At least this is my experience so far, having implemented similar functionality in the Linux kernel, which we have been using for 3-4 years already. But again, I could be wrong here.

From the above tests and measurements I made the following conclusions/observations:
- The FreeBSD stack has no problem forwarding 5 Gbps of traffic, and thus the performance decrease should be caused by some of the layers above it: TCP handling in the stack, the kevent functionality, or the way the application works with kevent?
- The kevent functionality appears in the CPU profile with much higher numbers than any application code. This could be because the application is using kevent in some wrong way, or it could be just because the function is called frequently? On the other hand, all of the functions in the loop are called equally often.
- For the Linux proxy case, the IRQs may be handled on a given core but the actual packet processing within the networking stack could happen on both cores, and this could lead to better performance.
  However, we did not observe visible CPU usage on core 0 during the tests.

And finally, after this long post, here are my questions:

1. Does anybody have observations or educated guesses about how much traffic I should expect the FreeBSD stack + kevent to process in the above scenario? Are the numbers low or expected?

2. Can anybody think of some kevent specifics, compared to Linux epoll, which could lead to worse performance? For example, the usage of the EV_CLEAR flag? Reading too many or too few events at a time?

3. Can I check some counters of the FreeBSD stack which will point me to potential bottlenecks?

4. Can somebody give me some other advice: what more to check/debug/profile, or what config/sysctl settings to tweak to improve the performance of the DPDK+FreeBSD based proxy?

Any help is appreciated!

Thanks in advance,
Pavel.
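
The loop structure and the intrusive timing described in item 5 can be illustrated with a small, self-contained sketch. It is not the proxy's code: it uses Python's standard selectors module in place of F-Stack's epoll wrapper, and run_loop and handle_event are hypothetical names chosen for the illustration. It only shows how the share of loop time spent in a zero-timeout poll call can be measured; the absolute numbers say nothing about kevent itself.

```python
import selectors
import socket
import time


def run_loop(iterations, sel, handle_event):
    """Busy-poll loop that records the share of wall time spent polling.

    Returns (poll_share, events_handled), where poll_share is the fraction
    of the total loop time spent inside the zero-timeout select() call.
    """
    poll_ns = 0
    events_handled = 0
    start = time.perf_counter_ns()
    for _ in range(iterations):
        # In the real application, reading/writing packets and pushing
        # them into the stack would happen here (omitted in this sketch).
        t0 = time.perf_counter_ns()
        ready = sel.select(timeout=0)  # non-blocking poll, never waits
        poll_ns += time.perf_counter_ns() - t0
        for key, mask in ready:
            handle_event(key, mask)
            events_handled += 1
    total_ns = max(time.perf_counter_ns() - start, 1)
    return poll_ns / total_ns, events_handled


if __name__ == "__main__":
    # One readable socket stands in for the proxy's connections.
    a, b = socket.socketpair()
    sel = selectors.DefaultSelector()
    sel.register(b, selectors.EVENT_READ)
    a.send(b"x" * 64)
    share, handled = run_loop(
        10000, sel, lambda key, mask: key.fileobj.recv(4096)
    )
    print(f"poll share: {share:.1%}, events handled: {handled}")
```

In a busy-poll loop like this the poll call runs on every iteration whether or not any socket is ready, so its measured share grows as the rest of the loop does less work per iteration. That is one way a frequently called function can dominate a profile without being used incorrectly.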