Date:      Thu, 30 Apr 2020 18:18:41 +0300
From:      Pavel Vazharov <pavel@x3me.net>
To:        freebsd-questions@freebsd.org
Subject:   Performance troubleshooting of FreeBSD networking stack and/or kevent functionality
Message-ID:  <CAJEV1ijXbyCNxzzVyjofQikBCVP%2B19WfGBwEvtH30L4fGvX7=Q@mail.gmail.com>

Hi there,

First I want to say that I understand that the following questions are very
broad and possibly only indirectly related to FreeBSD networking (I'm not
sure). It's just that after more than a week spent on the issue below, the
only option I can see is to ask for help or advice.

There is a project called F-Stack <https://github.com/F-Stack/f-stack>. It
runs the networking stack from FreeBSD 11.01 on top of DPDK
<https://www.dpdk.org/>. It uses DPDK to get the packets from the
network card in user space and then uses the FreeBSD stack to handle the
packets, again in user space. It also provides a socket API and an epoll API,
which internally uses kqueue/kevent from FreeBSD.
We made a setup to test the performance of a transparent TCP proxy based on
F-Stack and another one running on the standard Linux kernel. We ran the tests
in a KVM virtual machine with 2 cores (Intel(R) Xeon(R) Gold 6139 CPU @ 2.30GHz)
and 32GB RAM. A 10Gbps NIC was attached in passthrough mode.
The application-level code, which handles the epoll notifications and
copies data (memcpy) between the sockets, is 100% the same in both proxy
applications. Both proxy applications are single-threaded, and in all tests we
pinned the application to core 1. For the test with the standard Linux
application, the interrupts from the network card were pinned to the same
core 1.

Here are the test results:
1. The Linux based proxy was able to handle about 1.7-1.8 Gbps before it
started to throttle the traffic. No visible CPU usage was observed on core
0 during the tests, only core 1, where the application and the IRQs were
pinned, took the load.
2. The DPDK+FreeBSD proxy was able to handle 700-800 Mbps before it
started to throttle the traffic. No visible CPU usage was observed on core
0 during the tests; only core 1, where the application was pinned, took the
load.
3. We did another test with the DPDK+FreeBSD proxy just to give us some
more info about the problem. We disabled the TCP proxy functionality and
let the packets simply be IP-forwarded by the FreeBSD stack. In this test
we reached up to 5Gbps without being able to throttle the traffic. We just
don't have more traffic to redirect there at the moment.
4. We profiled the DPDK+FreeBSD proxy with Linux perf at 200 Mbps of
traffic, just to check whether some functionality is a visible bottleneck.
If I understand the results correctly, the application spends most of its
time reading packets from the network card, and after that most of the time
is spent in kevent-related functionality.

# Children      Self       Samples  Command          Shared Object       Symbol
# ........  ........  ............  ...............  ..................  .....................................................
#
    43.46%    39.67%          9071  xproxy.release   xproxy.release      [.] main_loop
            |
            |--35.31%--main_loop
            |          |
            |           --3.71%--_recv_raw_pkts_vec_avx2
            |
            |--5.44%--0x305f6e695f676e69
            |          main_loop
            |
             --2.68%--0
                       main_loop

    25.51%     0.00%             0  xproxy.release   xproxy.release      [.] 0x0000000000cdbc40
            |
            ---0xcdbc40
               |
               |--5.03%--__cap_rights_set
               |
               |--4.65%--kern_kevent
               |
               |--3.85%--kqueue_kevent
               |
               |--3.62%--__cap_rights_init
               |
               |--3.45%--kern_kevent_fp
               |
               |--1.90%--fget
               |
               |--1.61%--uma_zalloc_arg
               |
                --1.40%--fget_unlocked

    10.01%     0.00%             0  xproxy.release   [unknown]           [k] 0x00007fa0761d8010
            |
            ---0x7fa0761d8010
               |
               |--4.23%--ff_kevent_do_each
               |
               |--2.33%--net::ff_epoll_reactor_impl::process_events  <-- Only this function is ours
               |
               |--1.96%--kern_kevent
               |
                --1.48%--ff_epoll_wait

     7.13%     7.12%          1627  xproxy.release   xproxy.release      [.] kqueue_kevent
            |
            |--3.84%--0xcdbc40
            |          kqueue_kevent
            |
            |--2.41%--0
            |          kqueue_kevent
            |
             --0.88%--kqueue_kevent

     6.82%     0.00%             0  xproxy.release   [unknown]           [.] 0x0000000001010010
            |
            ---0x1010010
               |
               |--2.40%--uma_zalloc_arg
               |
                --1.22%--uma_zero_item

5. We profiled again, this time with intrusive timing of some blocks of
code and again at around 200 Mbps of traffic, and again found that about 30%
of the application time is spent in the epoll_wait function, which just
sets up the parameters for kern_kevent and calls it.
The whole application can be very roughly represented in the following way:
- Read incoming packets from the network card
- Write pending outgoing packets to the network card
- Push the incoming packets to the FreeBSD stack
- Call epoll_wait/kevent without waiting <- About 25-30% of the application
time seems to be spent here
- Handle the events
- Loop from the beginning
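
To make the kind of intrusive timing described above concrete, here is a
minimal sketch of the loop with a per-stage timer. It is in Python only for
brevity. The real proxy is not written this way, and every name in it
(StageTimer, rx_burst, tx_burst, stack_input, poll_events, handle_events)
is a hypothetical placeholder, not an F-Stack or DPDK API. It also counts
polls that return no events, because in a loop that never blocks, such
empty polls can account for a noticeable share of the poll time when the
traffic rate is low.

```python
import time
from collections import defaultdict


class StageTimer:
    """Accumulates elapsed time and call counts per named stage."""

    def __init__(self, clock=time.perf_counter_ns):
        self.clock = clock              # injectable so the timer can be tested
        self.total_ns = defaultdict(int)
        self.calls = defaultdict(int)

    def run(self, name, fn, *args):
        """Call fn(*args), charging the elapsed time to stage `name`."""
        start = self.clock()
        result = fn(*args)
        self.total_ns[name] += self.clock() - start
        self.calls[name] += 1
        return result

    def shares(self):
        """Fraction of the total measured time spent in each stage."""
        total = sum(self.total_ns.values())
        if total == 0:
            return {}
        return {name: ns / total for name, ns in self.total_ns.items()}


def main_loop(timer, stages, iterations):
    """Run the stages in the order listed above; return the empty-poll count.

    `stages` maps the (hypothetical) stage names to callables. The
    `poll_events` callable must return a list of ready events, possibly empty.
    """
    empty_polls = 0
    for _ in range(iterations):
        packets = timer.run("rx_burst", stages["rx_burst"])
        timer.run("tx_burst", stages["tx_burst"])
        timer.run("stack_input", stages["stack_input"], packets)
        events = timer.run("poll_events", stages["poll_events"])
        if not events:
            empty_polls += 1            # the poll did no useful work this pass
        timer.run("handle_events", stages["handle_events"], events)
    return empty_polls
```

Comparing timer.shares()["poll_events"] with the ratio of empty polls to
iterations would show whether the 25-30% is spent mostly on polls that
deliver events or on polls that find nothing to do.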

Here is the FreeBSD configuration which was used for the tests:

[freebsd.boot]
hz=100
fd_reserve=1024
kern.ncallout=524288
kern.sched.slice=1
kern.maxvnodes=524288
kern.ipc.nmbclusters=262144
kern.ipc.maxsockets=524000
net.inet.ip.fastforwarding=1
net.inet.tcp.syncache.hashsize=32768
net.inet.tcp.syncache.bucketlimit=32
net.inet.tcp.syncache.cachelimit=1048576
net.inet.tcp.tcbhashsize=524288
net.inet.tcp.syncache.rst_on_sock_fail=0
net.link.ifqmaxlen=4096
kern.features.inet6=0
net.inet6.ip6.auto_linklocal=0
net.inet6.ip6.accept_rtadv=2
net.inet6.icmp6.rediraccept=1
net.inet6.ip6.forwarding=0

[freebsd.sysctl]
kern.maxfiles=524288
kern.maxfilesperproc=524288
kern.ipc.soacceptqueue=4096
kern.ipc.somaxconn=4096
kern.ipc.maxsockbuf=16777216
kern.ipc.nmbclusters=262144
kern.ipc.maxsockets=524288
net.link.ether.inet.maxhold=5
net.inet.ip.redirect=0
net.inet.ip.forwarding=1
net.inet.ip.portrange.first=1025
net.inet.ip.portrange.last=65535
net.inet.ip.intr_queue_maxlen=4096
net.inet.tcp.syncache.rst_on_sock_fail=0
net.inet.tcp.rfc1323=1
net.inet.tcp.fast_finwait2_recycle=1
net.inet.tcp.sendspace=16384
net.inet.tcp.recvspace=16384
net.inet.tcp.cc.algorithm=cubic
net.inet.tcp.sendbuf_max=16777216
net.inet.tcp.recvbuf_max=16777216
net.inet.tcp.sendbuf_auto=1
net.inet.tcp.recvbuf_auto=1
net.inet.tcp.sendbuf_inc=16384
net.inet.tcp.recvbuf_inc=524288
net.inet.tcp.sack.enable=1
net.inet.tcp.msl=2000
net.inet.tcp.delayed_ack=1
net.inet.tcp.blackhole=2
net.inet.udp.blackhole=1


Something important!!!
We've added functionality to the FreeBSD networking stack which allows us
to open transparent TCP sockets when the first data packet after the 3-way
handshake is received. I can explain why we need this functionality, and I
can also show you the code/patch, if needed. I have checked this
functionality multiple times, and I can't see how it could throttle the
traffic, cause packets to be dropped, and make the whole thing stop
responding to regular pings and arpings due to the packet drops. This
functionality was only applied to TCP traffic on port 80 during the tests.
Of course I could be missing something, but bugs in this functionality usually
lead to completely broken TCP connections or tapped connections due to a
wrong TCP window. At least this is my experience so far, having implemented
similar functionality in the Linux kernel, which we have been using for 3-4
years already. But again, I could be wrong here.

From the above tests and measurements I made the following
conclusions/observations:
- The FreeBSD stack has no problems forwarding 5Gbps of traffic, and thus
the performance decrease should be caused by one of the layers above it: TCP
handling in the stack, the kevent functionality, or the way the application
works with kevent?
- The kevent functionality shows up in the CPU profile with much higher
numbers than any application code. This could be because the application is
using kevent in some wrong way, or it could be just because the function
is called frequently? On the other hand, all of the functions in the loop
are called equally often.
- In the Linux proxy case, the IRQs may be handled on a given core, but the
actual packet processing within the networking stack could happen on both
cores, and this could lead to better performance. However, we did not
observe visible CPU usage on core 0 during the tests.

And finally, after this long post, here are my questions:
1. Does somebody have observations or educated guesses about how much
traffic I should expect the FreeBSD stack + kevent to process in the above
scenario? Are the numbers low or expected?
2. Can somebody think of any kevent specifics, compared to Linux
epoll, which can lead to worse performance? For example, the usage of the
EV_CLEAR flag? Reading too many or too few events at a time?
3. Can I check some counters of the FreeBSD stack which will point me to
potential bottlenecks?
4. Can somebody give me any other advice on what more to
check/debug/profile, or what config/sysctl settings to tweak to improve the
performance of the DPDK+FreeBSD based proxy?

Any help is appreciated!

Thanks in advance,
Pavel.


