Date:      Mon, 4 May 2020 15:18:01 +0300
From:      Pavel Vazharov <pavel@x3me.net>
To:        freebsd-net@freebsd.org
Subject:   Fwd: Performance troubleshooting of FreeBSD networking stack and/or kevent functionality
Message-ID:  <CAJEV1ih9OyAy7tnj7oipLgzsVOJVE0bSNfXXB4zxQVCcoLxuyQ@mail.gmail.com>
In-Reply-To: <20200501213705.GA52782@neutralgood.org>
References:  <CAJEV1ijXbyCNxzzVyjofQikBCVP+19WfGBwEvtH30L4fGvX7=Q@mail.gmail.com> <20200501213705.GA52782@neutralgood.org>

Hi there,

First I want to say that I understand that the following questions are very
broad and possibly only indirectly related to FreeBSD networking (I'm not
sure). It's just that after more than a week spent on the issue below, the
only option I can see is to ask for help or some piece of advice.

There is a project called F-Stack <https://github.com/F-Stack/f-stack>. It
glues the networking stack from FreeBSD 11.01 on top of DPDK
<https://www.dpdk.org/>. It uses DPDK to get the packets from the network
card into user space and then uses the FreeBSD stack to handle the packets,
again in user space. It also provides a socket API and an epoll API, the
latter implemented internally on top of kqueue/kevent from FreeBSD.
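
For reference, the typical usage pattern looks roughly like the sketch
below. It is a simplified illustration, assuming the ff_socket/ff_bind/
ff_listen and ff_epoll_* wrappers from F-Stack's ff_api.h and ff_epoll.h
as used in its example programs (our proxy code is more involved than this):

  #include <string.h>
  #include <netinet/in.h>
  #include "ff_config.h"
  #include "ff_api.h"
  #include "ff_epoll.h"

  /* Sketch: open a listening socket through F-Stack and register it with
   * the epoll wrapper; internally this ends up in kqueue/kevent. */
  static int setup_listener(int epfd, uint16_t port)
  {
      struct sockaddr_in addr;
      int fd = ff_socket(AF_INET, SOCK_STREAM, 0);

      memset(&addr, 0, sizeof(addr));
      addr.sin_family = AF_INET;
      addr.sin_port = htons(port);
      addr.sin_addr.s_addr = htonl(INADDR_ANY);
      ff_bind(fd, (struct linux_sockaddr *)&addr, sizeof(addr));
      ff_listen(fd, 512);

      struct epoll_event ev;
      ev.data.fd = fd;
      ev.events = EPOLLIN;                        /* read readiness */
      ff_epoll_ctl(epfd, EPOLL_CTL_ADD, fd, &ev);
      return fd;
  }
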
We set up a test comparing the performance of a transparent TCP proxy based
on F-Stack against the same proxy running on the standard Linux kernel. We
did the tests on KVM with 2 cores (Intel(R) Xeon(R) Gold 6139 CPU @ 2.30GHz)
and 32GB RAM. A 10Gbps NIC was attached in passthrough mode.
The application-level code, the part which handles the epoll notifications
and memcpys data between the sockets, is 100% the same in both proxy
applications. Both proxies are single threaded and in all tests we pinned
them to core 1. For the test with the standard Linux application the
interrupts from the network card were also pinned to core 1.

Here are the test results:
1. The Linux based proxy was able to handle about 1.7-1.8 Gbps before it
started to throttle the traffic. No visible CPU usage was observed on core
0 during the tests; only core 1, where the application and the IRQs were
pinned, took the load.
2. The DPDK+FreeBSD proxy was able to handle 700-800 Mbps before it started
to throttle the traffic. No visible CPU usage was observed on core 0 during
the tests; only core 1, where the application was pinned, took the load.
3. We did another test with the DPDK+FreeBSD proxy just to give us some
more info about the problem. We disabled the TCP proxy functionality and
let the packets be simply IP-forwarded by the FreeBSD stack. In this test
we reached up to 5 Gbps without being able to throttle the traffic; we
simply don't have more traffic to redirect there at the moment.
4. We did a profiling with Linux perf of the DPDK+FreeBSD proxy with
200 Mbps of traffic, just to check whether some functionality is a visible
bottleneck.
If I understand the results correctly, the application spends most of its
time reading packets from the network card, and after that the time is
spent in kevent-related functionality.

# Children      Self       Samples  Command          Shared Object Symbol
# ........  ........  ............  ............... ..................
....................................................
#
     43.46%    39.67%          9071  xproxy.release   xproxy.release
    [.] main_loop
             |
             |--35.31%--main_loop
             |          |
             |           --3.71%--_recv_raw_pkts_vec_avx2
             |
             |--5.44%--0x305f6e695f676e69
             |          main_loop
             |
              --2.68%--0
                        main_loop

     25.51%     0.00%             0  xproxy.release   xproxy.release
    [.] 0x0000000000cdbc40
             |
             ---0xcdbc40
                |
                |--5.03%--__cap_rights_set
                |
                |--4.65%--kern_kevent
                |
                |--3.85%--kqueue_kevent
                |
                |--3.62%--__cap_rights_init
                |
                |--3.45%--kern_kevent_fp
                |
                |--1.90%--fget
                |
                |--1.61%--uma_zalloc_arg
                |
                 --1.40%--fget_unlocked

     10.01%     0.00%             0  xproxy.release   [unknown]
    [k] 0x00007fa0761d8010
             |
             ---0x7fa0761d8010
                |
                |--4.23%--ff_kevent_do_each
                |
                |--2.33%--net::ff_epoll_reactor_impl::process_events  <-- Only this function is ours
                |
                |--1.96%--kern_kevent
                |
                 --1.48%--ff_epoll_wait

      7.13%     7.12%          1627  xproxy.release   xproxy.release
    [.] kqueue_kevent
             |
             |--3.84%--0xcdbc40
             |          kqueue_kevent
             |
             |--2.41%--0
             |          kqueue_kevent
             |
              --0.88%--kqueue_kevent

      6.82%     0.00%             0  xproxy.release   [unknown]
    [.] 0x0000000001010010
             |
             ---0x1010010
                |
                |--2.40%--uma_zalloc_arg
                |
                 --1.22%--uma_zero_item

5. We did another profiling, just doing intrusive timing of some blocks of
 code, again with around 200 Mbps of traffic, and found again that about 30%
 of the application time is spent in the epoll_wait function, which just
 sets up the parameters for calling kern_kevent and calls that function.
 The whole application can be very roughly represented in the following way
 (a rough sketch of the loop follows the list):
 - Read incoming packets from the network card
 - Write pending outgoing packets to the network card
 - Push the incoming packets to the FreeBSD stack
 - Call epoll_wait/kevent without waiting <- about 25-30% of the application
   time seems to be spent here
 - Handle the events
 - Loop from the beginning
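
To make the shape of that loop concrete, here is a rough sketch of how it
looks in code, assuming the application is driven by F-Stack's ff_run()
callback as in F-Stack's example programs (the real proxy code is more
involved, and handle_event() below is just a placeholder for our event
handling). Packet RX/TX and pushing packets into the FreeBSD stack happen
inside ff_run() before each invocation of the callback:

  /* Includes as in the previous sketch: ff_config.h, ff_api.h, ff_epoll.h */

  #define MAX_EVENTS 512

  static int epfd;
  static struct epoll_event events[MAX_EVENTS];
  static void handle_event(struct epoll_event *ev);  /* our proxy logic */

  /* Called by ff_run() on every iteration, after F-Stack has written
   * pending packets, read a burst of packets from the NIC and pushed
   * them into the FreeBSD stack. */
  static int loop(void *arg)
  {
      /* timeout 0: poll kevent without waiting -- this is the call where
       * 25-30% of the application time seems to go. */
      int n = ff_epoll_wait(epfd, events, MAX_EVENTS, 0);

      for (int i = 0; i < n; ++i) {
          /* Our code: accept, read, memcpy between the two sockets of
           * the proxied connection, write. */
          handle_event(&events[i]);
      }
      return 0;
  }

  /* In main(): ff_init(argc, argv); epfd = ff_epoll_create(512);
   * set up the sockets, then ff_run(loop, NULL), which never returns. */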

 Here is the configuration for FreeBSD which was used for the tests

 [freebsd.boot]
 hz=100
 fd_reserve=1024
 kern.ncallout=524288
 kern.sched.slice=1
 kern.maxvnodes=524288
 kern.ipc.nmbclusters=262144
 kern.ipc.maxsockets=524000
 net.inet.ip.fastforwarding=1
 net.inet.tcp.syncache.hashsize=32768
 net.inet.tcp.syncache.bucketlimit=32
 net.inet.tcp.syncache.cachelimit=1048576
 net.inet.tcp.tcbhashsize=524288
 net.inet.tcp.syncache.rst_on_sock_fail=0
 net.link.ifqmaxlen=4096
 kern.features.inet6=0
 net.inet6.ip6.auto_linklocal=0
 net.inet6.ip6.accept_rtadv=2
 net.inet6.icmp6.rediraccept=1
 net.inet6.ip6.forwarding=0

 [freebsd.sysctl]
 kern.maxfiles=524288
 kern.maxfilesperproc=524288
 kern.ipc.soacceptqueue=4096
 kern.ipc.somaxconn=4096
 kern.ipc.maxsockbuf=16777216
 kern.ipc.nmbclusters=262144
 kern.ipc.maxsockets=524288
 net.link.ether.inet.maxhold=5
 net.inet.ip.redirect=0
 net.inet.ip.forwarding=1
 net.inet.ip.portrange.first=1025
 net.inet.ip.portrange.last=65535
 net.inet.ip.intr_queue_maxlen=4096
 net.inet.tcp.syncache.rst_on_sock_fail=0
 net.inet.tcp.rfc1323=1
 net.inet.tcp.fast_finwait2_recycle=1
 net.inet.tcp.sendspace=16384
 net.inet.tcp.recvspace=16384
 net.inet.tcp.cc.algorithm=cubic
 net.inet.tcp.sendbuf_max=16777216
 net.inet.tcp.recvbuf_max=16777216
 net.inet.tcp.sendbuf_auto=1
 net.inet.tcp.recvbuf_auto=1
 net.inet.tcp.sendbuf_inc=16384
 net.inet.tcp.recvbuf_inc=524288
 net.inet.tcp.sack.enable=1
 net.inet.tcp.msl=2000
 net.inet.tcp.delayed_ack=1
 net.inet.tcp.blackhole=2
 net.inet.udp.blackhole=1


Something important!!!
We've added functionality to the FreeBSD networking stack which allows us
to open transparent TCP sockets when the first data packet after the 3-way
handshake is received. I can explain why we need this functionality, if
needed, and I can also show you the code/patch. I've checked this
functionality multiple times and I can't see how it could lead to
throttling the traffic, causing packets to be dropped and the whole thing
to stop responding to regular pings and arpings due to the packet drops.
This functionality was only applied to TCP traffic on port 80 during the
tests. Of course I could be missing something, but bugs in this
functionality usually lead to completely broken TCP connections or stalled
connections due to a wrong TCP window. At least this is my experience so
far, having implemented similar functionality in the Linux kernel, which we
have been using for 3-4 years already. But again, I could be wrong here.

From the above tests and measurements I made the following
conclusions/observations:
- The FreeBSD stack has no problem forwarding 5 Gbps of traffic, so the
performance decrease should be caused by some of the layers above it - the
TCP handling in the stack, the kevent functionality, or the application
working with kevent?
- The kevent functionality appears in the CPU profiling with much higher
numbers than any application code. This could be because the application is
using kevent in some wrong way, or it could be just because the function is
called frequently? On the other hand, all of the functions in the loop are
called equally often.
- For the Linux proxy case, the IRQs may be handled on a given core but the
actual packet processing within the networking stack could happen on both
cores, and this could lead to better performance. However, we did not
observe visible CPU usage on core 0 during the tests.

And finally, after this long post, here are my questions:
1. Does somebody have observations or educated guesses about what amount of
traffic I should expect the FreeBSD stack + kevent to process in the above
scenario? Are the numbers low or expected?
2. Can somebody think of some kevent specifics, compared to Linux epoll,
which could lead to worse performance? For example the usage of the
EV_CLEAR flag (see the sketch after the questions), or reading too many or
too few events at a time?
3. Can I check some counters of the FreeBSD stack which will point me to
potential bottlenecks?
4. Can somebody give me some other advice on what more to
check/debug/profile, or which config/sysctl settings to tweak, to improve
the performance of the DPDK+FreeBSD based proxy?
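
To illustrate question 2, here is a minimal sketch of the registration I
have in mind, using the plain kqueue/kevent API (F-Stack's ff_kqueue() and
ff_kevent() wrappers take the same arguments). With EV_CLEAR the filter
state is reset once the event is reported, so it behaves roughly like
edge-triggered epoll; without it the event keeps firing while data is
pending, like level-triggered epoll:

  #include <sys/types.h>
  #include <sys/event.h>
  #include <sys/time.h>
  #include <stddef.h>

  /* Register a socket for read events on a kqueue. */
  static void register_read(int kq, int fd, int edge_triggered)
  {
      struct kevent kev;
      unsigned short flags = EV_ADD | EV_ENABLE;

      if (edge_triggered)
          flags |= EV_CLEAR;          /* behaves roughly like EPOLLET */

      EV_SET(&kev, fd, EVFILT_READ, flags, 0, 0, NULL);
      kevent(kq, &kev, 1, NULL, 0, NULL);   /* apply change, don't wait */
  }

  /* Polling without blocking, as the proxy loop does:
   *   struct timespec ts = {0, 0};
   *   int n = kevent(kq, NULL, 0, evlist, MAX_EVENTS, &ts);
   */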

One last thing which I was thinking about in the last few days. As far as I
know, interrupts will always preempt the currently running user space code.
So, if this is right, in the Linux case we'll have much more time spent
handling interrupts and much less time spent in user space handling the
epoll notifications. The situation is different in the F-Stack application
loop. There, pending packets are sent, 32 packets are read from the network
card and pushed to the FreeBSD stack, kevent is called, and the loop
repeats. This means that the time slice for packet reading and processing
in the stack is limited by the kevent call. So I thought that, just for the
test, changing the ratio between packet processing and kevent calls should
improve the situation. So I did a test where kevent was not called on every
iteration of the loop but only once 1024 packets had been read and pushed
to the network stack (roughly as sketched below). However, for some reason
this didn't improve the situation either, and currently I have no
explanation for this. Maybe I did something wrong when testing.
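
The batching experiment, very roughly (a sketch again, reusing the epfd,
events, MAX_EVENTS and handle_event() placeholders from the loop sketch
above; the threshold of 1024 is the one mentioned):

  /* Call kevent/epoll_wait only after roughly 1024 packets have been read
   * and pushed into the stack, instead of on every ff_run() iteration. */
  static int loop_batched(void *arg)
  {
      static unsigned pkts_since_poll;

      /* ff_run() reads up to 32 packets per iteration before calling us. */
      pkts_since_poll += 32;
      if (pkts_since_poll < 1024)
          return 0;                     /* keep feeding the stack */
      pkts_since_poll = 0;

      int n = ff_epoll_wait(epfd, events, MAX_EVENTS, 0);
      for (int i = 0; i < n; ++i)
          handle_event(&events[i]);
      return 0;
  }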

Any help is appreciated!

Thanks in advance,
Pavel.


