Date:      Tue, 03 Jul 2012 20:11:14 +0400
From:      "Alexander V. Chernikov" <melifaro@FreeBSD.org>
To:        net@freebsd.org, hackers@freebsd.org, performance@freebsd.org
Subject:   FreeBSD 10G forwarding performance @Intel
Message-ID:  <4FF319A2.6070905@FreeBSD.org>

Hello list!

I'm quite stuck with bad forwarding performance on many FreeBSD boxes 
doing firewalling.

Typical configuration is an E5645 / E5675 CPU with an Intel 82599 NIC.
HT is turned off.
(Configs and tunables below).

I'm mostly concerned with unidirectional traffic flowing to a single 
interface (e.g. using a single route entry).

In most cases the system can forward no more than 700 (or 1400) kpps, which 
is quite a bad number (Linux does, say, 5 Mpps on nearly the same hardware).


Test scenario:

Ixia XM2 (traffic generator) <> ix0 (FreeBSD).

Ixia sends 64-byte IP packets from vlan10 (10.100.0.64 - 10.100.0.156) to
destinations in vlan11 (10.100.1.128 - 10.100.1.192).

Static arps are configured for all destination addresses.

Traffic level is slightly above or slightly below system performance.
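
For scale, here is a rough sketch of the theoretical packet rate of a 10G link with minimum-size frames. It assumes the "64-byte packets" are 64-byte Ethernet frames and adds the standard 20 bytes of per-frame wire overhead (preamble, start-of-frame delimiter, inter-frame gap). The function name is made up for illustration.

```python
# Theoretical packet rate of an Ethernet link for a given frame size.
# Assumption: "64-byte packets" means 64-byte Ethernet frames (FCS included).
# Per-frame wire overhead: 7-byte preamble + 1-byte SFD + 12-byte inter-frame gap.
WIRE_OVERHEAD_BYTES = 7 + 1 + 12


def line_rate_pps(link_bps: float, frame_bytes: int) -> float:
    """Maximum number of frames per second the link can carry."""
    bits_on_wire = (frame_bytes + WIRE_OVERHEAD_BYTES) * 8
    return link_bps / bits_on_wire


if __name__ == "__main__":
    max_pps = line_rate_pps(10e9, 64)
    print(f"10GbE line rate, 64-byte frames: {max_pps / 1e6:.2f} Mpps")
    # Compare with the forwarding rates quoted above.
    for observed_kpps in (700, 1400):
        share = observed_kpps * 1e3 / max_pps
        print(f"{observed_kpps} kpps is {share:.1%} of line rate")
```

Under these assumptions the link tops out at about 14.88 Mpps, so the rates quoted above are a small fraction of what the wire can deliver.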


================= Test 1  =======================
Kernel: FreeBSD-8-S r237994, stock drivers, stock routing, no FLOWTABLE, 
no firewall

Traffic: 1-1 flow (1 src, 1 dst)
(This is actually a bit different from described above)

Result:
              input          (ix0)           output
     packets  errs idrops      bytes    packets  errs      bytes colls
        878k   48k     0        59M       878k     0        56M     0
        874k   48k     0        59M       874k     0        56M     0
        875k   48k     0        59M       875k     0        56M     0

16:41 [0] test15# top -nCHSIzs1 | awk '$5 ~ /(K|SIZE)/ { printf "  %7s %2s %6s %10s %15s %s\n", $7, $8, $9, $10, $11, $12}'
      STATE  C   TIME        CPU         COMMAND
       CPU6  6  17:28    100.00%      kernel{ix0 que}
       CPU9  9  20:42     60.06%    intr{irq265: ix0:que

16:41 [0] test15# vmstat -i | grep ix0
irq256: ix0:que 0                 500796        167
irq257: ix0:que 1                6693573       2245
irq258: ix0:que 2                2572380        862
irq259: ix0:que 3                3166273       1062
irq260: ix0:que 4                9691706       3251
irq261: ix0:que 5               10766434       3611
irq262: ix0:que 6                8933774       2996
irq263: ix0:que 7                5246879       1760
irq264: ix0:que 8                3548930       1190
irq265: ix0:que 9               11817986       3964
irq266: ix0:que 10                227561         76
irq267: ix0:link                       1          0

Note that the system is using 2 cores to forward, so 12 cores should be able 
to forward 4+ Mpps, which is more or less consistent with Linux results. 
Note that interrupts on all queues are (as far as I understand from the 
fact that AIM is turned off and interrupt rates are the same from 
previous test). Additionally, despite hw.intr_storm_threshold = 200k, 
I'm constantly getting the
interrupt storm detected on "irq265:"; throttling interrupt source
message.
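
The per-core extrapolation can be written out as a small sketch. It assumes perfectly linear scaling across cores, which is optimistic because the cores share locks and caches. The 875 kpps figure is taken from the netstat output of this test; the function name is hypothetical.

```python
def linear_scaling_estimate(measured_pps: float, cores_used: int,
                            total_cores: int) -> float:
    """Extrapolate a forwarding rate, assuming perfectly linear scaling.

    This is an upper bound: shared locks and cache traffic usually make
    real scaling sub-linear.
    """
    return measured_pps / cores_used * total_cores


if __name__ == "__main__":
    # About 875 kpps forwarded with 2 busy cores (Test 1), 12 cores in total.
    estimate = linear_scaling_estimate(875e3, cores_used=2, total_cores=12)
    print(f"Idealized 12-core estimate: {estimate / 1e6:.2f} Mpps")
```

With these inputs the idealized bound is 5.25 Mpps, so "4+ Mpps" already allows for some scaling loss.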


================= Test 2  =======================
Kernel: FreeBSD-8-S r237994, stock drivers, stock routing, no FLOWTABLE, 
no firewall

Traffic: Unidirectional many-2-many

16:20 [0] test15# netstat -I ix0 -hw 1
              input          (ix0)           output
     packets  errs idrops      bytes    packets  errs      bytes colls
        507k  651k     0        74M       508k     0        32M     0
        506k  652k     0        74M       507k     0        28M     0
        509k  652k     0        74M       508k     0        37M     0


16:28 [0] test15# top -nCHSIzs1 | awk '$5 ~ /(K|SIZE)/ { printf "  %7s %2s %6s %10s %15s %s\n", $7, $8, $9, $10, $11, $12}'
      STATE  C   TIME        CPU         COMMAND
      CPU10  6   0:40    100.00%      kernel{ix0 que}
       CPU2  2  11:47     84.86%    intr{irq258: ix0:que
       CPU3  3  11:50     81.88%    intr{irq259: ix0:que
       CPU8  8  11:38     77.69%    intr{irq264: ix0:que
       CPU7  7  11:24     77.10%    intr{irq263: ix0:que
       WAIT  1  10:10     74.76%    intr{irq257: ix0:que
       CPU4  4   8:57     63.48%    intr{irq260: ix0:que
       CPU6  6   8:35     61.96%    intr{irq262: ix0:que
       CPU9  9  14:01     60.79%    intr{irq265: ix0:que
        RUN  0   9:07     59.67%    intr{irq256: ix0:que
       WAIT  5   6:13     43.26%    intr{irq261: ix0:que
      CPU11 11   5:19     35.89%      kernel{ix0 que}
          -  4   3:41     25.49%      kernel{ix0 que}
          -  1   3:22     21.78%      kernel{ix0 que}
          -  1   2:55     17.68%      kernel{ix0 que}
          -  4   2:24     16.55%      kernel{ix0 que}
          -  1   9:54     14.99%      kernel{ix0 que}
       CPU0 11   2:13     14.26%      kernel{ix0 que}


16:07 [0] test15# vmstat -i | grep ix0
irq256: ix0:que 0                  13654         15
irq257: ix0:que 1                  87043         96
irq258: ix0:que 2                  39604         44
irq259: ix0:que 3                  48308         53
irq260: ix0:que 4                 138002        153
irq261: ix0:que 5                 169596        188
irq262: ix0:que 6                 107679        119
irq263: ix0:que 7                  72769         81
irq264: ix0:que 8                  30878         34
irq265: ix0:que 9                1002032       1115
irq266: ix0:que 10                 10967         12
irq267: ix0:link                       1          0


Note that all cores are loaded more or less evenly, but the result is 
_worse_. The first reason for this is mtx_lock, which is acquired twice 
on every lookup (once in in_matroute(), where it can possibly be 
removed, and once again in rtalloc1_fib()). The latter is addressed by 
andre@ in r234650.

Additionally, although the ithreads are each bound to a single CPU, the 
kernel que threads are not in the stock setup. However, a configuration 
with 5 queues and 5 kernel threads bound to different CPUs gives the 
same bad results.

================= Test 3  =======================
Kernel: FreeBSD-8-S June 4 SVN, +merged ifaddrlock, stock drivers, stock 
routing, no FLOWTABLE, no firewall


     packets  errs idrops      bytes    packets  errs      bytes colls
        580k   18k     0        38M       579k     0        37M     0
        581k   26k     0        39M       580k     0        37M     0
        580k   24k     0        39M       580k     0        37M     0
................
Enabling ipfw _increases_ performance a bit:

        604k     0     0        39M       604k     0        39M     0
        604k     0     0        39M       604k     0        39M     0
        582k   19k     0        38M       568k     0        37M     0
        527k   81k     0        39M       530k     0        34M     0
        605k    28     0        39M       605k     0        39M     0


================= Test 3.1  =======================

Same as test 3, the only difference is the following:
route add -net 10.100.1.160/27 -iface vlan11.

              input          (ix0)           output
     packets  errs idrops      bytes    packets  errs      bytes colls
        543k  879k     0        91M       544k     0        35M     0
        547k  870k     0        91M       545k     0        35M     0
        541k  870k     0        91M       539k     0        30M     0
        952k  565k     0        97M       962k     0        48M     0
        1.2M  228k     0        91M       1.2M     0        92M     0
        1.2M  226k     0        90M       1.1M     0        76M     0
        1.1M  228k     0        91M       1.2M     0        76M     0
        1.2M  233k     0        90M       1.2M     0        76M     0

================= Test 3.2  =======================

Same as test 3, splitting the destination into 4 smaller routes:
route add -net 10.100.1.128/28 -iface vlan11
route add -net 10.100.1.144/28 -iface vlan11
route add -net 10.100.1.160/28 -iface vlan11
route add -net 10.100.1.176/28 -iface vlan11

              input          (ix0)           output
     packets  errs idrops      bytes    packets  errs      bytes colls
        1.4M     0     0       106M       1.6M     0       106M     0
        1.8M     0     0       106M       1.6M     0        71M     0
        1.6M     0     0       106M       1.6M     0        71M     0
        1.6M     0     0        87M       1.6M     0        71M     0
        1.6M     0     0       126M       1.6M     0       212M     0

================= Test 3.3  =======================

Same as test 3, splitting the destination into 16 smaller routes:
              input          (ix0)           output
     packets  errs idrops      bytes    packets  errs      bytes colls
        1.6M     0     0       118M       1.8M     0       118M     0
        2.0M     0     0       118M       1.8M     0       119M     0
        1.8M     0     0       119M       1.8M     0        79M     0
        1.8M     0     0       117M       1.8M     0       157M     0
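
The route splits used in tests 3.2 and 3.3 can be generated mechanically. The sketch below assumes the covering prefix is 10.100.1.128/26 (the four /28 routes of test 3.2 add up to exactly that) and that the 16-route case uses /30 prefixes, which the text above does not state. The helper name is made up.

```python
import ipaddress


def split_routes(prefix: str, new_prefixlen: int, iface: str) -> list[str]:
    """Build 'route add' commands covering `prefix` with smaller subnets."""
    net = ipaddress.ip_network(prefix)
    return [
        f"route add -net {subnet} -iface {iface}"
        for subnet in net.subnets(new_prefix=new_prefixlen)
    ]


if __name__ == "__main__":
    # Assumption: 10.100.1.128/26 is the covering prefix of the destinations.
    for cmd in split_routes("10.100.1.128/26", 28, "vlan11"):
        print(cmd)  # the 4 routes of test 3.2
    # Assumption: the 16 routes of test 3.3 are the /30 subnets.
    print(len(split_routes("10.100.1.128/26", 30, "vlan11")), "routes for /30")
```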


================= Test 4  =======================
Kernel: FreeBSD-8-S June 4 SVN, stock drivers, routing patch 1, no 
FLOWTABLE, no firewall

              input          (ix0)           output
     packets  errs idrops      bytes    packets  errs      bytes colls
        1.8M     0     0       114M       1.9M     0       114M     0
        1.7M     0     0       114M       1.7M     0       114M     0
        1.8M     0     0       114M       1.8M     0       114M     0
        1.7M     0     0       114M       1.7M     0       114M     0
        1.8M     0     0       114M       1.8M     0        74M     0
        1.5M     0     0       114M       1.8M     0        74M     0
          2M     0     0       114M       1.8M     0       194M     0


Patch 1 totally eliminates mtx_lock for the fastforwarding path to get an 
idea of how much performance we can achieve. The result is nearly the same 
as in 3.3.

================= Test 4.1  =======================

Same as test 4, same traffic level, enabling the firewall with a single 
allow rule (evaluating RLOCK performance)

22:35 [0] test15# netstat -I ix0 -hw 1
              input          (ix0)           output
     packets  errs idrops      bytes    packets  errs      bytes colls
        1.8M  149k     0       114M       1.6M     0       142M     0
        1.4M  148k     0        85M       1.6M     0       104M     0
        1.8M  149k     0       143M       1.6M     0       104M     0
        1.6M  151k     0       114M       1.6M     0       104M     0
        1.6M  151k     0       114M       1.6M     0       104M     0
        1.4M  152k     0       114M       1.6M     0       104M     0

I.e. something like a 10% performance loss.
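
The loss figure can be checked against the netstat samples of tests 4 and 4.1. This is a minimal sketch that parses the human-readable counters (k/M suffixes) and compares the mean of the output "packets" column. The rows are copied from the two tables; the helper names are hypothetical.

```python
SUFFIXES = {"k": 1e3, "M": 1e6, "G": 1e9}

# Rows copied from the netstat output of test 4 and test 4.1.
TEST4_ROWS = [
    "1.8M 0 0 114M 1.9M 0 114M 0", "1.7M 0 0 114M 1.7M 0 114M 0",
    "1.8M 0 0 114M 1.8M 0 114M 0", "1.7M 0 0 114M 1.7M 0 114M 0",
    "1.8M 0 0 114M 1.8M 0 74M 0", "1.5M 0 0 114M 1.8M 0 74M 0",
    "2M 0 0 114M 1.8M 0 194M 0",
]
TEST41_ROWS = [
    "1.8M 149k 0 114M 1.6M 0 142M 0", "1.4M 148k 0 85M 1.6M 0 104M 0",
    "1.8M 149k 0 143M 1.6M 0 104M 0", "1.6M 151k 0 114M 1.6M 0 104M 0",
    "1.6M 151k 0 114M 1.6M 0 104M 0", "1.4M 152k 0 114M 1.6M 0 104M 0",
]


def parse_count(text: str) -> float:
    """Convert a humanized counter such as '149k' or '1.8M' to a number."""
    text = text.strip()
    if text and text[-1] in SUFFIXES:
        return float(text[:-1]) * SUFFIXES[text[-1]]
    return float(text)


def mean_output_pps(rows: list[str]) -> float:
    """Mean of the output 'packets' column (5th field) over all samples."""
    values = [parse_count(row.split()[4]) for row in rows]
    return sum(values) / len(values)


def relative_loss(baseline: float, degraded: float) -> float:
    """Fraction of the baseline rate that was lost."""
    return 1.0 - degraded / baseline


if __name__ == "__main__":
    loss = relative_loss(mean_output_pps(TEST4_ROWS), mean_output_pps(TEST41_ROWS))
    print(f"Forwarding loss with one ipfw rule: {loss:.1%}")
```

The counters are rounded to two significant digits by netstat -h, so the result is only as precise as the samples themselves.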


================= Test 4.2  =======================

Same as test 4, playing with the number of queues.

5 queues, same traffic level
        1.5M  225k     0       114M       1.5M     0        99M     0

================= Test 4.3  =======================

Same as test 4, HT on, number of queues = 16

              input          (ix0)           output
     packets  errs idrops      bytes    packets  errs      bytes colls
        2.4M     0     0       157M       2.4M     0       156M     0
        2.4M     0     0       156M       2.4M     0       157M     0

However, enabling the firewall immediately drops the rate to 1.9 Mpps, which is 
nearly the same as 4.1 (and a complicated fw ruleset would possibly kill an HT 
core much faster).

================= Test 4.4  =======================

Same as test 4, kernel ix0 que Tx threads bound to specific CPUs 
(corresponding to RX):
18:02 [0] test15# procstat -ak | grep ix0 | sort -nk 2
     12 100045 intr             irq256: ix0:que  <running>
      0 100046 kernel           ix0 que          <running>
     12 100047 intr             irq257: ix0:que  <running>
      0 100048 kernel           ix0 que          mi_switch sleepq_wait msleep_spin taskqueue_thread_loop fork_exit fork_trampoline
     12 100049 intr             irq258: ix0:que  <running>
..

test15# for i in `jot 12 0`; do cpuset -l $i -t $((100046+2*$i)); done

Result:
              input          (ix0)           output
     packets  errs idrops      bytes    packets  errs      bytes colls
        2.1M     0     0       139M         2M     0       193M     0
        2.1M     0     0       139M       2.3M     0       139M     0
        2.1M     0     0       139M       2.1M     0        85M     0
        2.1M     0     0       139M       2.1M     0       193M     0

Quite a considerable increase; however, this works better only with a uniform 
traffic distribution.
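
The thread-to-CPU mapping produced by the cpuset loop above can be spelled out as follows. The sketch only prints the commands. It assumes, as the procstat output suggests, that the kernel "ix0 que" thread IDs start at 100046 and alternate with the ithread IDs, hence the stride of 2. The function name is hypothetical.

```python
def cpuset_commands(first_tid: int = 100046, queues: int = 12,
                    tid_stride: int = 2) -> list[str]:
    """Build one 'cpuset' command per queue, pinning thread N to CPU N.

    Assumption: kernel que thread IDs are first_tid, first_tid + 2, ...
    because they interleave with the ithread IDs in procstat output.
    """
    return [
        f"cpuset -l {cpu} -t {first_tid + tid_stride * cpu}"
        for cpu in range(queues)
    ]


if __name__ == "__main__":
    for cmd in cpuset_commands():
        print(cmd)
```

Printing the commands first makes it easy to compare the thread IDs with procstat before actually pinning anything.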


================= Test 5  =======================
Same as test 4, make radix use rmlock (r234648, r234649).

Result: 1.7 MPPS.


================= Test 6  =======================
Same as test 4 + FLOWTABLE

Result: 1.7 MPPS.


================= Test 7  =======================
Same as test 4, built with GCC 4.7

Result: No performance gain


Further investigations:

================= Test 8  =======================
Test 4 setup with a kernel built with LOCK_PROFILING.

17:46 [0] test15# sysctl debug.lock.prof.enable=1 ; sleep 2 ; sysctl debug.lock.prof.enable=0

        920k     0     0        59M       920k     0        59M     0
        875k     0     0        59M       920k     0        59M     0
        628k     0     0        39M       566k     0        45M     0
         79k  2.7M     0       186M        57k     0       6.5M     0
         71k  878k     0        61M        73k     0       4.0M     0
        891k  254k     0        72M       917k     0        54M     0
        920k     0     0        59M       920k     0        59M     0


When enabled, forwarding performance goes down to 60 kpps.
Profiling was enabled for 2 seconds (so only about 130k packets were 
forwarded); results are attached as a separate file. There are several 
hundred lock contentions in ixgbe, and that's all.

================= Test 9  =======================
Same as test 4 setup with hwpmc.
Results attached.

================= Test 10  =======================
Kernel: FreeBSD-9-S.
No major difference


Some (my) preliminary conclusions:
1) rte mtx_lock should (and can) be eliminated from the stock kernel. (And 
it can be done more or less easily for in_matroute.)
2) The rmlock vs rwlock performance difference is insignificant (maybe 
because of 3) )
3) There is lock contention between ixgbe taskq threads and ithreads. 
I'm not sure if taskq threads are necessary in the case of packet 
forwarding as opposed to traffic generation.


Maybe I'm missing something else (L2 cache misses or other things)?

What else can I do to debug this further?



Relevant files:
http://static.ipfw.ru/files/fbsd10g/0001-no-rt-mutex.patch
http://static.ipfw.ru/files/fbsd10g/kernel.gprof.txt
http://static.ipfw.ru/files/fbsd10g/prof_stats.txt

============= CONFIGS ====================

sysctl.conf:
kern.ipc.maxsockbuf=33554432
net.inet.udp.maxdgram=65535
net.inet.udp.recvspace=16777216
net.inet.tcp.sendbuf_auto=0
net.inet.tcp.recvbuf_auto=0
net.inet.tcp.sendspace=16777216
net.inet.tcp.recvspace=16777216
net.inet.ip.maxfragsperpacket=64


kern.random.sys.harvest.ethernet=0
kern.random.sys.harvest.point_to_point=0
kern.random.sys.harvest.interrupt=0


net.inet.ip.forwarding=1
net.inet.ip.fastforwarding=1
net.inet.ip.redirect=0

hw.intr_storm_threshold=20000

loader.conf:
kern.ipc.nmbclusters="512000"
ixgbe_load="YES"
hw.ixgbe.rx_process_limit="300"
hw.ixgbe.nojumbobuf="1"
hw.ixgbe.max_loop="100"
hw.ixgbe.max_interrupt_rate="20000"
hw.ixgbe.num_queues="11"


hw.ixgbe.txd=4096
hw.ixgbe.rxd=4096

kern.hwpmc.nbuffers=2048

debug.debugger_on_panic=1
net.inet.ip.fw.default_to_accept=1


kernel:
cpu HAMMER

ident           CORE_RELENG_7
options COMPAT_IA32

makeoptions     DEBUG=-g                # Build kernel with gdb(1) debug symbols

options         SCHED_ULE               # ULE scheduler
options         PREEMPTION              # Enable kernel thread preemption
options         INET                    # InterNETworking
options         INET6                   # IPv6 communications protocols
options         SCTP                    # Stream Control Transmission Protocol
options         FFS                     # Berkeley Fast Filesystem
options         SOFTUPDATES             # Enable FFS soft updates support
options         UFS_ACL                 # Support for access control lists
options         UFS_DIRHASH             # Improve performance on big directories
options         UFS_GJOURNAL            # Enable gjournal-based UFS journaling
options         MD_ROOT                 # MD is a potential root device
options         PROCFS                  # Process filesystem (requires PSEUDOFS)
options         PSEUDOFS                # Pseudo-filesystem framework
options         GEOM_PART_GPT           # GUID Partition Tables.
options         GEOM_LABEL              # Provides labelization
options         COMPAT_43TTY            # BSD 4.3 TTY compat [KEEP THIS!]
options         COMPAT_FREEBSD4         # Compatible with FreeBSD4
options         COMPAT_FREEBSD5         # Compatible with FreeBSD5
options         COMPAT_FREEBSD6         # Compatible with FreeBSD6
options         COMPAT_FREEBSD7         # Compatible with FreeBSD7
options COMPAT_FREEBSD32
options         SCSI_DELAY=4000         # Delay (in ms) before probing SCSI
options         KTRACE                  # ktrace(1) support
options         STACK                   # stack(9) support
options         SYSVSHM                 # SYSV-style shared memory
options         SYSVMSG                 # SYSV-style message queues
options         SYSVSEM                 # SYSV-style semaphores
options         _KPOSIX_PRIORITY_SCHEDULING # POSIX P1003_1B real-time extensions
options         KBD_INSTALL_CDEV        # install a CDEV entry in /dev
options         AUDIT                   # Security event auditing
options         HWPMC_HOOKS
options         GEOM_MIRROR
options         MROUTING
options         PRINTF_BUFR_SIZE=100

# To make an SMP kernel, the next two lines are needed
options         SMP                     # Symmetric MultiProcessor Kernel

# CPU frequency control
device          cpufreq

# Bus support.
device          acpi
device          pci

device          ada
device          ahci

# SCSI Controllers
device          ahd             # AHA39320/29320 and onboard AIC79xx devices
options         AHD_REG_PRETTY_PRINT    # Print register bitfields in debug
                                          # output.  Adds ~215k to driver.
device          mpt             # LSI-Logic MPT-Fusion
# SCSI peripherals
device          scbus           # SCSI bus (required for SCSI)
device          da              # Direct Access (disks)
device          pass            # Passthrough device (direct SCSI access)
device          ses             # SCSI Environmental Services (and SAF-TE)

# RAID controllers
device          mfi             # LSI MegaRAID SAS

# atkbdc0 controls both the keyboard and the PS/2 mouse
device          atkbdc          # AT keyboard controller
device          atkbd           # AT keyboard
device          psm             # PS/2 mouse

device          kbdmux          # keyboard multiplexer

device          vga             # VGA video card driver

device          splash          # Splash screen and screen saver support

# syscons is the default console driver, resembling an SCO console
device          sc

device          agp             # support several AGP chipsets

## Power management support (see NOTES for more options)
#device         apm
## Add suspend/resume support for the i8254.
#device         pmtimer

# Serial (COM) ports
#device         sio             # 8250, 16[45]50 based serial ports
device          uart            # Generic UART driver

# If you've got a "dumb" serial or parallel PCI card that is
# supported by the puc(4) glue driver, uncomment the following
# line to enable it (connects to sio, uart and/or ppc drivers):
#device         puc

# PCI Ethernet NICs.
device          em              # Intel PRO/1000 adapter Gigabit Ethernet Card
device          bce
#device         ixgb            # Intel PRO/10GbE Ethernet Card
#device         ixgbe

# PCI Ethernet NICs that use the common MII bus controller code.
# NOTE: Be sure to keep the 'device miibus' line in order to use these NICs!
device          miibus          # MII bus support

# Pseudo devices.
device          loop            # Network loopback
device          random          # Entropy device
device          ether           # Ethernet support
device          pty             # Pseudo-ttys (telnet etc)
device          md              # Memory "disks"
device          firmware        # firmware assist module
device          lagg

# The `bpf' device enables the Berkeley Packet Filter.
# Be aware of the administrative consequences of enabling this!
# Note that 'bpf' is required for DHCP.
device          bpf             # Berkeley packet filter

# USB support
device          uhci            # UHCI PCI->USB interface
device          ohci            # OHCI PCI->USB interface
device          ehci            # EHCI PCI->USB interface (USB 2.0)
device          usb             # USB Bus (required)
#device         udbp            # USB Double Bulk Pipe devices
device          uhid            # "Human Interface Devices"
device          ukbd            # Keyboard
device          umass           # Disks/Mass storage - Requires scbus and da
device          ums             # Mouse
# USB Serial devices
device          ucom            # Generic com ttys


options         INCLUDE_CONFIG_FILE

options         KDB
options         KDB_UNATTENDED
options         DDB
options         ALT_BREAK_TO_DEBUGGER

options         IPFIREWALL              #firewall
options         IPFIREWALL_FORWARD      #packet destination changes
options         IPFIREWALL_VERBOSE      #print information about
                                          # dropped packets
options         IPFIREWALL_VERBOSE_LIMIT=10000    #limit verbosity

# MRT support
options         ROUTETABLES=16

device          vlan                    #VLAN support

# Size of the kernel message buffer.  Should be N * pagesize.
options         MSGBUF_SIZE=4096000


options         SW_WATCHDOG
options         PANIC_REBOOT_WAIT_TIME=4

#
# Hardware watchdog timers:
#
# ichwd: Intel ICH watchdog timer
#
#device          ichwd

device          smbus
device          ichsmb
device          ipmi




-- 
WBR, Alexander



