From owner-freebsd-performance@FreeBSD.ORG Tue Jul 3 16:12:24 2012 Return-Path: Delivered-To: performance@freebsd.org Received: from mx2.freebsd.org (mx2.freebsd.org [69.147.83.53]) by hub.freebsd.org (Postfix) with ESMTP id 13DD6106566C; Tue, 3 Jul 2012 16:12:24 +0000 (UTC) (envelope-from melifaro@FreeBSD.org) Received: from dhcp170-36-red.yandex.net (freefall.freebsd.org [IPv6:2001:4f8:fff6::28]) by mx2.freebsd.org (Postfix) with ESMTP id 333D114DE4F; Tue, 3 Jul 2012 16:12:21 +0000 (UTC) Message-ID: <4FF319A2.6070905@FreeBSD.org> Date: Tue, 03 Jul 2012 20:11:14 +0400 From: "Alexander V. Chernikov" User-Agent: Mozilla/5.0 (X11; FreeBSD amd64; rv:12.0) Gecko/20120511 Thunderbird/12.0.1 MIME-Version: 1.0 To: net@freebsd.org, hackers@freebsd.org, performance@freebsd.org Content-Type: text/plain; charset=UTF-8; format=flowed Content-Transfer-Encoding: 7bit Cc: Subject: FreeBSD 10G forwarding performance @Intel X-BeenThere: freebsd-performance@freebsd.org X-Mailman-Version: 2.1.5 Precedence: list List-Id: Performance/tuning List-Unsubscribe: , List-Archive: List-Post: List-Help: List-Subscribe: , X-List-Received-Date: Tue, 03 Jul 2012 16:12:24 -0000 Hello list! I'm quite stuck with bad forwarding performance on many FreeBSD boxes doing firewalling. Typical configuration is E5645 / E5675 @ Intel 82599 NIC. HT is turned off. (Configs and tunables below). I'm mostly concerned with unidirectional traffic flowing to single interface (e.g. using singe route entry). In most cases system can forward no more than 700 (or 1400) kpps which is quite a bad number (Linux does, say, 5MPPs on nearly the same hardware). Test scenario: Ixia XM2 (traffic generator) <> ix0 (FreeBSD). Ixia sends 64byte IP packets from vlan10 (10.100.0.64 - 10.100.0.156) to destinations in vlan11 (10.100.1.128 - 10.100.1.192). Static arps are configured for all destination addresses. Traffic level is slightly above or slightly below system performance. ================= Test 1 ======================= Kernel: FreeBSD-8-S r237994, stock drivers, stock routing, no FLOWTABLE, no firewall Traffic: 1-1 flow (1 src, 1 dst) (This is actually a bit different from described above) Result: input (ix0) output packets errs idrops bytes packets errs bytes colls 878k 48k 0 59M 878k 0 56M 0 874k 48k 0 59M 874k 0 56M 0 875k 48k 0 59M 875k 0 56M 0 16:41 [0] test15# top -nCHSIzs1 | awk '$5 ~ /(K|SIZE)/ { printf " %7s %2s %6s %10s %15s %s\n", $7, $8, $9, $10, $11, $12}' STATE C TIME CPU COMMAND CPU6 6 17:28 100.00% kernel{ix0 que} CPU9 9 20:42 60.06% intr{irq265: ix0:que 16:41 [0] test15# vmstat -i | grep ix0 irq256: ix0:que 0 500796 167 irq257: ix0:que 1 6693573 2245 irq258: ix0:que 2 2572380 862 irq259: ix0:que 3 3166273 1062 irq260: ix0:que 4 9691706 3251 irq261: ix0:que 5 10766434 3611 irq262: ix0:que 6 8933774 2996 irq263: ix0:que 7 5246879 1760 irq264: ix0:que 8 3548930 1190 irq265: ix0:que 9 11817986 3964 irq266: ix0:que 10 227561 76 irq267: ix0:link 1 0 Note that system is using 2 cores to forward, so 12 cores should be able to forward 4+ mpps which is more or less consistent with Linux results. Note that interrupts on all queues are (as far as I understand from the fact that AIM is turned off and interrupt rates are the same from previous test). Additionally, despite hw.intr_storm_threshold = 200k, i'm constantly getting interrupt storm detected on "irq265:"; throttling interrupt source message. ================= Test 2 ======================= Kernel: FreeBSD-8-S r237994, stock drivers, stock routing, no FLOWTABLE, no firewall Traffic: Unidirectional many-2-many 16:20 [0] test15# netstat -I ix0 -hw 1 input (ix0) output packets errs idrops bytes packets errs bytes colls 507k 651k 0 74M 508k 0 32M 0 506k 652k 0 74M 507k 0 28M 0 509k 652k 0 74M 508k 0 37M 0 16:28 [0] test15# top -nCHSIzs1 | awk '$5 ~ /(K|SIZE)/ { printf " %7s %2s %6s %10s %15s %s\n", $7, $8, $9, $10, $11, $12}' STATE C TIME CPU COMMAND CPU10 6 0:40 100.00% kernel{ix0 que} CPU2 2 11:47 84.86% intr{irq258: ix0:que CPU3 3 11:50 81.88% intr{irq259: ix0:que CPU8 8 11:38 77.69% intr{irq264: ix0:que CPU7 7 11:24 77.10% intr{irq263: ix0:que WAIT 1 10:10 74.76% intr{irq257: ix0:que CPU4 4 8:57 63.48% intr{irq260: ix0:que CPU6 6 8:35 61.96% intr{irq262: ix0:que CPU9 9 14:01 60.79% intr{irq265: ix0:que RUN 0 9:07 59.67% intr{irq256: ix0:que WAIT 5 6:13 43.26% intr{irq261: ix0:que CPU11 11 5:19 35.89% kernel{ix0 que} - 4 3:41 25.49% kernel{ix0 que} - 1 3:22 21.78% kernel{ix0 que} - 1 2:55 17.68% kernel{ix0 que} - 4 2:24 16.55% kernel{ix0 que} - 1 9:54 14.99% kernel{ix0 que} CPU0 11 2:13 14.26% kernel{ix0 que} 16:07 [0] test15# vmstat -i | grep ix0 irq256: ix0:que 0 13654 15 irq257: ix0:que 1 87043 96 irq258: ix0:que 2 39604 44 irq259: ix0:que 3 48308 53 irq260: ix0:que 4 138002 153 irq261: ix0:que 5 169596 188 irq262: ix0:que 6 107679 119 irq263: ix0:que 7 72769 81 irq264: ix0:que 8 30878 34 irq265: ix0:que 9 1002032 1115 irq266: ix0:que 10 10967 12 irq267: ix0:link 1 0 Note that all cores are loaded more or less evenly, but the result is _worse_. The first reason for this is mtx_lock which is acquired twice on every lookup (once in in in_matroute() where it can possibly be removed and once again in rtalloc1_fib()). Latter one is addressed by andre@ in r234650). Additionally, despite itreads are bound to singe CPU each, kernel que are not in stock setup. However, configuration with 5 queues and 5 kernel threads bound to different CPU provides the same bad results. ================= Test 3 ======================= Kernel: FreeBSD-8-S June 4 SVN, +merged ifaddrlock, stock drivers, stock routing, no FLOWTABLE, no firewall packets errs idrops bytes packets errs bytes colls 580k 18k 0 38M 579k 0 37M 0 581k 26k 0 39M 580k 0 37M 0 580k 24k 0 39M 580k 0 37M 0 ................ Enabling ipfw _increases_ performance a bit: 604k 0 0 39M 604k 0 39M 0 604k 0 0 39M 604k 0 39M 0 582k 19k 0 38M 568k 0 37M 0 527k 81k 0 39M 530k 0 34M 0 605k 28 0 39M 605k 0 39M 0 ================= Test 3.1 ======================= Same as test 3, the only difference is the following: route add -net 10.100.1.160/27 -iface vlan11. input (ix0) output packets errs idrops bytes packets errs bytes colls 543k 879k 0 91M 544k 0 35M 0 547k 870k 0 91M 545k 0 35M 0 541k 870k 0 91M 539k 0 30M 0 952k 565k 0 97M 962k 0 48M 0 1.2M 228k 0 91M 1.2M 0 92M 0 1.2M 226k 0 90M 1.1M 0 76M 0 1.1M 228k 0 91M 1.2M 0 76M 0 1.2M 233k 0 90M 1.2M 0 76M 0 ================= Test 3.2 ======================= Same as test 3, splitting destination into 4 smaller rtes: route add -net 10.100.1.128/28 -iface vlan11 route add -net 10.100.1.144/28 -iface vlan11 route add -net 10.100.1.160/28 -iface vlan11 route add -net 10.100.1.176/28 -iface vlan11 input (ix0) output packets errs idrops bytes packets errs bytes colls 1.4M 0 0 106M 1.6M 0 106M 0 1.8M 0 0 106M 1.6M 0 71M 0 1.6M 0 0 106M 1.6M 0 71M 0 1.6M 0 0 87M 1.6M 0 71M 0 1.6M 0 0 126M 1.6M 0 212M 0 ================= Test 3.3 ======================= Same as test 3, splitting destination into 16 smaller rtes: input (ix0) output packets errs idrops bytes packets errs bytes colls 1.6M 0 0 118M 1.8M 0 118M 0 2.0M 0 0 118M 1.8M 0 119M 0 1.8M 0 0 119M 1.8M 0 79M 0 1.8M 0 0 117M 1.8M 0 157M 0 ================= Test 4 ======================= Kernel: FreeBSD-8-S June 4 SVN, stock drivers, routing patch 1, no FLOWTABLE, no firewall input (ix0) output packets errs idrops bytes packets errs bytes colls 1.8M 0 0 114M 1.9M 0 114M 0 1.7M 0 0 114M 1.7M 0 114M 0 1.8M 0 0 114M 1.8M 0 114M 0 1.7M 0 0 114M 1.7M 0 114M 0 1.8M 0 0 114M 1.8M 0 74M 0 1.5M 0 0 114M 1.8M 0 74M 0 2M 0 0 114M 1.8M 0 194M 0 Patch 1 totally eliminates mtx_lock for fastforwarding path to get an idea how much performance we can achieve. The result is nearly the same as in 3.3 ================= Test 4.1 ======================= Same as the test 4, same traffic level, enabling firewall with single allow rule (evaluating RLOCK performance) 22:35 [0] test15# netstat -I ix0 -hw 1 input (ix0) output packets errs idrops bytes packets errs bytes colls 1.8M 149k 0 114M 1.6M 0 142M 0 1.4M 148k 0 85M 1.6M 0 104M 0 1.8M 149k 0 143M 1.6M 0 104M 0 1.6M 151k 0 114M 1.6M 0 104M 0 1.6M 151k 0 114M 1.6M 0 104M 0 1.4M 152k 0 114M 1.6M 0 104M 0 E.g something like 10% performance loss. ================= Test 4.2 ======================= Same as test4, playing with number of queues. 5queues, same traffic level 1.5M 225k 0 114M 1.5M 0 99M 0 ================= Test 4.3 ======================= Same as test 4, HT on, number of queues = 16 input (ix0) output packets errs idrops bytes packets errs bytes colls 2.4M 0 0 157M 2.4M 0 156M 0 2.4M 0 0 156M 2.4M 0 157M 0 However, enabling firewall immediately drops rate to 1.9mpps which is nearly the same as 4.1 (and complicated fw ruleset possibly kill HT core much faster) ================= Test 4.3 ======================= Same as test4, kerwnel ix0 que Tx threads bound to specific CPUs (corresponding to RX ): 18:02 [0] test15# procstat -ak | grep ix0 | sort -nk 2 12 100045 intr irq256: ix0:que 0 100046 kernel ix0 que 12 100047 intr irq257: ix0:que 0 100048 kernel ix0 que mi_switch sleepq_wait msleep_spin taskqueue_thread_loop fork_exit fork_trampoline 12 100049 intr irq258: ix0:que .. test15# for i in `jot 12 0`; do cpuset -l $i -t $((100046+2*$i)); done Result: input (ix0) output packets errs idrops bytes packets errs bytes colls 2.1M 0 0 139M 2M 0 193M 0 2.1M 0 0 139M 2.3M 0 139M 0 2.1M 0 0 139M 2.1M 0 85M 0 2.1M 0 0 139M 2.1M 0 193M 0 Quite considerable increase, however this works better for uniform traffic distribution only. ================= Test 5 ======================= Same as test 4, make radix use rmlock (r234648, r234649). Result: 1.7 MPPS. ================= Test 6 ======================= Same as test 4 + FLOWTABLE Result: 1.7 MPPS. ================= Test 7 ======================= Same as test 4, build with GCC 4.7 Result: No performance gain Further investigations: ================= Test 8 ======================= Test 4 setup with kernel build with LOCK_PROFILING. 17:46 [0] test15# sysctl debug.lock.prof.enable=1 ; sleep 2 ; sysctl debug.lock.prof.enable=0 920k 0 0 59M 920k 0 59M 0 875k 0 0 59M 920k 0 59M 0 628k 0 0 39M 566k 0 45M 0 79k 2.7M 0 186M 57k 0 6.5M 0 71k 878k 0 61M 73k 0 4.0M 0 891k 254k 0 72M 917k 0 54M 0 920k 0 0 59M 920k 0 59M 0 When enabled, forwarding performance goes down to 60kpps. Enabled for 2 seconds (so actually 130k packets forwarded), results attached as separate file. Several hundred lock contentions in ixgbe, that's all. ================= Test 9 ======================= Same as test 4 setup with hwpmc. Results attached. ================= Test 9 ======================= Kernel: Freebsd-9-S. No major difference Some (my) preliminary conclusions: 1) rte mtx_lock should (and can) be eliminated from stock kernel. (And it can be done more or less easily for in_matroute). 2) rmlock vs rwlock performance difference is insignificant (maybe because of 3) ) 3) there are locks contention between ixgbe taskq threads and ithreads. I'm not sure if taskq threads are necessary in the case of packet forwarding and not traffic generation. Maybe I'm missing something else? (l2 cache misses or other things). What else I can do to debug this further? Relevant files: http://static.ipfw.ru/files/fbsd10g/0001-no-rt-mutex.patch http://static.ipfw.ru/files/fbsd10g/kernel.gprof.txt http://static.ipfw.ru/files/fbsd10g/prof_stats.txt ============= CONFIGS ==================== sysctl.conf: kern.ipc.maxsockbuf=33554432 net.inet.udp.maxdgram=65535 net.inet.udp.recvspace=16777216 net.inet.tcp.sendbuf_auto=0 net.inet.tcp.recvbuf_auto=0 net.inet.tcp.sendspace=16777216 net.inet.tcp.recvspace=16777216 net.inet.ip.maxfragsperpacket=64 kern.random.sys.harvest.ethernet=0 kern.random.sys.harvest.point_to_point=0 kern.random.sys.harvest.interrupt=0 net.inet.ip.forwarding=1 net.inet.ip.fastforwarding=1 net.inet.ip.redirect=0 hw.intr_storm_threshold=20000 loader.conf: kern.ipc.nmbclusters="512000" ixgbe_load="YES" hw.ixgbe.rx_process_limit="300" hw.ixgbe.nojumbobuf="1" hw.ixgbe.max_loop="100" hw.ixgbe.max_interrupt_rate="20000" hw.ixgbe.num_queues="11" hw.ixgbe.txd=4096 hw.ixgbe.rxd=4096 kern.hwpmc.nbuffers=2048 debug.debugger_on_panic=1 net.inet.ip.fw.default_to_accept=1 kernel: cpu HAMMER ident CORE_RELENG_7 options COMPAT_IA32 makeoptions DEBUG=-g # Build kernel with gdb(1) debug symbols options SCHED_ULE # ULE scheduler options PREEMPTION # Enable kernel thread preemption options INET # InterNETworking options INET6 # IPv6 communications protocols options SCTP # Stream Control Transmission Protocol options FFS # Berkeley Fast Filesystem options SOFTUPDATES # Enable FFS soft updates support options UFS_ACL # Support for access control lists options UFS_DIRHASH # Improve performance on big directories options UFS_GJOURNAL # Enable gjournal-based UFS journaling options MD_ROOT # MD is a potential root device options PROCFS # Process filesystem (requires PSEUDOFS) options PSEUDOFS # Pseudo-filesystem framework options GEOM_PART_GPT # GUID Partition Tables. options GEOM_LABEL # Provides labelization options COMPAT_43TTY # BSD 4.3 TTY compat [KEEP THIS!] options COMPAT_FREEBSD4 # Compatible with FreeBSD4 options COMPAT_FREEBSD5 # Compatible with FreeBSD5 options COMPAT_FREEBSD6 # Compatible with FreeBSD6 options COMPAT_FREEBSD7 # Compatible with FreeBSD7 options COMPAT_FREEBSD32 options SCSI_DELAY=4000 # Delay (in ms) before probing SCSI options KTRACE # ktrace(1) support options STACK # stack(9) support options SYSVSHM # SYSV-style shared memory options SYSVMSG # SYSV-style message queues options SYSVSEM # SYSV-style semaphores options _KPOSIX_PRIORITY_SCHEDULING # POSIX P1003_1B real-time extensions options KBD_INSTALL_CDEV # install a CDEV entry in /dev options AUDIT # Security event auditing options HWPMC_HOOKS options GEOM_MIRROR options MROUTING options PRINTF_BUFR_SIZE=100 # To make an SMP kernel, the next two lines are needed options SMP # Symmetric MultiProcessor Kernel # CPU frequency control device cpufreq # Bus support. device acpi device pci device ada device ahci # SCSI Controllers device ahd # AHA39320/29320 and onboard AIC79xx devices options AHD_REG_PRETTY_PRINT # Print register bitfields in debug # output. Adds ~215k to driver. device mpt # LSI-Logic MPT-Fusion # SCSI peripherals device scbus # SCSI bus (required for SCSI) device da # Direct Access (disks) device pass # Passthrough device (direct SCSI access) device ses # SCSI Environmental Services (and SAF-TE) # RAID controllers device mfi # LSI MegaRAID SAS # atkbdc0 controls both the keyboard and the PS/2 mouse device atkbdc # AT keyboard controller device atkbd # AT keyboard device psm # PS/2 mouse device kbdmux # keyboard multiplexer device vga # VGA video card driver device splash # Splash screen and screen saver support # syscons is the default console driver, resembling an SCO console device sc device agp # support several AGP chipsets ## Power management support (see NOTES for more options) #device apm ## Add suspend/resume support for the i8254. #device pmtimer # Serial (COM) ports #device sio # 8250, 16[45]50 based serial ports device uart # Generic UART driver # If you've got a "dumb" serial or parallel PCI card that is # supported by the puc(4) glue driver, uncomment the following # line to enable it (connects to sio, uart and/or ppc drivers): #device puc # PCI Ethernet NICs. device em # Intel PRO/1000 adapter Gigabit Ethernet Card device bce #device ixgb # Intel PRO/10GbE Ethernet Card #device ixgbe # PCI Ethernet NICs that use the common MII bus controller code. # NOTE: Be sure to keep the 'device miibus' line in order to use these NICs! device miibus # MII bus support # Pseudo devices. device loop # Network loopback device random # Entropy device device ether # Ethernet support device pty # Pseudo-ttys (telnet etc) device md # Memory "disks" device firmware # firmware assist module device lagg # The `bpf' device enables the Berkeley Packet Filter. # Be aware of the administrative consequences of enabling this! # Note that 'bpf' is required for DHCP. device bpf # Berkeley packet filter # USB support device uhci # UHCI PCI->USB interface device ohci # OHCI PCI->USB interface device ehci # EHCI PCI->USB interface (USB 2.0) device usb # USB Bus (required) #device udbp # USB Double Bulk Pipe devices device uhid # "Human Interface Devices" device ukbd # Keyboard device umass # Disks/Mass storage - Requires scbus and da device ums # Mouse # USB Serial devices device ucom # Generic com ttys options INCLUDE_CONFIG_FILE options KDB options KDB_UNATTENDED options DDB options ALT_BREAK_TO_DEBUGGER options IPFIREWALL #firewall options IPFIREWALL_FORWARD #packet destination changes options IPFIREWALL_VERBOSE #print information about # dropped packets options IPFIREWALL_VERBOSE_LIMIT=10000 #limit verbosity # MRT support options ROUTETABLES=16 device vlan #VLAN support # Size of the kernel message buffer. Should be N * pagesize. options MSGBUF_SIZE=4096000 options SW_WATCHDOG options PANIC_REBOOT_WAIT_TIME=4 # # Hardware watchdog timers: # # ichwd: Intel ICH watchdog timer # #device ichwd device smbus device ichsmb device ipmi -- WBR, Alexander