Date: Sat, 20 Sep 2014 09:12:07 -0700
From: Rumen Telbizov <telbizov@gmail.com>
To: "freebsd-stable@freebsd.org" <freebsd-stable@freebsd.org>
Subject: FreeBSD 10 network performance problems
Message-ID: <CAENR%2B_VDVvnY0zWKVXHOjz2vWw27s%2ByVrz9ZFokZ=p6P6oFNvw@mail.gmail.com>
Hello everyone,

I am in the process of upgrading our main PF firewalls from 9.2-RC4 to 10.1-BETA1 (r271684), and as part of the process I have been testing the forwarding capability of FreeBSD 10 (pf firewall disabled) to establish a baseline and find any bottlenecks on a 10GbE network. My tests show that for the first 3-4Gbit/s of traffic things are great, and then the machine simply "hits the wall" at around 4-5Gbit/s. There is no gradual degradation, but a hard drop to 0% idle and a roughly 50/50 split of system and interrupt CPU load. I ran some diagnostics, and I was hoping someone could point me in the right direction as to what is happening and what I could do to improve the situation. The details I have collected so far are below.

I run multiple multithreaded iperf TCP instances to generate traffic which traverses the firewall. As mentioned above, for the first 3-4Gbit/s of traffic the machine doesn't even break a sweat.

*top header* during this load when things are good:

CPU 0:  0.0% user, 0.0% nice, 0.0% system, 27.6% interrupt, 72.4% idle
CPU 1:  0.0% user, 0.0% nice, 0.0% system, 27.6% interrupt, 72.4% idle
CPU 2:  0.0% user, 0.0% nice, 0.0% system,  8.7% interrupt, 91.3% idle
CPU 3:  0.0% user, 0.0% nice, 0.0% system, 17.3% interrupt, 82.7% idle
CPU 4:  0.0% user, 0.0% nice, 0.0% system, 12.6% interrupt, 87.4% idle
CPU 5:  0.0% user, 0.0% nice, 0.0% system,  4.7% interrupt, 95.3% idle
CPU 6:  0.0% user, 0.0% nice, 0.0% system, 13.4% interrupt, 86.6% idle
CPU 7:  0.0% user, 0.0% nice, 0.0% system, 26.0% interrupt, 74.0% idle
CPU 8:  0.0% user, 0.0% nice, 0.0% system, 16.5% interrupt, 83.5% idle
CPU 9:  0.0% user, 0.0% nice, 0.0% system,  1.6% interrupt, 98.4% idle
CPU 10: 0.0% user, 0.0% nice, 0.0% system, 19.7% interrupt, 80.3% idle
CPU 11: 0.0% user, 0.0% nice, 0.0% system,  7.1% interrupt, 92.9% idle

Full output at http://pastebin.com/gaaisXV8

*bmon* at the same time:

Interfaces │ RX bps      pps     │ TX bps      pps
  ix0      │ 240.33MiB   242.20K │ 221.41MiB   236.68K
  ix1      │ 246.51MiB   248.80K │ 261.61MiB   250.95K
> lagg0    │ 483.45MiB   492.42K │ 479.54MiB   488.02K

[bmon bar graphs omitted: RX/TX bytes and packets per second are steady across the whole 60-second window, peaking around 499.17/496.49 MiB/s and 508.27K/505.14K pps.]

To profile it without having it degrade sooner, I reduced traffic to 2Gbit/s and ran dtrace hotkernel and lock profiling. Here are the results.

*/usr/share/dtrace/toolkit/hotkernel* for 60 seconds:

kernel`__mtx_lock_flags      5812   0.8%
kernel`__mtx_unlock_flags    7200   1.0%
kernel`acpi_cpu_c1           7799   1.1%
kernel`__rw_rlock           11196   1.5%
kernel`spinlock_exit        14547   2.0%
kernel`cpu_idle            166700  22.8%
kernel`sched_idletd        461883  63.1%

Full output at http://pastebin.com/w7WfFwPG

*lock profiling* for 60 seconds:

$ head -n 2 good-locks ; cat good-locks | sort -n -k 4 | tail -n 6
debug.lock.prof.stats:
   max  wait_max      total  wait_total     count   avg wait_avg cnt_hold cnt_lock name
    22        24      94378       18481    264549     0       0        0    18639 /usr/src/sys/kern/sched_ule.c:886 (spin mutex:sched lock 10)
    22        24     112366       20360    219179     0       0        0    17220 /usr/src/sys/kern/sched_ule.c:888 (spin mutex:sched lock 2)
    18       319       3486       22352      4233     0       5        0     1640 /usr/src/sys/kern/subr_taskqueue.c:344 (sleep mutex:taskqueue)
    26        66    3219768      204875  14616220     0       0        0   133154 /usr/src/sys/net/route.c:439 (sleep mutex:rtentry)
    25        90    1923012     2353820  14615738     0       0        0  1562097 /usr/src/sys/netinet/ip_fastfwd.c:593 (sleep mutex:rtentry)
    26        91    1398443     2391458  14616137     0       0        0  1604332 /usr/src/sys/netinet/in_rmx.c:114 (sleep mutex:rtentry)

Full output at http://pastebin.com/qiG3ZuAH

Again, the measurements above demonstrate the state of the good/healthy system under a moderate traffic load of 3-4Gbit/s, with pf disabled and fast forwarding enabled. Here are the same measurements when I add an additional 1Gbit/s of traffic. The point at which performance tanks varies between 4-5Gbit/s, but it is always sudden: no gradual degradation, idle simply drops to 0.
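For anyone reproducing the ranking above: the `sort -n -k 4` pipeline orders lock sites by the wait_total column, so the worst offenders come out last. A minimal, self-contained sketch of that step, using a few abbreviated rows copied from the good-state profile in this message (the good-locks file name matches the dump above; on a live system the full dump comes from the debug.lock.prof.stats sysctl, which requires a kernel built with LOCK_PROFILING):

```shell
# Rank lock-profiling rows by wait_total (4th column), the same
# "sort -n -k 4 | tail" ranking used on the good-locks dump above.
# The rows below are copied (abbreviated) from the profile in this message.
cat > good-locks <<'EOF'
18 319 3486 22352 4233 0 5 0 1640 /usr/src/sys/kern/subr_taskqueue.c:344 (sleep mutex:taskqueue)
26 66 3219768 204875 14616220 0 0 0 133154 /usr/src/sys/net/route.c:439 (sleep mutex:rtentry)
25 90 1923012 2353820 14615738 0 0 0 1562097 /usr/src/sys/netinet/ip_fastfwd.c:593 (sleep mutex:rtentry)
EOF

# The largest wait_total sorts last; on these rows that is ip_fastfwd.c:593.
sort -n -k 4 good-locks | tail -n 1
```

The sketch only reproduces the ranking step on saved output; it does not touch the lock-profiling sysctls themselves.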
Let's take a look.

*top header* during this load when things are bad:

CPU 0:  0.0% user, 0.0% nice, 44.6% system, 55.4% interrupt, 0.0% idle
CPU 1:  0.0% user, 0.0% nice, 15.1% system, 84.9% interrupt, 0.0% idle
CPU 2:  0.0% user, 0.0% nice, 59.0% system, 40.3% interrupt, 0.7% idle
CPU 3:  0.0% user, 0.0% nice, 57.6% system, 42.4% interrupt, 0.0% idle
CPU 4:  0.0% user, 0.0% nice, 34.5% system, 63.3% interrupt, 2.2% idle
CPU 5:  0.0% user, 0.0% nice, 51.8% system, 48.2% interrupt, 0.0% idle
CPU 6:  0.0% user, 0.0% nice, 33.8% system, 66.2% interrupt, 0.0% idle
CPU 7:  0.7% user, 0.0% nice, 49.6% system, 49.6% interrupt, 0.0% idle
CPU 8:  0.0% user, 0.0% nice, 66.2% system, 33.8% interrupt, 0.0% idle
CPU 9:  0.0% user, 0.0% nice, 35.3% system, 64.7% interrupt, 0.0% idle
CPU 10: 0.0% user, 0.0% nice, 54.7% system, 45.3% interrupt, 0.0% idle
CPU 11: 0.0% user, 0.0% nice, 34.5% system, 65.5% interrupt, 0.0% idle

Full output at http://pastebin.com/9an9ZWv2

*bmon* at the same time shows inconsistent performance with big dips:

Interfaces │ RX bps      pps     │ TX bps      pps
  ix0      │ 153.89MiB   151.53K │ 179.69MiB   159.91K
  ix1      │ 176.56MiB   161.29K │ 145.17MiB   148.13K
> lagg0    │ 327.23MiB   333.05K │ 322.14MiB   328.13K

[bmon bar graphs omitted: RX/TX rates swing wildly over the 60-second window, mostly between roughly 108-330 MiB/s and 110-335K pps, with brief spikes up to about 657.86/648.60 MiB/s and 670.41K/660.27K pps.]

*/usr/share/dtrace/toolkit/hotkernel* for 60 seconds:

kernel`_rw_runlock_cookie    7709   1.1%
kernel`__rw_rlock           11182   1.6%
kernel`ip_fastforward       12231   1.7%
kernel`__mtx_lock_flags     22004   3.1%
kernel`__mtx_unlock_flags   35614   5.0%
kernel`__mtx_lock_sleep    560768  78.5%

Full output at http://pastebin.com/NurKwkWL

*lock profiling* for 60 seconds:

$ head -n 2 bad-locks ; cat bad-locks | sort -n -k 4 | tail -n 6
debug.lock.prof.stats:
    max  wait_max      total  wait_total      count    avg wait_avg cnt_hold cnt_lock name
 401766    191987    1857397      194162        179  10376    1084        0        2 /usr/src/sys/kern/kern_sysctl.c:1601 (sx:sysctl mem)
  21064    207907      62556      249066        396    157     628        0        3 /usr/src/sys/kern/kern_sysctl.c:1499 (sleep mutex:Giant)
      1    370663         17      372573         86      0    4332        0        2 /usr/src/sys/kern/kern_exit.c:429 (sx:allproc)
     14       648    8856844    46296098   15513956      0       2        0  1467849 /usr/src/sys/net/route.c:439 (sleep mutex:rtentry)
     15       958   13107581   252445472   15513486      0      16        0  9444644 /usr/src/sys/netinet/ip_fastfwd.c:593 (sleep mutex:rtentry)
     12       779   12500816   286324556   15513872      0      18        0  9788497 /usr/src/sys/netinet/in_rmx.c:114 (sleep mutex:rtentry)

Full output at http://pastebin.com/d54QmP13

System hardware: 12 x E5-2440 @ 2.40GHz, 24GB RAM, dual-port fiber Intel(R) PRO/10GbE in an LACP lagg.
System configuration details are available at http://pastebin.com/tPvs1MeD

It looks to me like heavy lock contention, but I don't understand why it happens in such an abrupt manner after a given threshold.

Other things I tried:
- Upgrading the ix (ixgbe) driver to the latest from Intel (2.5.25) - for some reason I then cannot send any packets out
- Enabling flowtables - no difference

Any help is highly appreciated. I can provide further details and run additional tests upon request.

Regards,

-- 
Rumen Telbizov
Unix Systems Administrator
http://telbizov.com