Date: Wed, 12 Aug 2015 16:03:44 +0430
From: Babak Farrokhi <farrokhi@FreeBSD.org>
To: "Alexander V. Chernikov" <melifaro@ipfw.ru>
Cc: Maxim Sobolev <sobomax@freebsd.org>, Olivier Cochard-Labbé <olivier@cochard.me>, FreeBSD Net <freebsd-net@freebsd.org>, "freebsd@intel.com" <freebsd@intel.com>, Jev Björsell <jev@sippysoft.com>
Subject: Re: Poor high-PPS performance of the 10G ixgbe(9) NIC/driver in FreeBSD 10.1
Message-ID: <55CB2F18.40902@FreeBSD.org>
In-Reply-To: <77171439377164@web21h.yandex.ru>
References: <CAH7qZft-CZCKv_7E9PE+4ZN3EExhezMnAb3kvShQzYhRYb2jMg@mail.gmail.com> <77171439377164@web21h.yandex.ru>
I ran into the same problem with almost the same hardware (Intel X520) on
10-STABLE. HT/SMT is disabled and the cards are configured with 8 queues,
with the same sysctl tunings that sobomax@ applied. I am not using lagg,
and FLOWTABLE is not enabled. I experimented with pmcstat (RESOURCE_STALLS)
a while ago; the results, including pmc output, callchains, flamegraphs and
gprof output, are available at [1] and [2]. I am seeing a huge number of
interrupts under a 200 kpps load:

# sysctl dev.ix | grep interrupt_rate
dev.ix.1.queue7.interrupt_rate: 125000
dev.ix.1.queue6.interrupt_rate: 6329
dev.ix.1.queue5.interrupt_rate: 500000
dev.ix.1.queue4.interrupt_rate: 100000
dev.ix.1.queue3.interrupt_rate: 50000
dev.ix.1.queue2.interrupt_rate: 500000
dev.ix.1.queue1.interrupt_rate: 500000
dev.ix.1.queue0.interrupt_rate: 100000
dev.ix.0.queue7.interrupt_rate: 500000
dev.ix.0.queue6.interrupt_rate: 6097
dev.ix.0.queue5.interrupt_rate: 10204
dev.ix.0.queue4.interrupt_rate: 5208
dev.ix.0.queue3.interrupt_rate: 5208
dev.ix.0.queue2.interrupt_rate: 71428
dev.ix.0.queue1.interrupt_rate: 5494
dev.ix.0.queue0.interrupt_rate: 6250

[1] http://farrokhi.net/~farrokhi/pmc/6/
[2] http://farrokhi.net/~farrokhi/pmc/7/

Regards,
Babak

Alexander V. Chernikov wrote:
> 12.08.2015, 02:28, "Maxim Sobolev" <sobomax@FreeBSD.org>:
>> Olivier, keep in mind that we are not "kernel forwarding" packets but
>> "app forwarding", i.e. the packet goes the full way
>> net->kernel->recvfrom->app->sendto->kernel->net, which is why we have
>> much lower PPS limits and why I think we actually benefit from the extra
>> queues. A single-threaded sendto() in a loop is CPU-bound at about 220K
>> PPS, and while running the test I observe that outbound traffic from one
>> thread is mapped to a specific queue (well, a pair of queues on two
>> separate adapters, due to the lagg load balancing). The peak performance
>> of that test is at 7 threads, which I believe corresponds to the number
>> of queues. We have plenty of CPU cores in the box (24) with HTT/SMT
>> disabled, and each queue is bound to its own CPU. This leaves us with at
>> least 8 CPUs fully capable of running our app. If you look at the CPU
>> utilization, we are at about 10% when the issue hits.
>
> In any case, it would be great if you could provide some profiling info,
> since there could be plenty of problematic places, from TX ring contention
> to locks inside UDP or even the (in)famous random entropy harvester.
> E.g. something like pmcstat -TS instructions -w1 might be sufficient to
> determine the reason.
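A rough sketch of that kind of pmcstat run, assuming hwpmc(4) is available
as a module; the /tmp file names and the 30-second window are only examples,
and exact event names vary by CPU (pmccontrol -L lists them).

Load the PMC driver if it is not compiled into the kernel, then take a
quick, top-style look at where CPU time goes, as suggested above:

# kldload hwpmc
# pmcstat -TS instructions -w1

To keep data for offline analysis, record roughly 30 seconds of system-wide
samples ("instructions" can be swapped for a stall counter such as
RESOURCE_STALLS.ANY) and post-process them into callchains; the FlameGraph
scripts (stackcollapse-pmc.pl + flamegraph.pl) can turn that output into an
SVG like the ones at [1] and [2]:

# pmcstat -S instructions -O /tmp/samples.pmc sleep 30
# pmcstat -R /tmp/samples.pmc -G /tmp/callchains.txt

While the test is running, vmstat -i gives a cross-check of the per-queue
interrupt_rate sysctls against the actual interrupt load:

# vmstat -i | grep ix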
>> ix0: <Intel(R) PRO/10GbE PCI-Express Network Driver, Version - 2.5.15>
>> port 0x6020-0x603f mem 0xc7c00000-0xc7dfffff,0xc7e04000-0xc7e07fff irq 40
>> at device 0.0 on pci3
>> ix0: Using MSIX interrupts with 9 vectors
>> ix0: Bound queue 0 to cpu 0
>> ix0: Bound queue 1 to cpu 1
>> ix0: Bound queue 2 to cpu 2
>> ix0: Bound queue 3 to cpu 3
>> ix0: Bound queue 4 to cpu 4
>> ix0: Bound queue 5 to cpu 5
>> ix0: Bound queue 6 to cpu 6
>> ix0: Bound queue 7 to cpu 7
>> ix0: Ethernet address: 0c:c4:7a:5e:be:64
>> ix0: PCI Express Bus: Speed 5.0GT/s Width x8
>> 001.000008 [2705] netmap_attach success for ix0 tx 8/4096 rx 8/4096
>> queues/slots
>> ix1: <Intel(R) PRO/10GbE PCI-Express Network Driver, Version - 2.5.15>
>> port 0x6000-0x601f mem 0xc7a00000-0xc7bfffff,0xc7e00000-0xc7e03fff irq 44
>> at device 0.1 on pci3
>> ix1: Using MSIX interrupts with 9 vectors
>> ix1: Bound queue 0 to cpu 8
>> ix1: Bound queue 1 to cpu 9
>> ix1: Bound queue 2 to cpu 10
>> ix1: Bound queue 3 to cpu 11
>> ix1: Bound queue 4 to cpu 12
>> ix1: Bound queue 5 to cpu 13
>> ix1: Bound queue 6 to cpu 14
>> ix1: Bound queue 7 to cpu 15
>> ix1: Ethernet address: 0c:c4:7a:5e:be:65
>> ix1: PCI Express Bus: Speed 5.0GT/s Width x8
>> 001.000009 [2705] netmap_attach success for ix1 tx 8/4096 rx 8/4096
>> queues/slots
>>
>> On Tue, Aug 11, 2015 at 4:14 PM, Olivier Cochard-Labbé
>> <olivier@cochard.me> wrote:
>>
>>> On Tue, Aug 11, 2015 at 11:18 PM, Maxim Sobolev <sobomax@freebsd.org>
>>> wrote:
>>>
>>>> Hi folks,
>>>>
>>> Hi,
>>>
>>>> We've been trying to migrate some of our high-PPS systems to new
>>>> hardware that has four X540-AT2 10G NICs, and we observed that
>>>> interrupt time goes through the roof after we cross around 200K PPS in
>>>> and 200K PPS out (two ports in LACP). The previous hardware was stable
>>>> up to about 350K PPS in and 350K out. I believe the old box was
>>>> equipped with I350 NICs and had an identical LACP configuration. The
>>>> new box also has a better CPU with more cores (24 vs. 16 before); the
>>>> CPU itself is 2 x E5-2690 v3.
>>>
>>> 200K PPS, and even 350K PPS, are very low values indeed.
>>> On an Intel Xeon L5630 (only 4 cores) with one X540-AT2 (i.e. two
>>> 10-Gigabit ports) I've reached about 1.8 Mpps with fastforwarding
>>> enabled [1].
>>> But my setup didn't use lagg(4): can you disable the lagg configuration
>>> and re-measure your performance without it?
>>>
>>> Do you let the Intel driver use 8 queues per port too?
>>> In my use case (forwarding the smallest UDP packet size), I get better
>>> behaviour by limiting the NIC to 4 queues (hw.ix.num_queues or
>>> hw.ixgbe.num_queues, I don't remember which), even though my system had
>>> 8 cores, and this with Gigabit Intel [2] or Chelsio [3] NICs.
>>>
>>> Don't forget to disable TSO and LRO too.
>>>
>>> Regards,
>>>
>>> Olivier
>>>
>>> [1]
>>> http://bsdrp.net/documentation/examples/forwarding_performance_lab_of_an_ibm_system_x3550_m3_with_10-gigabit_intel_x540-at2#graphs
>>> [2]
>>> http://bsdrp.net/documentation/examples/forwarding_performance_lab_of_a_superserver_5018a-ftn4#graph1
>>> [3]
>>> http://bsdrp.net/documentation/examples/forwarding_performance_lab_of_a_hp_proliant_dl360p_gen8_with_10-gigabit_with_10-gigabit_chelsio_t540-cr#reducing_nic_queues
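For reference, a rough sketch of the re-test Olivier suggests, assuming the
hw.ix.num_queues spelling of the tunable applies to this driver version and
using a placeholder address; adjust to the real setup.

Limit the driver to 4 queues via /boot/loader.conf (takes effect on the
next boot; older ixgbe drivers spell the tunable hw.ixgbe.num_queues):

hw.ix.num_queues="4"

Disable TSO and LRO on both ports:

# ifconfig ix0 -tso -lro
# ifconfig ix1 -tso -lro

To measure without lagg, tear the lagg interface down and put a test
address (192.0.2.10/24 here is only a placeholder) directly on one port:

# ifconfig lagg0 destroy
# ifconfig ix0 inet 192.0.2.10/24 up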