Date: Wed, 12 Aug 2015 16:03:44 +0430
From: Babak Farrokhi <farrokhi@FreeBSD.org>
To: "Alexander V. Chernikov" <melifaro@ipfw.ru>
Cc: Maxim Sobolev <sobomax@freebsd.org>, Olivier Cochard-Labbé <olivier@cochard.me>, FreeBSD Net <freebsd-net@freebsd.org>, "freebsd@intel.com" <freebsd@intel.com>, Jev Björsell <jev@sippysoft.com>
Subject: Re: Poor high-PPS performance of the 10G ixgbe(9) NIC/driver in FreeBSD 10.1
Message-ID: <55CB2F18.40902@FreeBSD.org>
In-Reply-To: <77171439377164@web21h.yandex.ru>
References: <CAH7qZft-CZCKv_7E9PE+4ZN3EExhezMnAb3kvShQzYhRYb2jMg@mail.gmail.com> <77171439377164@web21h.yandex.ru>
I ran into the same problem with almost the same hardware (Intel X520) on
10-STABLE. HT/SMT is disabled and the cards are configured with 8 queues,
with the same sysctl tunings that sobomax@ applied. I am not using lagg,
and FLOWTABLE is not enabled. I experimented with pmcstat (RESOURCE_STALLS)
a while ago; the results, including pmc output, callchains, flamegraphs and
gprof output, are available at [1] and [2]. I am seeing a huge number of
interrupts under a 200 kpps load:

# sysctl dev.ix | grep interrupt_rate
dev.ix.1.queue7.interrupt_rate: 125000
dev.ix.1.queue6.interrupt_rate: 6329
dev.ix.1.queue5.interrupt_rate: 500000
dev.ix.1.queue4.interrupt_rate: 100000
dev.ix.1.queue3.interrupt_rate: 50000
dev.ix.1.queue2.interrupt_rate: 500000
dev.ix.1.queue1.interrupt_rate: 500000
dev.ix.1.queue0.interrupt_rate: 100000
dev.ix.0.queue7.interrupt_rate: 500000
dev.ix.0.queue6.interrupt_rate: 6097
dev.ix.0.queue5.interrupt_rate: 10204
dev.ix.0.queue4.interrupt_rate: 5208
dev.ix.0.queue3.interrupt_rate: 5208
dev.ix.0.queue2.interrupt_rate: 71428
dev.ix.0.queue1.interrupt_rate: 5494
dev.ix.0.queue0.interrupt_rate: 6250

[1] http://farrokhi.net/~farrokhi/pmc/6/
[2] http://farrokhi.net/~farrokhi/pmc/7/

Regards,
Babak

Alexander V. Chernikov wrote:
> 12.08.2015, 02:28, "Maxim Sobolev" <sobomax@FreeBSD.org>:
>> Olivier, keep in mind that we are not "kernel forwarding" packets but
>> "app forwarding", i.e. the packet goes the full way
>> net->kernel->recvfrom->app->sendto->kernel->net, which is why we have
>> much lower PPS limits and why I think we actually benefit from the extra
>> queues. A single-threaded sendto() in a loop is CPU-bound at about 220K
>> PPS, and while running the test I observe that outbound traffic from one
>> thread is mapped to a specific queue (well, a pair of queues on two
>> separate adapters, due to the lagg load balancing). The peak performance
>> of that test is at 7 threads, which I believe corresponds to the number
>> of queues. We have plenty of CPU cores in the box (24) with HTT/SMT
>> disabled, and each queue is bound to its own CPU. This leaves us with at
>> least 8 CPUs fully capable of running our app. If you look at the CPU
>> utilization, we are at about 10% when the issue hits.
>
> In any case, it would be great if you could provide some profiling info,
> since there could be plenty of problematic places, from TX ring contention
> to locks inside UDP or even the (in)famous random entropy harvester.
> E.g. something like pmcstat -TS instructions -w1 might be sufficient to
> determine the reason.
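A rough sketch of that kind of pmcstat run, assuming hwpmc(4) is available
as a module; the /tmp file names and the 30-second window are only examples,
and exact event names vary by CPU (pmccontrol -L lists them).

Load the PMC driver if it is not compiled into the kernel, then take a
quick, top-style look at where CPU time goes, as suggested above:

# kldload hwpmc
# pmcstat -TS instructions -w1

To keep data for offline analysis, record roughly 30 seconds of system-wide
samples ("instructions" can be swapped for a stall counter such as
RESOURCE_STALLS.ANY) and post-process them into callchains; the FlameGraph
scripts (stackcollapse-pmc.pl + flamegraph.pl) can turn that output into an
SVG like the ones at [1] and [2]:

# pmcstat -S instructions -O /tmp/samples.pmc sleep 30
# pmcstat -R /tmp/samples.pmc -G /tmp/callchains.txt

While the test is running, vmstat -i gives a cross-check of the per-queue
interrupt_rate sysctls against the actual interrupt load:

# vmstat -i | grep ix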
>> ix0: <Intel(R) PRO/10GbE PCI-Express Network Driver, Version - 2.5.15>
>> port 0x6020-0x603f mem 0xc7c00000-0xc7dfffff,0xc7e04000-0xc7e07fff irq 40
>> at device 0.0 on pci3
>> ix0: Using MSIX interrupts with 9 vectors
>> ix0: Bound queue 0 to cpu 0
>> ix0: Bound queue 1 to cpu 1
>> ix0: Bound queue 2 to cpu 2
>> ix0: Bound queue 3 to cpu 3
>> ix0: Bound queue 4 to cpu 4
>> ix0: Bound queue 5 to cpu 5
>> ix0: Bound queue 6 to cpu 6
>> ix0: Bound queue 7 to cpu 7
>> ix0: Ethernet address: 0c:c4:7a:5e:be:64
>> ix0: PCI Express Bus: Speed 5.0GT/s Width x8
>> 001.000008 [2705] netmap_attach success for ix0 tx 8/4096 rx 8/4096
>> queues/slots
>> ix1: <Intel(R) PRO/10GbE PCI-Express Network Driver, Version - 2.5.15>
>> port 0x6000-0x601f mem 0xc7a00000-0xc7bfffff,0xc7e00000-0xc7e03fff irq 44
>> at device 0.1 on pci3
>> ix1: Using MSIX interrupts with 9 vectors
>> ix1: Bound queue 0 to cpu 8
>> ix1: Bound queue 1 to cpu 9
>> ix1: Bound queue 2 to cpu 10
>> ix1: Bound queue 3 to cpu 11
>> ix1: Bound queue 4 to cpu 12
>> ix1: Bound queue 5 to cpu 13
>> ix1: Bound queue 6 to cpu 14
>> ix1: Bound queue 7 to cpu 15
>> ix1: Ethernet address: 0c:c4:7a:5e:be:65
>> ix1: PCI Express Bus: Speed 5.0GT/s Width x8
>> 001.000009 [2705] netmap_attach success for ix1 tx 8/4096 rx 8/4096
>> queues/slots
>>
>> On Tue, Aug 11, 2015 at 4:14 PM, Olivier Cochard-Labbé
>> <olivier@cochard.me> wrote:
>>
>>> On Tue, Aug 11, 2015 at 11:18 PM, Maxim Sobolev <sobomax@freebsd.org>
>>> wrote:
>>>
>>>> Hi folks,
>>>>
>>> Hi,
>>>
>>>> We've been trying to migrate some of our high-PPS systems to new
>>>> hardware that has four X540-AT2 10G NICs, and we observed that
>>>> interrupt time goes through the roof after we cross around 200K PPS in
>>>> and 200K PPS out (two ports in LACP). The previous hardware was stable
>>>> up to about 350K PPS in and 350K out. I believe the old box was
>>>> equipped with I350 NICs and had an identical LACP configuration. The
>>>> new box also has a better CPU with more cores (24 vs. 16 before); the
>>>> CPU itself is 2 x E5-2690 v3.
>>>
>>> 200K PPS, and even 350K PPS, are very low values indeed.
>>> On an Intel Xeon L5630 (only 4 cores) with one X540-AT2 (i.e. two
>>> 10-Gigabit ports) I've reached about 1.8 Mpps with fastforwarding
>>> enabled [1].
>>> But my setup didn't use lagg(4): can you disable the lagg configuration
>>> and re-measure your performance without it?
>>>
>>> Do you let the Intel driver use 8 queues per port too?
>>> In my use case (forwarding the smallest UDP packet size), I get better
>>> behaviour by limiting the NIC to 4 queues (hw.ix.num_queues or
>>> hw.ixgbe.num_queues, I don't remember which), even though my system had
>>> 8 cores, and this with Gigabit Intel [2] or Chelsio [3] NICs.
>>>
>>> Don't forget to disable TSO and LRO too.
>>>
>>> Regards,
>>>
>>> Olivier
>>>
>>> [1]
>>> http://bsdrp.net/documentation/examples/forwarding_performance_lab_of_an_ibm_system_x3550_m3_with_10-gigabit_intel_x540-at2#graphs
>>> [2]
>>> http://bsdrp.net/documentation/examples/forwarding_performance_lab_of_a_superserver_5018a-ftn4#graph1
>>> [3]
>>> http://bsdrp.net/documentation/examples/forwarding_performance_lab_of_a_hp_proliant_dl360p_gen8_with_10-gigabit_with_10-gigabit_chelsio_t540-cr#reducing_nic_queues
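For reference, a rough sketch of the re-test Olivier suggests, assuming the
hw.ix.num_queues spelling of the tunable applies to this driver version and
using a placeholder address; adjust to the real setup.

Limit the driver to 4 queues via /boot/loader.conf (takes effect on the
next boot; older ixgbe drivers spell the tunable hw.ixgbe.num_queues):

hw.ix.num_queues="4"

Disable TSO and LRO on both ports:

# ifconfig ix0 -tso -lro
# ifconfig ix1 -tso -lro

To measure without lagg, tear the lagg interface down and put a test
address (192.0.2.10/24 here is only a placeholder) directly on one port:

# ifconfig lagg0 destroy
# ifconfig ix0 inet 192.0.2.10/24 up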