Date:      Tue, 28 Feb 2017 21:07:15 +0100
From:      Sydney Meyer <meyer.sydney@googlemail.com>
To:        freebsd-net@freebsd.org
Subject:   Re: Disappointing packets-per-second performance results on a Dell, PE R530
Message-ID:  <81FD8C20-B6AA-42A0-899A-62D4DA4A11DC@googlemail.com>
In-Reply-To: <7546B456-94A9-4603-A07F-4E0AB0285E1A@gmail.com>
References:  <ebb04a3e-bcde-6d50-af63-348e8d06fcba@gmail.com> <40a413f3-2c44-ee9d-9961-67114d8dffca@gmail.com> <20170205175531.GA20287@dwarf> <7d349edd-0c81-2e3f-d3b9-27af232de76d@gmail.com> <20170209153409.GG41673@dwarf> <6ad029e0-86c6-af3d-8fc3-694d4bcdc683@gmail.com> <7546B456-94A9-4603-A07F-4E0AB0285E1A@gmail.com>

Hello,

perhaps you've already gone through these, but the site below also summarizes a few important tweaks, e.g. disabling entropy harvesting.

It also has detailed and well-documented benchmarks of FreeBSD routing performance, IIRC with Chelsio cards as well:

https://bsdrp.net/documentation/technical_docs/performance
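
For example, the entropy tweak amounts to something like this. A sketch only: the mask value is from memory (it keeps the cheap sources and drops the per-packet and per-interrupt ones), so double-check random(4) on your release:

# /etc/rc.conf -- stop harvesting entropy from NIC traffic and
# interrupts, which costs noticeably at high packet rates
harvest_mask="351"

# runtime equivalent on 11.x (sysctl name may differ on older releases):
sysctl kern.random.harvest.mask=351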

Sydney

> On 28 Feb 2017, at 07:35, Ben RUBSON <ben.rubson@gmail.com> wrote:
>
> Hi,
>
> Try disabling NUMA in your BIOS settings?
> I had a perf issue on a 2-CPU (24-core) server: I was not able to run a 40G NIC at its max throughput.
> We investigated a lot; disabling NUMA in the BIOS was the solution, as NUMA is not fully supported yet (as of stable/11).
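>
> (A quick way to check the effect, assuming the sysctl exists on your release: "sysctl vm.ndomains" reports how many memory domains the kernel detected; with NUMA disabled in the BIOS it should read 1.)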
>
> Ben
>
>> On 28 Feb 2017, at 03:13, Caraballo-vega, Jordan A. (GSFC-6062)[COMPUTER SCIENCE CORP] <jordancaraballo87@gmail.com> wrote:
>>
>> As a summary, we have a Dell R530 with a Chelsio T580 card running -CURRENT.
>>
>> In an attempt to reduce the time the system was taking to look for the CPUs, we changed the BIOS settings so that only 8 cores are visible, and tested both the cxl* and vcxl* Chelsio interfaces. The numbers are still way lower than what we expected:
>>
>> cxl interface
>>
>> root@router1:~ # netstat -w1 -h
>>           input        (Total)           output
>>  packets  errs idrops      bytes    packets  errs      bytes colls
>>     4.1M     0  3.4M       2.1G       725k     0       383M     0
>>     3.7M     0  3.1M       1.9G       636k     0       336M     0
>>     3.9M     0  3.2M       2.0G       684k     0       362M     0
>>     4.0M     0  3.3M       2.1G       702k     0       371M     0
>>     3.8M     0  3.2M       2.0G       658k     0       348M     0
>>     3.9M     0  3.2M       2.0G       658k     0       348M     0
>>     3.9M     0  3.2M       2.0G       721k     0       381M     0
>>     3.3M     0  2.6M       1.7G       681k     0       360M     0
>>     3.2M     0  2.5M       1.7G       666k     0       352M     0
>>     2.6M     0  2.0M       1.4G       620k     0       328M     0
>>     2.8M     0  2.1M       1.4G       615k     0       325M     0
>>     3.2M     0  2.6M       1.7G       612k     0       323M     0
>>     3.3M     0  2.7M       1.7G       664k     0       351M     0
>>
>>
>> vcxl interface
>>           input        (Total)           output
>>  packets  errs idrops      bytes    packets  errs      bytes colls drops
>>     590k  7.5k     0       314M       590k     0       314M     0     0
>>     526k  6.6k     0       280M       526k     0       280M     0     0
>>     588k  7.1k     0       313M       588k     0       313M     0     0
>>     532k  6.6k     0       283M       532k     0       283M     0     0
>>     578k  7.2k     0       307M       578k     0       307M     0     0
>>     565k  7.0k     0       300M       565k     0       300M     0     0
>>     558k  7.0k     0       297M       558k     0       297M     0     0
>>     533k  6.7k     0       284M       533k     0       284M     0     0
>>     588k  7.3k     0       313M       588k     0       313M     0     0
>>     553k  6.9k     0       295M       554k     0       295M     0     0
>>     527k  6.7k     0       281M       527k     0       281M     0     0
>>     585k  7.4k     0       311M       585k     0       311M     0     0
>>
>> The related pmcstat results are:
>>
>> root@router1:~/PMC_Stats/Feb22 #  pmcstat -R sample.out -G - | head
>> @ CPU_CLK_UNHALTED_CORE [2091 samples]
>>
>> 15.35%  [321]      lock_delay @ /boot/kernel/kernel
>> 94.70%  [304]       _mtx_lock_spin_cookie
>> 100.0%  [304]        __mtx_lock_spin_flags
>>  57.89%  [176]         pmclog_loop @ /boot/kernel/hwpmc.ko
>>   100.0%  [176]          fork_exit @ /boot/kernel/kernel
>>  41.12%  [125]         pmclog_reserve @ /boot/kernel/hwpmc.ko
>>   100.0%  [125]          pmclog_process_callchain
>>    100.0%  [125]           pmc_process_samples
>>
>> root@router1:~/PMC_Stats/Feb22 # pmcstat -R sample0.out -G - | head
>> @ CPU_CLK_UNHALTED_CORE [480 samples]
>>
>> 37.29%  [179]      acpi_cpu_idle_mwait @ /boot/kernel/kernel
>> 100.0%  [179]       acpi_cpu_idle
>> 100.0%  [179]        cpu_idle_acpi
>>  100.0%  [179]         cpu_idle
>>   100.0%  [179]          sched_idletd
>>    100.0%  [179]           fork_exit
>>
>> 12.92%  [62]       cpu_idle @ /boot/kernel/kernel
>>
>> When trying to run pmcstat with the vcxl interfaces enabled, the system just became unresponsive.
>>
>> Based on our previous results with CentOS 7 (over 3M pps), we can assume that it is not the hardware. However, we are still looking for the reason why we are getting these numbers.
>>
>> Any feedback or suggestions would be highly appreciated.
>>
>> - Jordan
>>
>> On 2/9/17 11:34 AM, Navdeep Parhar wrote:
>>> The vcxl interfaces should work under -CURRENT or 11-STABLE.  Let me know if you run into any trouble when trying to use netmap with the cxgbe driver.
>>>
>>> Regards,
>>> Navdeep
>>>
>>> On Thu, Feb 09, 2017 at 10:29:08AM -0500, John Jasen wrote:
>>>> It's not the hardware.
>>>>
>>>> Jordan booted up CentOS on the box and, untuned, was able to obtain over 3 Mpps.
>>>>
>>>> He has some pmcstat output from freebsd-current, but basically, it appears the system spends most of its time looking for a CPU to service the interrupts and keeps landing on one or two of them, as opposed to any of the other 16 cores on the physical silicon.
>>>>
>>>> We also tried swapping the T5 card out for a Mellanox, tried different PCIe slots, and adjusted cpuset for the low and the high CPUs; no matter what we try, the results have been bad.
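>>>>
>>>> For reference, the cpuset runs were along these lines; a sketch only, where irq 273 is purely illustrative and the real numbers come from vmstat -ai on the box in question:
>>>>
>>>> # list the t5nex queue interrupts, then pin one to a CPU range
>>>> vmstat -ai | grep t5nex
>>>> cpuset -l 0-7 -x 273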
>>>>
>>>> Our network test environment is under reconstruction at the moment, but our plans afterwards are to:
>>>>
>>>> a) test netmap-fwd again (the vcxl enabling works under -CURRENT?)
>>>>
>>>> b) test without netmap-fwd, with reduced cores/physical CPUs (BIOS setting)
>>>>
>>>> c) potentially, test with netmap-fwd and a reduced core count.
>>>>
>>>> Any other ideas out there?
>>>>
>>>> Thanks!
>>>>
>>>> On 02/05/2017 12:55 PM, Navdeep Parhar wrote:
>>>>> I've been following the email thread on freebsd-net on this.  The numbers you're getting are well below what the hardware is capable of.
>>>>>
>>>>> Have you tried netmap-fwd or something else that bypasses the kernel?  That would be a very quick way to make sure that the hardware is doing OK.
>>>>>
>>>>> In case you try netmap:
>>>>> cxgbe has virtual interfaces now and those are used for netmap (instead of the main interface).  Add this line to /boot/loader.conf and you'll see a 'vcxl' interface for every cxl interface:
>>>>> hw.cxgbe.num_vis=2
>>>>> Each one has its own MAC address and can be used like any other interface, except that it has native netmap support too.  You can run netmap-fwd between these vcxl ports.
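>>>>>
>>>>> Roughly, the whole sequence would look like this (a sketch; the interface names are whatever shows up on your box, and the netmap-fwd invocation follows its README):
>>>>>
>>>>> # /boot/loader.conf
>>>>> hw.cxgbe.num_vis=2
>>>>> # after a reboot, every cxl port gains a vcxl companion:
>>>>> ifconfig vcxl0
>>>>> ifconfig vcxl1
>>>>> # forward packets between the two ports entirely in netmap:
>>>>> netmap-fwd vcxl0 vcxl1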
>>>>>
>>>>> Regards,
>>>>> Navdeep
>>>>>
>>>>> On Tue, Jan 31, 2017 at 01:57:37PM -0400, Jordan Caraballo wrote:
>>>>>>  Navdeep, Troy,
>>>>>>
>>>>>>  I forwarded you this email to see if we could get feedback from both of you. I talked with Troy during November about this R530 system and the use of a 40G Chelsio T-580-CR card. So far, we have not seen results above 1.4 million pps or so.
>>>>>>
>>>>>>  Any help would be appreciated.
>>>>>>
>>>>>>  - Jordan
>>>>>>
>>>>>>  -------- Forwarded Message --------
>>>>>>
>>>>>>  Subject: Re: Disappointing packets-per-second performance results on a Dell, PE R530
>>>>>>     Date: Tue, 31 Jan 2017 13:53:15 -0400
>>>>>>     From: Jordan Caraballo <jordancaraballo87@gmail.com>
>>>>>>       To: Slawa Olhovchenkov <slw@zxy.spb.ru>
>>>>>>       CC: freebsd-net@freebsd.org
>>>>>>
>>>>>>  These are the most recent stats. No advances so far. The system has -CURRENT right now.
>>>>>>
>>>>>>  Any help or feedback would be appreciated.
>>>>>>  Hardware Configuration:
>>>>>>  Dell PowerEdge R530 with 2 Intel(R) Xeon(R) E5-2695 CPUs, 18 cores per CPU. Equipped with a dual-port Chelsio T-580-CR in an x8 slot.
>>>>>>
>>>>>>  BIOS tweaks:
>>>>>>  Hyperthreading (or Logical Processors) is turned off.
>>>>>>  loader.conf
>>>>>>  # Chelsio Modules
>>>>>>  t4fw_cfg_load="YES"
>>>>>>  t5fw_cfg_load="YES"
>>>>>>  if_cxgbe_load="YES"
>>>>>>  rc.conf
>>>>>>  # Gateway Configuration
>>>>>>  ifconfig_cxl0="inet 172.16.1.1/24"
>>>>>>  ifconfig_cxl1="inet 172.16.2.1/24"
>>>>>>  gateway_enable="YES"
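>>>>>>
>>>>>>  (gateway_enable="YES" just sets the sysctl net.inet.ip.forwarding=1 at boot; for quick experiments, "sysctl net.inet.ip.forwarding=1" is the no-reboot equivalent.)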
>>>>>>
>>>>>>  Last Results:
>>>>>>  packets errs idrops bytes packets errs bytes colls drops
>>>>>>  2.7M 0 2.0M 1.4G 696k 0 368M 0 0
>>>>>>  2.7M 0 2.0M 1.4G 686k 0 363M 0 0
>>>>>>  2.6M 0 2.0M 1.4G 668k 0 353M 0 0
>>>>>>  2.7M 0 2.0M 1.4G 661k 0 350M 0 0
>>>>>>  2.8M 0 2.1M 1.5G 697k 0 369M 0 0
>>>>>>  2.8M 0 2.1M 1.4G 684k 0 361M 0 0
>>>>>>  2.7M 0 2.1M 1.4G 674k 0 356M 0 0
>>>>>>
>>>>>>  root@router1:~ # vmstat -i
>>>>>>
>>>>>>  interrupt total rate
>>>>>>  irq9: acpi0 73 0
>>>>>>  irq18: ehci0 ehci1 1155973 3
>>>>>>  cpu0:timer 3551157 10
>>>>>>  cpu29:timer 9303048 27
>>>>>>  cpu9:timer 71693455 207
>>>>>>  cpu16:timer 9798380 28
>>>>>>  cpu18:timer 9287094 27
>>>>>>  cpu26:timer 9342495 27
>>>>>>  cpu20:timer 9145888 26
>>>>>>  cpu8:timer 9791228 28
>>>>>>  cpu22:timer 9288116 27
>>>>>>  cpu35:timer 9376578 27
>>>>>>  cpu30:timer 9396294 27
>>>>>>  cpu23:timer 9248760 27
>>>>>>  cpu10:timer 9756455 28
>>>>>>  cpu25:timer 9300202 27
>>>>>>  cpu27:timer 9227291 27
>>>>>>  cpu14:timer 10083548 29
>>>>>>  cpu28:timer 9325684 27
>>>>>>  cpu11:timer 9906405 29
>>>>>>  cpu34:timer 9419170 27
>>>>>>  cpu31:timer 9392089 27
>>>>>>  cpu33:timer 9350540 27
>>>>>>  cpu15:timer 9804551 28
>>>>>>  cpu32:timer 9413182 27
>>>>>>  cpu19:timer 9231505 27
>>>>>>  cpu12:timer 9813506 28
>>>>>>  cpu13:timer 10872130 31
>>>>>>  cpu4:timer 9920237 29
>>>>>>  cpu2:timer 9786498 28
>>>>>>  cpu3:timer 9896011 29
>>>>>>  cpu5:timer 9890207 29
>>>>>>  cpu6:timer 9737869 28
>>>>>>  cpu7:timer 9790119 28
>>>>>>  cpu1:timer 9847913 28
>>>>>>  cpu21:timer 9192561 27
>>>>>>  cpu24:timer 9300259 27
>>>>>>  cpu17:timer 9786186 28
>>>>>>  irq264: mfi0 151818 0
>>>>>>  irq266: bge0 30466 0
>>>>>>  irq272: t5nex0:evt 4 0
>>>>>>  Total 402604945 1161
>>>>>>  top -PHS
>>>>>>  last pid: 18557; load averages: 2.58, 1.90, 0.95 up 4+00:39:54 18:30:46
>>>>>>  231 processes: 40 running, 126 sleeping, 65 waiting
>>>>>>  CPU 0: 0.0% user, 0.0% nice, 0.0% system, 0.0% interrupt, 100% idle
>>>>>>  CPU 1: 0.0% user, 0.0% nice, 0.0% system, 0.0% interrupt, 100% idle
>>>>>>  CPU 2: 0.0% user, 0.0% nice, 0.0% system, 0.0% interrupt, 100% idle
>>>>>>  CPU 3: 0.0% user, 0.0% nice, 0.0% system, 0.0% interrupt, 100% idle
>>>>>>  CPU 4: 0.0% user, 0.0% nice, 0.0% system, 0.0% interrupt, 100% idle
>>>>>>  CPU 5: 0.0% user, 0.0% nice, 0.0% system, 0.0% interrupt, 100% idle
>>>>>>  CPU 6: 0.0% user, 0.0% nice, 0.4% system, 0.0% interrupt, 99.6% idle
>>>>>>  CPU 7: 0.0% user, 0.0% nice, 0.0% system, 0.0% interrupt, 100% idle
>>>>>>  CPU 8: 0.0% user, 0.0% nice, 0.0% system, 0.0% interrupt, 100% idle
>>>>>>  CPU 9: 0.0% user, 0.0% nice, 0.0% system, 0.0% interrupt, 100% idle
>>>>>>  CPU 10: 0.0% user, 0.0% nice, 0.0% system, 0.0% interrupt, 100% idle
>>>>>>  CPU 11: 0.0% user, 0.0% nice, 0.0% system, 0.0% interrupt, 100% idle
>>>>>>  CPU 12: 0.0% user, 0.0% nice, 0.0% system, 0.0% interrupt, 100% idle
>>>>>>  CPU 13: 0.0% user, 0.0% nice, 0.0% system, 0.0% interrupt, 100% idle
>>>>>>  CPU 14: 0.0% user, 0.0% nice, 0.0% system, 0.0% interrupt, 100% idle
>>>>>>  CPU 15: 0.0% user, 0.0% nice, 0.0% system, 0.0% interrupt, 100% idle
>>>>>>  CPU 16: 0.0% user, 0.0% nice, 0.0% system, 0.0% interrupt, 100% idle
>>>>>>  CPU 17: 0.0% user, 0.0% nice, 0.0% system, 0.0% interrupt, 100% idle
>>>>>>  CPU 18: 0.0% user, 0.0% nice, 0.0% system, 0.0% interrupt, 100% idle
>>>>>>  CPU 19: 0.0% user, 0.0% nice, 0.0% system, 0.0% interrupt, 100% idle
>>>>>>  CPU 20: 0.0% user, 0.0% nice, 0.0% system, 0.0% interrupt, 100% idle
>>>>>>  CPU 21: 0.0% user, 0.0% nice, 0.0% system, 0.0% interrupt, 100% idle
>>>>>>  CPU 22: 0.0% user, 0.0% nice, 0.0% system, 0.0% interrupt, 100% idle
>>>>>>  CPU 23: 0.0% user, 0.0% nice, 0.0% system, 0.0% interrupt, 100% idle
>>>>>>  CPU 24: 0.0% user, 0.0% nice, 0.0% system, 0.0% interrupt, 100% idle
>>>>>>  CPU 25: 0.0% user, 0.0% nice, 0.0% system, 0.0% interrupt, 100% idle
>>>>>>  CPU 26: 0.0% user, 0.0% nice, 0.0% system, 59.6% interrupt, 40.4% idle
>>>>>>  CPU 27: 0.0% user, 0.0% nice, 0.0% system, 96.3% interrupt, 3.7% idle
>>>>>>  CPU 28: 0.0% user, 0.0% nice, 0.0% system, 0.0% interrupt, 100% idle
>>>>>>  CPU 29: 0.0% user, 0.0% nice, 0.0% system, 0.0% interrupt, 100% idle
>>>>>>  CPU 30: 0.0% user, 0.0% nice, 0.0% system, 0.0% interrupt, 100% idle
>>>>>>  CPU 31: 0.0% user, 0.0% nice, 0.0% system, 0.0% interrupt, 100% idle
>>>>>>  CPU 32: 0.0% user, 0.0% nice, 0.0% system, 0.0% interrupt, 100% idle
>>>>>>  CPU 33: 0.0% user, 0.0% nice, 0.0% system, 0.0% interrupt, 100% idle
>>>>>>  CPU 34: 0.0% user, 0.0% nice, 0.0% system, 100% interrupt, 0.0% idle
>>>>>>  CPU 35: 0.0% user, 0.0% nice, 0.0% system, 0.0% interrupt, 100% idle
>>>>>>  Mem: 15M Active, 224M Inact, 1544M Wired, 393M Buf, 29G Free
>>>>>>  Swap: 3881M Total, 3881M Free
>>>>>>
>>>>>>  pmcstat -R sample.out -G - | head
>>>>>>  @ CPU_CLK_UNHALTED_CORE [159 samples]
>>>>>>
>>>>>>  39.62%  [63]       acpi_cpu_idle_mwait @ /boot/kernel/kernel
>>>>>>   100.0%  [63]        acpi_cpu_idle
>>>>>>    100.0%  [63]         cpu_idle_acpi
>>>>>>     100.0%  [63]          cpu_idle
>>>>>>      100.0%  [63]           sched_idletd
>>>>>>       100.0%  [63]            fork_exit
>>>>>>
>>>>>>  17.61%  [28]       cpu_idle @ /boot/kernel/kernel
>>>>>>
>>>>>>  root@router1:~ # pmcstat -R sample0.out -G - | head
>>>>>>  @ CPU_CLK_UNHALTED_CORE [750 samples]
>>>>>>
>>>>>>  31.60%  [237]      acpi_cpu_idle_mwait @ /boot/kernel/kernel
>>>>>>   100.0%  [237]       acpi_cpu_idle
>>>>>>    100.0%  [237]        cpu_idle_acpi
>>>>>>     100.0%  [237]         cpu_idle
>>>>>>      100.0%  [237]          sched_idletd
>>>>>>       100.0%  [237]           fork_exit
>>>>>>
>>>>>>  10.67%  [80]       cpu_idle @ /boot/kernel/kernel
>>>>>>
>>>>>>  On 03/01/17 13:46, Slawa Olhovchenkov wrote:
>>>>>>
>>>>>> On Tue, Jan 03, 2017 at 12:35:42PM -0400, Jordan Caraballo wrote:
>>>>>>
>>>>>> We recently tested a Dell R530 with a Chelsio T580 card under FreeBSD 10.3, 11.0, -STABLE and -CURRENT, and under CentOS 7.
>>>>>>
>>>>>> Based on our research, including netmap-fwd and the routing improvements project (https://wiki.freebsd.org/ProjectsRoutingProposal), we hoped for packets-per-second (pps) rates in the 5+ million range, or even higher.
>>>>>>
>>>>>> Based on prior testing (http://marc.info/?t=140604252400002&r=1&w=2), we expected 3-4 million to be easily obtainable.
>>>>>>
>>>>>> Unfortunately, our current results top out at no more than 1.5M pps (64-byte packets) with FreeBSD and, surprisingly, around 3.2M pps (128-byte packets) with CentOS 7, and we are at a loss as to why.
>>>>>>
>>>>>> Server Description:
>>>>>> Dell PowerEdge R530 with 2 Intel(R) Xeon(R) E5-2695 CPUs, 18 cores per CPU. Equipped with a dual-port Chelsio T-580-CR in an x8 slot.
>>>>>>
>>>>>> ** Could this be a lack-of-support issue related to the R530's hardware? **
>>>>>>
>>>>>> Any help appreciated!
>>>>>>
>>>>>> What hardware configuration?
>>>>>> What BIOS setting?
>>>>>> What loader.conf/sysctl.conf setting?
>>>>>> What `vmstat -i`?
>>>>>> What `top -PHS`?
>>>>>> What does the following show?
>>>>>> ====
>>>>>> pmcstat -S CPU_CLK_UNHALTED_CORE -l 10 -O sample.out
>>>>>> pmcstat -R sample.out -G out.txt
>>>>>> pmcstat -c 0 -S CPU_CLK_UNHALTED_CORE -l 10 -O sample0.out
>>>>>> pmcstat -R sample0.out -G out0.txt
>>>>>> ====



