Date:      Sun, 24 Jan 2016 05:00:55 -0200
From:      "Marcus Cenzatti" <cenzatti@hush.com>
To:        "Luigi Rizzo" <rizzo@iet.unipi.it>
Cc:        freebsd-net@freebsd.org, "Navdeep Parhar" <nparhar@gmail.com>
Subject:   Re: solved: Re: Chelsio T520-SO-CR low performance (netmap tested) for RX
Message-ID:  <20160124070056.4EC5CA0126@smtp.hushmail.com>
In-Reply-To: <CA+hQ2+hxOZkGJdRSrmxSqHforLbMWBVQcayrNFNLLkU803hmjA@mail.gmail.com>
References:  <CA+hQ2+g7_haaXLFjMuG00ANsUkFdyGzFQyjT4NYVBmPY-vECBg@mail.gmail.com> <20160124042830.3D674A0128@smtp.hushmail.com> <CA+hQ2+hxOZkGJdRSrmxSqHforLbMWBVQcayrNFNLLkU803hmjA@mail.gmail.com>



On 1/24/2016 at 3:33 AM, "Luigi Rizzo" <rizzo@iet.unipi.it> wrote:
>
>On Sat, Jan 23, 2016 at 8:28 PM, Marcus Cenzatti 
><cenzatti@hush.com> wrote:
>>
>>
>> On 1/24/2016 at 1:10 AM, "Luigi Rizzo" <rizzo@iet.unipi.it> wrote:
>>>
>>>Thanks for re-running the experiments.
>>>
>>>I am changing the subject so that in the archives it is clear
>>>that the chelsio card works fine.
>>>
>>>Overall the tests confirm that whenever you hit the host stack you
>>>are bound to the poor performance of the latter. The problem does
>>>not appear using intel as a receiver because on the intel card
>>>netmap mode disables the host stack.
>>>
>>>More comments on the experiments:
>>>
>>>The only meaningful test is the one where you use the DMAC of the
>>>ncxl0 port:
>>>
>>>    SENDER: ./pkt-gen -i ix0 -f tx -S 00:07:e9:44:d2:ba -D 00:07:43:33:8d:c1
>>>
>>>in the other experiment you transmit broadcast frames and hit the
>>>network stack.
>>>ARP etc do not matter since tx and rx are directly connected.
>>>
>>>On the receiver you do not need to specify addresses:
>>>
>>>    RECEIVER: ./pkt-gen -i ncxl0 -f rx
>>>
>>>The numbers in netstat are clearly rounded, so 15M is probably
>>>14.88M (line rate), and the 3.7M that you see correctly represents
>>>the difference between incoming and received packets.
>>>
>>>The fact that you see drops may be related to the NIC being unable
>>>to replenish the queue fast enough, which in turn may be a hardware
>>>or a software (netmap) issue.
>>>You may try experimenting with shorter batches on the receive side
>>>(say, -b 64 or less) and see if you have better results.
>>>
>>>A short batch replenishes the rx queue more frequently, but it is
>>>not a conclusive experiment because there is an optimization in
>>>the netmap poll code which, as an unintended side effect, replenishes
>>>the queue less often than it should.
>>>For a conclusive experiment you should grab the netmap code from
>>>github.com/luigirizzo/netmap and use pkt-gen-b, which uses busy
>>>wait and works around the poll "optimization".
>>>
>>>thanks again for investigating the issue.
>>>
>>>cheers
>>>luigi
>>>
>>
>> So as a summary: with the IP test on the intel card, netmap disables
>> the host stack, while on chelsio netmap does not disable the host
>> stack and we get things injected into the host, so the only reliable
>> test is MAC based when using chelsio cards?
>>
>> Yes, I am already running github's netmap code; let's try the
>> busy-wait version:
>...
>> chelsio# ./pkt-gen-b -i ncxl0 -f rx
>> 785.659290 main [1930] interface is ncxl0
>> 785.659337 main [2050] running on 1 cpus (have 4)
>> 785.659477 extract_ip_range [367] range is 10.0.0.1:0 to 10.0.0.1:0
>> 785.659496 extract_ip_range [367] range is 10.1.0.1:0 to 10.1.0.1:0
>> 785.718707 main [2148] mapped 334980KB at 0x801800000
>> Receiving from netmap:ncxl0: 2 queues, 1 threads and 1 cpus.
>> 785.718784 main [2235] Wait 2 secs for phy reset
>> 787.729197 main [2237] Ready...
>> 787.729449 receiver_body [1412] reading from netmap:ncxl0 fd 3 main_fd 3
>> 788.730089 main_thread [1720] 11.159 Mpps (11.166 Mpkts 5.360 Gbps in 1000673 usec) 205.89 avg_batch 0 min_space
>> 789.730588 main_thread [1720] 11.164 Mpps (11.169 Mpkts 5.361 Gbps in 1000500 usec) 183.54 avg_batch 0 min_space
>> 790.734224 main_thread [1720] 11.172 Mpps (11.213 Mpkts 5.382 Gbps in 1003636 usec) 198.84 avg_batch 0 min_space
>> ^C791.140853 sigint_h [404] received control-C on thread 0x801406800
>> 791.742841 main_thread [1720] 4.504 Mpps (4.542 Mpkts 2.180 Gbps in 1008617 usec) 179.62 avg_batch 0 min_space
>> Received 38091031 packets 2285461860 bytes 196774 events 60 bytes each in 3.41 seconds.
>> Speed: 11.166 Mpps Bandwidth: 5.360 Gbps (raw 7.504 Gbps). Average batch: 193.58 pkts
>>
>> chelsio# ./pkt-gen-b -b 64 -i ncxl0 -f rx
>> 522.430459 main [1930] interface is ncxl0
>> 522.430507 main [2050] running on 1 cpus (have 4)
>> 522.430644 extract_ip_range [367] range is 10.0.0.1:0 to 10.0.0.1:0
>> 522.430662 extract_ip_range [367] range is 10.1.0.1:0 to 10.1.0.1:0
>> 522.677743 main [2148] mapped 334980KB at 0x801800000
>> Receiving from netmap:ncxl0: 2 queues, 1 threads and 1 cpus.
>> 522.677822 main [2235] Wait 2 secs for phy reset
>> 524.698114 main [2237] Ready...
>> 524.698373 receiver_body [1412] reading from netmap:ncxl0 fd 3 main_fd 3
>> 525.699118 main_thread [1720] 10.958 Mpps (10.966 Mpkts 5.264 Gbps in 1000765 usec) 61.84 avg_batch 0 min_space
>> 526.700108 main_thread [1720] 11.086 Mpps (11.097 Mpkts 5.327 Gbps in 1000991 usec) 61.06 avg_batch 0 min_space
>> 527.705650 main_thread [1720] 11.166 Mpps (11.227 Mpkts 5.389 Gbps in 1005542 usec) 61.91 avg_batch 0 min_space
>> 528.707113 main_thread [1720] 11.090 Mpps (11.107 Mpkts 5.331 Gbps in 1001463 usec) 61.34 avg_batch 0 min_space
>> 529.707617 main_thread [1720] 10.847 Mpps (10.853 Mpkts 5.209 Gbps in 1000504 usec) 62.51 avg_batch 0 min_space
>> ^C530.556309 sigint_h [404] received control-C on thread 0x801406800
>> 530.709133 main_thread [1720] 9.166 Mpps (9.180 Mpkts 4.406 Gbps in 1001516 usec) 62.92 avg_batch 0 min_space
>> Received 64430028 packets 3865801680 bytes 1041000 events 60 bytes each in 5.86 seconds.
>> Speed: 10.999 Mpps Bandwidth: 5.279 Gbps (raw 7.391 Gbps). Average batch: 61.89 pkts
>...
>
>> So, the lower the batch, the lower the performance.
>>
>> Did you expect some other behaviour?
>
>
>For very small batches, yes.
>For larger batch sizes I was hoping that refilling the ring more often
>could reduce losses.
>
>One last attempt: try using -l 64 on the sender. This will generate
>64+4 byte packets, which may become just 64 on the receiver if the
>chelsio is configured to strip the CRC. This should result in well
>aligned PCIe transactions and reduced PCIe traffic, which may help
>(the ix driver has a similar problem, but since it does not strip
>the CRC it can rx at line rate with 60 bytes but not with 64).
>
>If this does not help we should ask Navdeep if he knows what the NIC
>is capable of.
>
>cheers
>luigi

ok here it is

This lowered the RX rate to 9.4 Mpps on the chelsio (we had 11 Mpps with the default length) and the TX rate to 14 Mpps on the sender (we had 14.8 Mpps before).
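
For what it's worth, 14 Mpps on the sender is roughly the theoretical 10GbE limit for a 64-byte payload once the 4-byte CRC, 8-byte preamble and 12-byte inter-frame gap are counted (14.88 Mpps for the default 60-byte payload). A quick sanity-check program (my own, not part of pkt-gen):

/*
 * Hypothetical helper, not from pkt-gen: theoretical 10GbE packet
 * rate for a given pkt-gen payload length.  Each frame on the wire
 * is payload + 4 bytes CRC + 8 bytes preamble + 12 bytes IFG.
 */
#include <stdio.h>

int
main(void)
{
	const double link_bps = 10e9;
	const int payload[] = { 60, 64 };	/* pkt-gen default and -l 64 */

	for (int i = 0; i < 2; i++) {
		int wire_bits = (payload[i] + 4 + 8 + 12) * 8;

		/* prints 14.88 Mpps for 60 bytes, 14.20 Mpps for 64 bytes */
		printf("%d-byte payload: %.2f Mpps line rate\n",
		    payload[i], link_bps / wire_bits / 1e6);
	}
	return (0);
}

The "raw" Gbps that pkt-gen prints appears to include the same framing overhead (14.086 Mpps x 88 bytes x 8 is about 9.92 Gbps), so the sender side is essentially at line rate.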

intel# netmap-master/examples/pkt-gen-b -l 64 -i ix0 -f tx -S 00:07:e9:44:d2:ba -D 00:07:43:33:8d:c1
071.547164 main [1930] interface is ix0
071.547203 main [2050] running on 1 cpus (have 8)
071.547222 extract_ip_range [367] range is 10.0.0.1:0 to 10.0.0.1:0
071.547232 extract_ip_range [367] range is 10.1.0.1:0 to 10.1.0.1:0
071.652590 main [2148] mapped 334980KB at 0x801800000
Sending on netmap:ix0: 8 queues, 1 threads and 1 cpus.
10.0.0.1 -> 10.1.0.1 (00:07:e9:44:d2:ba -> 00:07:43:33:8d:c1)
071.652665 main [2233] Sending 512 packets every  0.000000000 s
071.652670 main [2235] Wait 2 secs for phy reset
073.662663 main [2237] Ready...
073.662902 sender_body [1211] start, fd 3 main_fd 3
073.702057 sender_body [1293] drop copy
074.664337 main_thread [1720] 13.627 Mpps (13.646 Mpkts 6.987 Gbps in 1001450 usec) 423.69 avg_batch 0 min_space
(...)
113.849333 main_thread [1720] 14.133 Mpps (14.147 Mpkts 7.243 Gbps in 1001001 usec) 412.94 avg_batch 99999 min_space
^C114.141717 sigint_h [404] received control-C on thread 0x801406800
114.141968 sender_body [1326] flush tail 360 head 1512 on thread 0x801406800
114.142212 sender_body [1334] pending tx tail 99 head 612 on ring 0
114.142267 sender_body [1334] pending tx tail 1000 head 1050 on ring 4
114.142299 sender_body [1334] pending tx tail 115 head 125 on ring 5
114.857560 main_thread [1720] 4.057 Mpps (4.090 Mpkts 2.094 Gbps in 1008227 usec) 421.75 avg_batch 99999 min_space
Sent 570189473 packets 36492126272 bytes 1429432 events 64 bytes each in 40.48 seconds.
Speed: 14.086 Mpps Bandwidth: 7.212 Gbps (raw 9.917 Gbps). Average batch: 398.89 pkts

chelsio# ./pkt-gen-b  -i ncxl0 -f rx
149.133685 main [1930] interface is ncxl0
149.133732 main [2050] running on 1 cpus (have 4)
149.133870 extract_ip_range [367] range is 10.0.0.1:0 to 10.0.0.1:0
149.133889 extract_ip_range [367] range is 10.1.0.1:0 to 10.1.0.1:0
149.192708 main [2148] mapped 334980KB at 0x801800000
Receiving from netmap:ncxl0: 2 queues, 1 threads and 1 cpus.
149.192783 main [2235] Wait 2 secs for phy reset
151.296102 main [2237] Ready...
151.296365 receiver_body [1412] reading from netmap:ncxl0 fd 3 main_fd 3
(..)
187.589094 main_thread [1720] 3.386 Mpps (3.396 Mpkts 1.739 Gbps in 1003011 usec) 174.46 avg_batch 0 min_space
188.590084 main_thread [1720] 9.470 Mpps (9.480 Mpkts 4.854 Gbps in 1000990 usec) 149.61 avg_batch 0 min_space
189.590591 main_thread [1720] 9.466 Mpps (9.471 Mpkts 4.849 Gbps in 1000507 usec) 164.13 avg_batch 0 min_space
190.608590 main_thread [1720] 9.470 Mpps (9.640 Mpkts 4.936 Gbps in 1017999 usec) 144.30 avg_batch 0 min_space
191.609584 main_thread [1720] 9.471 Mpps (9.481 Mpkts 4.854 Gbps in 1000994 usec) 158.99 avg_batch 0 min_space
192.610594 main_thread [1720] 9.470 Mpps (9.480 Mpkts 4.854 Gbps in 1001010 usec) 148.97 avg_batch 0 min_space
193.614357 main_thread [1720] 9.471 Mpps (9.507 Mpkts 4.867 Gbps in 1003763 usec) 168.09 avg_batch 0 min_space
194.614582 main_thread [1720] 9.469 Mpps (9.471 Mpkts 4.849 Gbps in 1000225 usec) 160.39 avg_batch 0 min_space
195.615590 main_thread [1720] 9.470 Mpps (9.480 Mpkts 4.854 Gbps in 1001008 usec) 151.97 avg_batch 0 min_space
196.617080 main_thread [1720] 9.471 Mpps (9.485 Mpkts 4.856 Gbps in 1001490 usec) 171.75 avg_batch 0 min_space
197.618083 main_thread [1720] 9.471 Mpps (9.480 Mpkts 4.854 Gbps in 1001003 usec) 164.99 avg_batch 0 min_space
198.619718 main_thread [1720] 9.471 Mpps (9.486 Mpkts 4.857 Gbps in 1001636 usec) 153.07 avg_batch 0 min_space
199.620607 main_thread [1720] 9.467 Mpps (9.476 Mpkts 4.852 Gbps in 1000888 usec) 153.94 avg_batch 0 min_space
200.622081 main_thread [1720] 9.471 Mpps (9.485 Mpkts 4.856 Gbps in 1001474 usec) 161.03 avg_batch 0 min_space
201.622582 main_thread [1720] 9.471 Mpps (9.476 Mpkts 4.852 Gbps in 1000501 usec) 168.47 avg_batch 0 min_space
202.632223 main_thread [1720] 9.470 Mpps (9.561 Mpkts 4.895 Gbps in 1009641 usec) 145.45 avg_batch 0 min_space
203.633077 main_thread [1720] 9.471 Mpps (9.479 Mpkts 4.853 Gbps in 1000854 usec) 170.21 avg_batch 0 min_space
204.633586 main_thread [1720] 9.467 Mpps (9.472 Mpkts 4.850 Gbps in 1000509 usec) 160.41 avg_batch 0 min_space
205.640364 main_thread [1720] 9.471 Mpps (9.535 Mpkts 4.882 Gbps in 1006778 usec) 171.47 avg_batch 0 min_space
206.641075 main_thread [1720] 9.471 Mpps (9.478 Mpkts 4.853 Gbps in 1000711 usec) 169.56 avg_batch 0 min_space
207.642079 main_thread [1720] 9.471 Mpps (9.480 Mpkts 4.854 Gbps in 1001004 usec) 158.71 avg_batch 0 min_space
208.642581 main_thread [1720] 9.471 Mpps (9.475 Mpkts 4.851 Gbps in 1000502 usec) 172.57 avg_batch 0 min_space
^C208.909765 sigint_h [404] received control-C on thread 0x801406800
209.650597 main_thread [1720] 2.510 Mpps (2.530 Mpkts 1.296 Gbps in 1008016 usec) 172.19 avg_batch 0 min_space
Received 205307453 packets 13139666672 bytes 1283636 events 63 bytes each in 57.61 seconds.
Speed: 9.364 Mpps Bandwidth: 8.825 Gbps (raw 9.509 Gbps). Average batch: 159.94 pkts

This is repeatable: 11 Mpps RX <== 14.8 Mpps TX without -l 64, and 9.4 Mpps RX <== 14 Mpps TX with -l 64, across all testing sessions executed.
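
For reference, since pkt-gen-b keeps coming up: as Luigi explained above, it busy-waits on the RX rings instead of sleeping in poll(), so the kernel gets a chance to replenish the NIC queue on every pass. Below is only my rough sketch of that receive pattern, reconstructed from the public netmap user API (it is not the actual pkt-gen-b source; ncxl0 is simply the port used in these tests):

/*
 * Rough sketch of a busy-wait netmap receiver (not the actual
 * pkt-gen-b code): drain the RX rings and issue NIOCRXSYNC in a
 * tight loop instead of blocking in poll().
 */
#define NETMAP_WITH_LIBS
#include <net/netmap_user.h>

#include <sys/ioctl.h>
#include <stdio.h>

int
main(void)
{
	struct nm_desc *d;
	unsigned long rx = 0;

	d = nm_open("netmap:ncxl0", NULL, 0, NULL);	/* port from the tests above */
	if (d == NULL) {
		fprintf(stderr, "nm_open failed\n");
		return (1);
	}

	while (rx < 100000000UL) {	/* arbitrary stop condition */
		/* sync the RX rings: hand consumed slots back to the NIC */
		ioctl(d->fd, NIOCRXSYNC, NULL);

		for (int i = d->first_rx_ring; i <= d->last_rx_ring; i++) {
			struct netmap_ring *ring = NETMAP_RXRING(d->nifp, i);

			while (!nm_ring_empty(ring)) {
				struct netmap_slot *s = &ring->slot[ring->cur];

				/* frame is at NETMAP_BUF(ring, s->buf_idx), s->len bytes */
				(void)s;
				rx++;
				ring->head = ring->cur =
				    nm_ring_next(ring, ring->cur);
			}
		}
	}
	nm_close(d);
	printf("received %lu packets\n", rx);
	return (0);
}

The repeated NIOCRXSYNC is what works around the poll "optimization" that replenishes the queue less often than it should.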

 




