Date:      Sat, 23 Jan 2016 22:42:17 -0800
From:      Navdeep Parhar <nparhar@gmail.com>
To:        Luigi Rizzo <rizzo@iet.unipi.it>
Cc:        Marcus Cenzatti <cenzatti@hush.com>, "freebsd-net@freebsd.org" <freebsd-net@freebsd.org>
Subject:   Re: solved: Re: Chelsio T520-SO-CR low performance (netmap tested) for RX
Message-ID:  <20160124064217.GB7567@ox>
In-Reply-To: <CA+hQ2+hxOZkGJdRSrmxSqHforLbMWBVQcayrNFNLLkU803hmjA@mail.gmail.com>
References:  <CA+hQ2+g7_haaXLFjMuG00ANsUkFdyGzFQyjT4NYVBmPY-vECBg@mail.gmail.com> <20160124042830.3D674A0128@smtp.hushmail.com> <CA+hQ2+hxOZkGJdRSrmxSqHforLbMWBVQcayrNFNLLkU803hmjA@mail.gmail.com>

On Sat, Jan 23, 2016 at 09:33:32PM -0800, Luigi Rizzo wrote:
> On Sat, Jan 23, 2016 at 8:28 PM, Marcus Cenzatti <cenzatti@hush.com> wrote:
> >
> >
> > On 1/24/2016 at 1:10 AM, "Luigi Rizzo" <rizzo@iet.unipi.it> wrote:
> >>
> >>Thanks for re-running the experiments.
> >>
> >>I am changing the subject so that in the archives it is clear
> >>that the chelsio card works fine.
> >>
> >>Overall the tests confirm that whenever you hit the host stack you
> >>are bound to the poor performance of the latter. The problem does
> >>not appear using intel as a receiver because on the intel card
> >>netmap mode disables the host stack.
> >>
> >>More comments on the experiments:
> >>
> >>The only meaningful test is the one where you use the DMAC of the
> >>ncxl0 port:
> >>
> >>    SENDER: ./pkt-gen -i ix0 -f tx -S 00:07:e9:44:d2:ba -D 00:07:43:33:8d:c1
> >>
> >>In the other experiment you transmit broadcast frames and hit the
> >>network stack.
> >>ARP etc. do not matter since tx and rx are directly connected.
> >>
> >>On the receiver you do not need to specify addresses:
> >>
> >>    RECEIVER: ./pkt-gen -i ncxl0 -f rx
> >>
> >>The numbers in netstat are clearly rounded, so 15M is probably
> >>14.88M (line rate), and the 3.7M that you see correctly represents
> >>the difference between incoming and received packets.
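
As a quick sketch of the arithmetic (assuming minimum-size 64-byte frames
on 10 GbE, counting the 8 bytes of preamble/SFD and the 12-byte IFG, and
roughly 11.2 Mpps actually received):

    10e9 / ((64 + 20) * 8) = 14,880,952 pps  ~= 14.88 Mpps line rate
    14.88 Mpps - ~11.2 Mpps received         ~= 3.7 Mpps dropped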
> >>
> >>The fact that you see drops may be related to the NIC being unable
> >>to replenish the queue fast enough, which in turn may be a hardware
> >>or a software (netmap) issue.
> >>You may try experimenting with shorter batches on the receive side
> >>(say, -b 64 or less) and see if you have better results.
> >>
> >>A short batch replenishes the rx queue more frequently, but it is
> >>not a conclusive experiment because there is an optimization in
> >>the netmap poll code which, as an unintended side effect,
> >>replenishes the queue less often than it should.
> >>For a conclusive experiment you should grab the netmap code from
> >>github.com/luigirizzo/netmap and use pkt-gen-b, which uses busy
> >>wait and works around the poll "optimization".
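
A minimal sketch of grabbing and building that code; the path of the
pkt-gen sources inside the tree and the build steps are assumptions and
may differ:

    git clone https://github.com/luigirizzo/netmap
    cd netmap/apps/pkt-gen      # assumed location of the pkt-gen sources
    make
    ./pkt-gen-b -i ncxl0 -f rx  # busy-wait receiver, as suggested above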
> >>
> >>thanks again for investigating the issue.
> >>
> >>cheers
> >>luigi
> >>
> >
> > So, as a summary: with the IP test on the intel card, netmap disables the host stack, while on the chelsio netmap does not disable the host stack and we get packets injected into the host; so the only reliable test with chelsio cards is the MAC-based one?
> >
> > yes, I am already running the netmap code from GitHub; let's try the busy-wait version:
> ...
> > chelsio# ./pkt-gen-b -i ncxl0 -f rx
> > 785.659290 main [1930] interface is ncxl0
> > 785.659337 main [2050] running on 1 cpus (have 4)
> > 785.659477 extract_ip_range [367] range is 10.0.0.1:0 to 10.0.0.1:0
> > 785.659496 extract_ip_range [367] range is 10.1.0.1:0 to 10.1.0.1:0
> > 785.718707 main [2148] mapped 334980KB at 0x801800000
> > Receiving from netmap:ncxl0: 2 queues, 1 threads and 1 cpus.
> > 785.718784 main [2235] Wait 2 secs for phy reset
> > 787.729197 main [2237] Ready...
> > 787.729449 receiver_body [1412] reading from netmap:ncxl0 fd 3 main_fd 3
> > 788.730089 main_thread [1720] 11.159 Mpps (11.166 Mpkts 5.360 Gbps in 1000673 usec) 205.89 avg_batch 0 min_space
> > 789.730588 main_thread [1720] 11.164 Mpps (11.169 Mpkts 5.361 Gbps in 1000500 usec) 183.54 avg_batch 0 min_space
> > 790.734224 main_thread [1720] 11.172 Mpps (11.213 Mpkts 5.382 Gbps in 1003636 usec) 198.84 avg_batch 0 min_space
> > ^C791.140853 sigint_h [404] received control-C on thread 0x801406800
> > 791.742841 main_thread [1720] 4.504 Mpps (4.542 Mpkts 2.180 Gbps in 1008617 usec) 179.62 avg_batch 0 min_space
> > Received 38091031 packets 2285461860 bytes 196774 events 60 bytes each in 3.41 seconds.
> > Speed: 11.166 Mpps Bandwidth: 5.360 Gbps (raw 7.504 Gbps). Average batch: 193.58 pkts
> >
> > chelsio# ./pkt-gen-b -b 64 -i ncxl0 -f rx
> > 522.430459 main [1930] interface is ncxl0
> > 522.430507 main [2050] running on 1 cpus (have 4)
> > 522.430644 extract_ip_range [367] range is 10.0.0.1:0 to 10.0.0.1:0
> > 522.430662 extract_ip_range [367] range is 10.1.0.1:0 to 10.1.0.1:0
> > 522.677743 main [2148] mapped 334980KB at 0x801800000
> > Receiving from netmap:ncxl0: 2 queues, 1 threads and 1 cpus.
> > 522.677822 main [2235] Wait 2 secs for phy reset
> > 524.698114 main [2237] Ready...
> > 524.698373 receiver_body [1412] reading from netmap:ncxl0 fd 3 main_fd 3
> > 525.699118 main_thread [1720] 10.958 Mpps (10.966 Mpkts 5.264 Gbps in 1000765 usec) 61.84 avg_batch 0 min_space
> > 526.700108 main_thread [1720] 11.086 Mpps (11.097 Mpkts 5.327 Gbps in 1000991 usec) 61.06 avg_batch 0 min_space
> > 527.705650 main_thread [1720] 11.166 Mpps (11.227 Mpkts 5.389 Gbps in 1005542 usec) 61.91 avg_batch 0 min_space
> > 528.707113 main_thread [1720] 11.090 Mpps (11.107 Mpkts 5.331 Gbps in 1001463 usec) 61.34 avg_batch 0 min_space
> > 529.707617 main_thread [1720] 10.847 Mpps (10.853 Mpkts 5.209 Gbps in 1000504 usec) 62.51 avg_batch 0 min_space
> > ^C530.556309 sigint_h [404] received control-C on thread 0x801406800
> > 530.709133 main_thread [1720] 9.166 Mpps (9.180 Mpkts 4.406 Gbps in 1001516 usec) 62.92 avg_batch 0 min_space
> > Received 64430028 packets 3865801680 bytes 1041000 events 60 bytes each in 5.86 seconds.
> > Speed: 10.999 Mpps Bandwidth: 5.279 Gbps (raw 7.391 Gbps). Average batch: 61.89 pkts
> ...
> 
> > so, the smaller the batch, the lower the performance.
> >
> > did you expect some other behaviour?
> 
> 
> for very small batches, yes.
> For larger batch sizes I was hoping that refilling the ring more often
> could reduce losses.
> 
> One last attempt: try using -l 64 on the sender; this will generate 64+4-byte
> packets, which may become just 64 on the receiver if the chelsio is configured
> to strip the CRC. This should result in well-aligned PCIe transactions and
> reduced PCIe traffic, which may help (the ix driver has a similar problem,
> but since it does not strip the CRC it can rx at line rate with 60 bytes but
> not with 64).
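
A hedged example of what that sender invocation might look like, reusing
the interface and MAC addresses from the earlier test:

    ./pkt-gen -i ix0 -f tx -l 64 -S 00:07:e9:44:d2:ba -D 00:07:43:33:8d:c1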

Keep hw.cxgbe.fl_pktshift in mind for these kinds of tests.  The default
value is 2, so the chip DMAs the payload at an offset of 2B from the start
of the rx buffer.  So you'll need to adjust your frame size by 2 (66B on
the wire, 62B after the CRC is removed, making it exactly 64B across PCIe
if pktshift is 2) or just set hw.cxgbe.fl_pktshift=0 in /boot/loader.conf.
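
As a rough sketch of the two options (assuming pkt-gen's -l gives the frame
length before the 4-byte CRC is appended, as the "64+4" above suggests):

    # Option 1: keep the default fl_pktshift=2 and shrink the frame by 2
    ./pkt-gen -i ix0 -f tx -l 62 -S 00:07:e9:44:d2:ba -D 00:07:43:33:8d:c1
    #   66B on the wire -> 62B after CRC strip -> 62 + 2 = 64B across PCIe

    # Option 2: disable the shift in /boot/loader.conf and keep -l 64
    hw.cxgbe.fl_pktshift=0
    #   68B on the wire -> 64B after CRC strip -> 64 + 0 = 64B across PCIe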

Regards,
Navdeep
