Date: Mon, 14 Jul 2014 12:03:54 -0700
From: Navdeep Parhar <nparhar@gmail.com>
To: John Jasem <jjasen@gmail.com>, FreeBSD Net <freebsd-net@freebsd.org>
Subject: Re: tuning routing using cxgbe and T580-CR cards?
Message-ID: <53C4299A.3000900@gmail.com>
In-Reply-To: <53C3EFDC.2030100@gmail.com>
References: <53C01EB5.6090701@gmail.com> <53C03BB4.2090203@gmail.com> <53C3EFDC.2030100@gmail.com>
Use UDP if you want more control over your experiments.

- It's easier to directly control the frame size on the wire.  There is no
  TSO, LRO, or segmentation to worry about.
- UDP has no flow control, so the transmitters will not let up even if a
  frame goes missing; TCP would go into recovery.  The lack of
  protocol-level flow control also means the transmitters cannot be
  influenced by the receivers in any way.
- Frames go only in the direction you want them to.  With TCP the receiver
  is transmitting all the time too (ACKs).

Regards,
Navdeep

On 07/14/14 07:57, John Jasem wrote:
> The two physical CPUs are:
> Intel(R) Xeon(R) CPU E5-4610 0 @ 2.40GHz (2400.05-MHz K8-class CPU)
>
> Hyperthreading, at least from initial appearances, seems to offer no
> benefits or drawbacks.
>
> I tested iperf3, using a packet generator on each subnet, each sending 4
> streams to a server on another subnet.
>
> Maximum segment sizes of 128 and 1460 were used (iperf3 -M), with little
> variance.
>
> A snapshot of netstat -d -b -w1 -W -h is included. Midway through, the
> numbers dropped; this coincides with when I launched 16 more streams
> (4 new clients and 4 new servers on different nets, 4 streams each).
>
>             input        (Total)           output
>    packets  errs idrops      bytes    packets  errs      bytes colls drops
>       1.6M     0    514       254M       1.6M     0       252M     0     5
>       1.6M     0    294       244M       1.6M     0       246M     0     6
>       1.6M     0     95       255M       1.5M     0       236M     0     6
>       1.4M     0      0       216M       1.5M     0       224M     0     3
>       1.5M     0      0       225M       1.4M     0       219M     0     4
>       1.4M     0    389       214M       1.4M     0       216M     0     1
>       1.4M     0    270       207M       1.4M     0       207M     0     1
>       1.4M     0    279       210M       1.4M     0       209M     0     2
>       1.4M     0     12       207M       1.3M     0       204M     0     1
>       1.4M     0    303       206M       1.4M     0       214M     0     2
>       1.3M     0   2.3K       190M       1.4M     0       212M     0     1
>       1.1M     0   1.1K       175M       1.1M     0       176M     0     1
>       1.1M     0   1.6K       176M       1.1M     0       175M     0     1
>       1.1M     0    830       176M       1.1M     0       174M     0     0
>       1.2M     0   1.5K       187M       1.2M     0       187M     0     0
>       1.2M     0   1.1K       183M       1.2M     0       184M     0     1
>       1.2M     0   1.5K       197M       1.2M     0       196M     0     2
>       1.3M     0   2.2K       199M       1.2M     0       196M     0     0
>       1.3M     0   2.8K       200M       1.3M     0       202M     0     4
>       1.3M     0   1.5K       199M       1.2M     0       198M     0     1
>
> vmstat output is also included; you can see similar drops in the faults
> columns.
>
> procs     memory     page                     disks     faults       cpu
> r b w    avm   fre   flt  re  pi  po   fr  sr mf0 cd0     in   sy     cs us sy id
> 0 0 0   574M   15G    82   0   0   0    0   6   0   0 188799  224 387419  0 74 26
> 0 0 0   574M   15G     2   0   0   0    0   6   0   0 207447  150 425576  0 72 28
> 0 0 0   574M   15G    82   0   0   0    0   6   0   0 205638  202 421659  0 75 25
> 0 0 0   574M   15G     2   0   0   0    0   6   0   0 200292  150 411257  0 74 26
> 0 0 0   574M   15G    82   0   0   0    0   6   0   0 200338  197 411537  0 77 23
> 0 0 0   574M   15G     2   0   0   0    0   6   0   0 199289  156 409092  0 75 25
> 0 0 0   574M   15G    82   0   0   0    0   6   0   0 200504  200 411992  0 76 24
> 0 0 0   574M   15G     2   0   0   0    0   6   0   0 165042  152 341207  0 78 22
> 0 0 0   574M   15G    82   0   0   0    0   6   0   0 171360  200 353776  0 78 22
> 0 0 0   574M   15G     2   0   0   0    0   6   0   0 197557  150 405937  0 74 26
> 0 0 0   574M   15G    82   0   0   0    0   6   0   0 170696  204 353197  0 78 22
> 0 0 0   574M   15G     2   0   0   0    0   6   0   0 174927  150 361171  0 77 23
> 0 0 0   574M   15G    82   0   0   0    0   6   0   0 153836  200 319227  0 79 21
> 0 0 0   574M   15G     2   0   0   0    0   6   0   0 159056  150 329517  0 78 22
> 0 0 0   574M   15G    82   0   0   0    0   6   0   0 155240  200 321819  0 78 22
> 0 0 0   574M   15G     2   0   0   0    0   6   0   0 166422  156 344184  0 78 22
> 0 0 0   574M   15G    82   0   0   0    0   6   0   0 162065  200 335215  0 79 21
> 0 0 0   574M   15G     2   0   0   0    0   6   0   0 172857  150 356852  0 78 22
> 0 0 0   574M   15G    82   0   0   0    0   6   0   0  81267  197 176539  0 92  8
> 0 0 0   574M   15G     2   0   0   0    0   6   0   0  82151  150 177434  0 91  9
> 0 0 0   574M   15G    82   0   0   0    0   6   0   0  73904  204 160887  0 91  9
> 0 0 0   574M   15G     2   0   0   0    8   6   0   0  73820  150 161201  0 91  9
> 0 0 0   574M   15G    82   0   0   0    0   6   0   0  73926  196 161850  0 92  8
> 0 0 0   574M   15G     2   0   0   0    0   6   0   0  77215  150 166886  0 91  9
> 0 0 0   574M   15G    82   0   0   0    0   6   0   0  77509  198 169650  0 91  9
> 0 0 0   574M   15G     2   0   0   0    0   6   0   0  69993  156 154783  0 90 10
> 0 0 0   574M   15G    82   0   0   0    0   6   0   0  69722  199 153525  0 91  9
> 0 0 0   574M   15G     2   0   0   0    0   6   0   0  66353  150 147027  0 91  9
> 0 0 0   550M   15G   102   0   0   0  101   6   0   0  67906  259 149365  0 90 10
> 0 0 0   550M   15G     0   0   0   0    0   6   0   0  71837  125 157253  0 92  8
> 0 0 0   550M   15G    80   0   0   0    0   6   0   0  73508  179 161498  0 92  8
> 0 0 0   550M   15G     0   0   0   0    0   6   0   0  72673  125 159449  0 92  8
> 0 0 0   550M   15G    80   0   0   0    0   6   0   0  75630  175 164614  0 91  9
>
>
> On 07/11/2014 03:32 PM, Navdeep Parhar wrote:
>> On 07/11/14 10:28, John Jasem wrote:
>>> In testing two Chelsio T580-CR dual port cards with FreeBSD 10-STABLE,
>>> I've been able to use a collection of clients to generate approximately
>>> 1.5-1.6 million TCP packets per second sustained, and routinely hit
>>> 10GB/s, both measured by netstat -d -b -w1 -W (I usually use -h for the
>>> quick read, accepting the loss of granularity).
>> When forwarding, the pps rate is often more interesting, and almost
>> always the limiting factor, as compared to the total amount of data
>> being passed around. 10GB at this pps probably means 9000 MTU. Try
>> with 1500 too if possible.
>>
>> "netstat -d 1" and "vmstat 1" for a few seconds when your system is
>> under maximum load would be useful. And what kind of CPU is in this
>> system?
>>
>>> While performance has so far been stellar, and I'm honestly speculating
>>> that I will need more CPU depth and horsepower to get much faster, I'm
>>> curious whether there is any gain from tweaking performance settings.
>>> I'm seeing, under multiple streams, with N targets connecting to N
>>> servers, interrupts on all CPUs peg at 99-100%, and I'm curious whether
>>> tweaking configs will help, or if it's a free clue to get more
>>> horsepower.
>>>
>>> So far, except for temporarily turning off pflogd and setting the
>>> following sysctl variables, I've not done any performance tuning on the
>>> system yet.
>>>
>>> /etc/sysctl.conf
>>> net.inet.ip.fastforwarding=1
>>> kern.random.sys.harvest.ethernet=0
>>> kern.random.sys.harvest.point_to_point=0
>>> kern.random.sys.harvest.interrupt=0
>>>
>>> a) One of the first things I did in prior testing was to turn
>>> hyperthreading off. I presume this is still prudent, as HT doesn't help
>>> with interrupt handling?
>> It is always worthwhile to try your workload with and without
>> hyperthreading.
>>
>>> b) I briefly experimented with using cpuset(1) to stick interrupts to
>>> physical CPUs, but it offered no performance enhancements, and indeed
>>> appeared to decrease performance by 10-20%. Has anyone else tried this?
>>> What were your results?
>>>
>>> c) the defaults for the cxgbe driver appear to be 8 rx queues, and N tx
>>> queues, with N being the number of CPUs detected. For a system running
>>> multiple cards, routing or firewalling, does this make sense, or would
>>> balancing tx and rx be more ideal? And would reducing queues per card
>>> based on NUMBER-CPUS and NUM-CHELSIO-PORTS make sense at all?
>> The defaults are nrxq = min(8, ncores) and ntxq = min(16, ncores). The
>> man page mentions this. The reason for 8 vs. 16 is that tx queues are
>> "cheaper" as they don't have to be backed by rx buffers. A tx queue only
>> needs some memory for its descriptor ring and some hardware resources.
>>
>> It appears that your system has >= 16 cores. For forwarding it probably
>> makes sense to have nrxq = ntxq. If you're left with 8 or fewer cores
>> after disabling hyperthreading you'll automatically get 8 rx and tx
>> queues. Otherwise you'll have to fiddle with the hw.cxgbe.nrxq10g and
>> ntxq10g tunables (documented in the man page).
>>
>>> d) dev.cxl.$PORT.qsize_rxq: 1024 and dev.cxl.$PORT.qsize_txq: 1024.
>>> These appear not to be writeable when if_cxgbe is loaded, so I
>>> speculate they are not to be messed with, or are loader.conf variables?
>>> Is there any benefit to messing with them?
>> They can't be changed after the port has been administratively brought
>> up even once. This is mentioned in the man page. I don't really
>> recommend changing them anyway.
>>
>>> e) dev.t5nex.$CARD.toe.sndbuf: 262144. These are writeable, but messing
>>> with the values did not yield an immediate benefit. Am I barking up the
>>> wrong tree by trying?
>> The TOE tunables won't make a difference unless you have enabled TOE,
>> the TCP endpoints lie on the system, and the connections are being
>> handled by the TOE on the chip. This is not the case on your systems.
>> The driver does not enable TOE by default and the only way to use it is
>> to switch it on explicitly. There is no possibility that you're using
>> it without knowing that you are.
>>
>>> f) based on prior experiments with other vendors, I tried tweaks to
>>> net.isr.* settings, but did not see any benefits worth discussing. Am I
>>> correct in this speculation, based on others' experience?
>>>
>>> g) Are there other settings I should be looking at, that may squeeze
>>> out a few more packets?
>> The pps rates that you've observed are at least an order of magnitude
>> below the chip's hardware limits. Tuning the kernel rather than the
>> driver may be the best bang for your buck.
>>
>> Regards,
>> Navdeep
>>
>>> Thanks in advance!
>>>
>>> --
>>> John Jasen (jjasen@gmail.com)
>
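
For readers who want to repeat John's measurements the way Navdeep suggests
at the top of this message (UDP, so the frame size and transmit rate stay
under the tester's control), a minimal iperf3 invocation along those lines
might look like the sketch below. The host name, stream count, bandwidth cap,
and duration are placeholders, not values taken from this thread:

    # on each receiver
    iperf3 -s

    # on each generator: 4 parallel UDP streams of 1472-byte datagrams, so a
    # 1500-byte MTU carries full-size frames (1472 + 8 UDP + 20 IP = 1500)
    iperf3 -c <server-on-another-subnet> -u -P 4 -l 1472 -b 2G -t 60

Smaller -l values approximate the small-frame, pps-bound case that Navdeep
notes is usually the limiting factor when forwarding.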
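
Similarly, for anyone acting on Navdeep's answer to (c): the hw.cxgbe.nrxq10g
and hw.cxgbe.ntxq10g knobs he mentions are loader tunables, so they would
normally be set in /boot/loader.conf before the driver attaches. The value 16
below is only an illustration for a box with 16 or more cores, not a
recommendation from this thread; see cxgbe(4) for the full list of tunables
and their defaults:

    # /boot/loader.conf
    hw.cxgbe.nrxq10g="16"    # rx queues per 10G port (example value)
    hw.cxgbe.ntxq10g="16"    # tx queues per 10G port (example value)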