Date: Mon, 14 Jul 2014 10:57:32 -0400
From: John Jasem <jjasen@gmail.com>
To: Navdeep Parhar <nparhar@gmail.com>, FreeBSD Net <freebsd-net@freebsd.org>
Subject: Re: tuning routing using cxgbe and T580-CR cards?
Message-ID: <53C3EFDC.2030100@gmail.com>
In-Reply-To: <53C03BB4.2090203@gmail.com>
References: <53C01EB5.6090701@gmail.com> <53C03BB4.2090203@gmail.com>
The two physical CPUs are: Intel(R) Xeon(R) CPU E5-4610 0 @ 2.40GHz
(2400.05-MHz K8-class CPU). Hyperthreading, at least from initial
appearances, seems to offer no benefits or drawbacks.

I tested with iperf3, using a packet generator on each subnet, each
sending 4 streams to a server on another subnet. Maximum segment sizes
of 128 and 1460 were used (iperf3 -M), with little variance between the
two. (A representative client invocation is sketched after the tables
below.)

A snapshot of netstat -d -b -w1 -W -h is included. Midway through, the
numbers dropped; this coincides with launching 16 more streams: 4 new
clients and 4 new servers on different nets, 4 streams each.

            input        (Total)           output
   packets  errs idrops      bytes    packets  errs      bytes colls drops
      1.6M     0    514       254M       1.6M     0       252M     0     5
      1.6M     0    294       244M       1.6M     0       246M     0     6
      1.6M     0     95       255M       1.5M     0       236M     0     6
      1.4M     0      0       216M       1.5M     0       224M     0     3
      1.5M     0      0       225M       1.4M     0       219M     0     4
      1.4M     0    389       214M       1.4M     0       216M     0     1
      1.4M     0    270       207M       1.4M     0       207M     0     1
      1.4M     0    279       210M       1.4M     0       209M     0     2
      1.4M     0     12       207M       1.3M     0       204M     0     1
      1.4M     0    303       206M       1.4M     0       214M     0     2
      1.3M     0   2.3K       190M       1.4M     0       212M     0     1
      1.1M     0   1.1K       175M       1.1M     0       176M     0     1
      1.1M     0   1.6K       176M       1.1M     0       175M     0     1
      1.1M     0    830       176M       1.1M     0       174M     0     0
      1.2M     0   1.5K       187M       1.2M     0       187M     0     0
      1.2M     0   1.1K       183M       1.2M     0       184M     0     1
      1.2M     0   1.5K       197M       1.2M     0       196M     0     2
      1.3M     0   2.2K       199M       1.2M     0       196M     0     0
      1.3M     0   2.8K       200M       1.3M     0       202M     0     4
      1.3M     0   1.5K       199M       1.2M     0       198M     0     1

vmstat output is also included; you can see similar drops in faults.

 procs     memory     page                      disks    faults        cpu
 r b w    avm   fre   flt  re  pi  po  fr  sr mf0 cd0     in  sy     cs us sy id
 0 0 0   574M   15G    82   0   0   0   0   6   0   0 188799 224 387419  0 74 26
 0 0 0   574M   15G     2   0   0   0   0   6   0   0 207447 150 425576  0 72 28
 0 0 0   574M   15G    82   0   0   0   0   6   0   0 205638 202 421659  0 75 25
 0 0 0   574M   15G     2   0   0   0   0   6   0   0 200292 150 411257  0 74 26
 0 0 0   574M   15G    82   0   0   0   0   6   0   0 200338 197 411537  0 77 23
 0 0 0   574M   15G     2   0   0   0   0   6   0   0 199289 156 409092  0 75 25
 0 0 0   574M   15G    82   0   0   0   0   6   0   0 200504 200 411992  0 76 24
 0 0 0   574M   15G     2   0   0   0   0   6   0   0 165042 152 341207  0 78 22
 0 0 0   574M   15G    82   0   0   0   0   6   0   0 171360 200 353776  0 78 22
 0 0 0   574M   15G     2   0   0   0   0   6   0   0 197557 150 405937  0 74 26
 0 0 0   574M   15G    82   0   0   0   0   6   0   0 170696 204 353197  0 78 22
 0 0 0   574M   15G     2   0   0   0   0   6   0   0 174927 150 361171  0 77 23
 0 0 0   574M   15G    82   0   0   0   0   6   0   0 153836 200 319227  0 79 21
 0 0 0   574M   15G     2   0   0   0   0   6   0   0 159056 150 329517  0 78 22
 0 0 0   574M   15G    82   0   0   0   0   6   0   0 155240 200 321819  0 78 22
 0 0 0   574M   15G     2   0   0   0   0   6   0   0 166422 156 344184  0 78 22
 0 0 0   574M   15G    82   0   0   0   0   6   0   0 162065 200 335215  0 79 21
 0 0 0   574M   15G     2   0   0   0   0   6   0   0 172857 150 356852  0 78 22
 0 0 0   574M   15G    82   0   0   0   0   6   0   0  81267 197 176539  0 92  8
 0 0 0   574M   15G     2   0   0   0   0   6   0   0  82151 150 177434  0 91  9
 0 0 0   574M   15G    82   0   0   0   0   6   0   0  73904 204 160887  0 91  9
 0 0 0   574M   15G     2   0   0   0   8   6   0   0  73820 150 161201  0 91  9
 0 0 0   574M   15G    82   0   0   0   0   6   0   0  73926 196 161850  0 92  8
 0 0 0   574M   15G     2   0   0   0   0   6   0   0  77215 150 166886  0 91  9
 0 0 0   574M   15G    82   0   0   0   0   6   0   0  77509 198 169650  0 91  9
 0 0 0   574M   15G     2   0   0   0   0   6   0   0  69993 156 154783  0 90 10
 0 0 0   574M   15G    82   0   0   0   0   6   0   0  69722 199 153525  0 91  9
 0 0 0   574M   15G     2   0   0   0   0   6   0   0  66353 150 147027  0 91  9
 0 0 0   550M   15G   102   0   0   0 101   6   0   0  67906 259 149365  0 90 10
 0 0 0   550M   15G     0   0   0   0   0   6   0   0  71837 125 157253  0 92  8
 0 0 0   550M   15G    80   0   0   0   0   6   0   0  73508 179 161498  0 92  8
 0 0 0   550M   15G     0   0   0   0   0   6   0   0  72673 125 159449  0 92  8
 0 0 0   550M   15G    80   0   0   0   0   6   0   0  75630 175 164614  0 91  9
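For reference, the iperf3 runs above can be reproduced with something
along these lines; the addresses and test duration below are placeholders,
not the exact values used here:

    # on each receiving host
    iperf3 -s

    # on each generator, one per source subnet; repeat with -M 128
    iperf3 -c 10.0.1.10 -P 4 -M 1460 -t 300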
On 07/11/2014 03:32 PM, Navdeep Parhar wrote:
> On 07/11/14 10:28, John Jasem wrote:
>> In testing two Chelsio T580-CR dual port cards with FreeBSD 10-STABLE,
>> I've been able to use a collection of clients to generate approximately
>> 1.5-1.6 million TCP packets per second sustained, and routinely hit
>> 10GB/s, both measured by netstat -d -b -w1 -W (I usually use -h for the
>> quick read, accepting the loss of granularity).
> When forwarding, the pps rate is often more interesting, and almost
> always the limiting factor, as compared to the total amount of data
> being passed around. 10GB at this pps probably means 9000 MTU. Try
> with 1500 too if possible.
>
> "netstat -d 1" and "vmstat 1" for a few seconds when your system is
> under maximum load would be useful. And what kind of CPU is in this system?
>
>> While performance has so far been stellar, and I'm honestly speculating
>> I will need more CPU depth and horsepower to get much faster, I'm
>> curious if there is any gain to tweaking performance settings. I'm
>> seeing, under multiple streams, with N targets connecting to N servers,
>> interrupts on all CPUs peg at 99-100%, and I'm curious if tweaking
>> configs will help, or it's a free clue to get more horsepower.
>>
>> So far, except for temporarily turning off pflogd and setting the
>> following sysctl variables, I've not done any performance tuning on the
>> system yet.
>>
>> /etc/sysctl.conf
>> net.inet.ip.fastforwarding=1
>> kern.random.sys.harvest.ethernet=0
>> kern.random.sys.harvest.point_to_point=0
>> kern.random.sys.harvest.interrupt=0
>>
>> a) One of the first things I did in prior testing was to turn
>> hyperthreading off. I presume this is still prudent, as HT doesn't help
>> with interrupt handling?
> It is always worthwhile to try your workload with and without
> hyperthreading.
>
>> b) I briefly experimented with using cpuset(1) to stick interrupts to
>> physical CPUs, but it offered no performance enhancements, and indeed,
>> appeared to decrease performance by 10-20%. Has anyone else tried this?
>> What were your results?
>>
>> c) the defaults for the cxgbe driver appear to be 8 rx queues, and N tx
>> queues, with N being the number of CPUs detected. For a system running
>> multiple cards, routing or firewalling, does this make sense, or would
>> balancing tx and rx be more ideal? And would reducing queues per card
>> based on NUMBER-CPUS and NUM-CHELSIO-PORTS make sense at all?
> The defaults are nrxq = min(8, ncores) and ntxq = min(16, ncores). The
> man page mentions this. The reason for 8 vs. 16 is that tx queues are
> "cheaper" as they don't have to be backed by rx buffers. A tx queue only
> needs some memory for the tx descriptor ring and some hardware resources.
>
> It appears that your system has >= 16 cores. For forwarding it probably
> makes sense to have nrxq = ntxq. If you're left with 8 or fewer cores
> after disabling hyperthreading you'll automatically get 8 rx and tx
> queues. Otherwise you'll have to fiddle with the hw.cxgbe.nrxq10g and
> ntxq10g tunables (documented in the man page).
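[For anyone following along: the queue-count knobs Navdeep mentions are
boot-time tunables, so they would go in /boot/loader.conf, for example as
below. The counts are purely illustrative, not a recommendation; see
cxgbe(4) for the documented defaults.

    # /boot/loader.conf -- illustrative values only
    hw.cxgbe.nrxq10g="16"
    hw.cxgbe.ntxq10g="16"
]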
>
>> d) dev.cxl.$PORT.qsize_rxq: 1024 and dev.cxl.$PORT.qsize_txq: 1024.
>> These appear to not be writeable when if_cxgbe is loaded, so I speculate
>> they are not to be messed with, or are loader.conf variables? Is there
>> any benefit to messing with them?
> Can't change them after the port has been administratively brought up
> even once. This is mentioned in the man page. I don't really recommend
> changing them anyway.
>
>> e) dev.t5nex.$CARD.toe.sndbuf: 262144. These are writeable, but messing
>> with values did not yield an immediate benefit. Am I barking up the
>> wrong tree, trying?
> The TOE tunables won't make a difference unless you have enabled TOE,
> the TCP endpoints lie on the system, and the connections are being
> handled by the TOE on the chip. This is not the case on your systems.
> The driver does not enable TOE by default and the only way to use it is
> to switch it on explicitly. There is no possibility that you're using
> it without knowing that you are.
>
>> f) based on prior experiments with other vendors, I tried tweaks to
>> net.isr.* settings, but did not see any benefits worth discussing. Am I
>> correct in this speculation, based on others' experience?
>>
>> g) Are there other settings I should be looking at, that may squeeze out
>> a few more packets?
> The pps rates that you've observed are within the chip's hardware limits
> by at least an order of magnitude. Tuning the kernel rather than the
> driver may be the best bang for your buck.
>
> Regards,
> Navdeep
>
>> Thanks in advance!
>>
>> -- John Jasen (jjasen@gmail.com)
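[Footnote on (b): the interrupt-pinning experiment can be reproduced
roughly as follows. The IRQ and CPU numbers are placeholders and would
need to be read off the system in question first; on this hardware the
per-queue interrupts should appear under the t5nex device in vmstat -i.

    # list the card's queue interrupts
    vmstat -i | grep t5nex

    # pin one queue interrupt to one CPU (IRQ 264 and CPU 2 are examples)
    cpuset -l 2 -x 264
]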