Date: Tue, 22 Jul 2014 12:30:27 -0700
From: Adrian Chadd <adrian@freebsd.org>
To: John Jasen <jjasen@gmail.com>, FreeBSD Net <freebsd-net@freebsd.org>
Cc: Navdeep Parhar <nparhar@gmail.com>
Subject: Re: fastforward/routing: a 3 million packet-per-second system?
Message-ID: <CAJ-Vmokje1m-LGm6B9M9t5Q4BW8JcVWbkDXyKMEVzVa+8reDBw@mail.gmail.com>
In-Reply-To: <53CEB9B5.7020609@gmail.com>
References: <53CE80DD.9090109@gmail.com>
 <CAJ-VmomWpc=3dtasbDhhrUpGywPio3_9W2b-RTAeJjq3nahhOQ@mail.gmail.com>
 <53CEB090.7030701@gmail.com>
 <CAJ-Vmok8eu-GhaNa+i+BLv1ZLtKQt4yNfU7ZXW3H+Y=2HFj=1w@mail.gmail.com>
 <53CEB670.9060600@gmail.com>
 <CAJ-VmonhCg9TvQArtP51rAUjFSe4FpFL8SNCTS6jNwk_Esk+EA@mail.gmail.com>
 <53CEB9B5.7020609@gmail.com>
hi!

You can use 'pmcstat -S CPU_CLK_UNHALTED_CORE -O pmc.out' (then ctrl-C it
after say 5 seconds), which will log the data to pmc.out; then
'pmcannotate -k /boot/kernel pmc.out /boot/kernel/kernel' to find out where
the most CPU cycles are being spent. It should give us the location(s)
inside the top CPU users.

Hopefully it'll then be much more obvious! I'm glad you're digging into it!


-a


On 22 July 2014 12:21, John Jasen <jjasen@gmail.com> wrote:
> Navdeep;
>
> I was struck by spending so much time in transmit as well.
>
> Adrian's suggestion on mining lock profiling gave me an excuse to up the
> tx queues in /boot/loader.conf. Our prior conversations indicated that
> up to 64 should be acceptable?
>
>
>
>
> On 07/22/2014 03:10 PM, Adrian Chadd wrote:
>> Hi
>>
>> Right. Time to figure out why you're spending so much time in
>> cxgbe_transmit() and t4_eth_tx(). Maybe ask Navdeep for some ideas?
>>
>>
>> -a
>>
>> On 22 July 2014 12:07, John Jasen <jjasen@gmail.com> wrote:
>>> The first is pretty easy to turn around. Reading up on dtrace now. Thanks
>>> for the pointers and help!
>>>
>>> PMC: [CPU_CLK_UNHALTED_CORE] Samples: 142654 (100.0%), 124560 unresolved
>>>
>>> %SAMP IMAGE      FUNCTION              CALLERS
>>>  34.0 if_cxgbe.k t4_eth_tx             cxgbe_transmit:24.0 t4_tx_task:10.0
>>>  28.8 if_cxgbe.k cxgbe_transmit
>>>   7.6 if_cxgbe.k service_iq            t4_intr
>>>   6.4 if_cxgbe.k get_scatter_segment   service_iq
>>>   4.9 if_cxgbe.k reclaim_tx_descs      t4_eth_tx
>>>   3.2 if_cxgbe.k write_sgl_to_txd      t4_eth_tx
>>>   2.8 hwpmc.ko   pmclog_process_callc  pmc_process_samples
>>>   2.1 libc.so.7  bcopy                 pmclog_read
>>>   1.9 if_cxgbe.k t4_eth_rx             service_iq
>>>   1.7 hwpmc.ko   pmclog_reserve        pmclog_process_callchain
>>>   1.2 libpmc.so. pmclog_read
>>>   1.0 if_cxgbe.k write_txpkts_wr       t4_eth_tx
>>>   0.8 kernel     e1000_read_i2c_byte_  e1000_set_i2c_bb
>>>   0.6 libc.so.7  memset
>>>   0.5 if_cxgbe.k refill_fl             service_iq
>>>
>>>
>>>
>>> On 07/22/2014 02:45 PM, Adrian Chadd wrote:
>>>> Hi,
>>>>
>>>> Well, start with CPU profiling with pmc:
>>>>
>>>> kldload hwpmc
>>>> pmcstat -S CPU_CLK_UNHALTED_CORE -T -w 1
>>>>
>>>> .. look at the freebsd dtrace one-liners (google that) for lock
>>>> contention and CPU usage.
>>>>
>>>> You could compile with LOCK_PROFILING (which will slow things down a
>>>> little even when not in use) then enable it for a few seconds (which
>>>> will definitely slow things down) to gather fine-grained lock
>>>> contention data.
>>>>
>>>> (sysctl debug.lock.prof when you have it compiled in. sysctl
>>>> debug.lock.prof.enable=1; wait 10 seconds; sysctl
>>>> debug.lock.prof.enable=0; sysctl debug.lock.prof.stats)
>>>>
>>>>
>>>> -a
>>>>
>>>>
>>>> On 22 July 2014 11:42, John Jasen <jjasen@gmail.com> wrote:
>>>>> If you have ideas on what to instrument, I'll be happy to do it.
>>>>>
>>>>> I'm faintly familiar with dtrace, et al, so it might take me a few tries
>>>>> to get it right -- or bludgeoning with the documentation.
>>>>>
>>>>> Thanks!
>>>>>
>>>>>
>>>>>
>>>>> On 07/22/2014 02:07 PM, Adrian Chadd wrote:
>>>>>> Hi!
>>>>>>
>>>>>> Well, what's missing is some dtrace/pmc/lock-debugging investigation
>>>>>> into the system to see where it's currently maxing out.
>>>>>>
>>>>>> I wonder if you're seeing contention on the transmit paths as drivers
>>>>>> queue frames from one set of driver threads/queues to another
>>>>>> potentially completely different set of driver transmit
>>>>>> threads/queues.
>>>>>>
>>>>>>
>>>>>>
>>>>>> -a
>>>>>>
>>>>>>
>>>>>> On 22 July 2014 08:18, John Jasen <jjasen@gmail.com> wrote:
>>>>>>> Feedback and/or tips and tricks more than welcome.
>>>>>>>
>>>>>>> Outstanding questions:
>>>>>>>
>>>>>>> Would increasing the number of processor cores help?
>>>>>>>
>>>>>>> Would a system where both processor QPI ports connect to each other
>>>>>>> mitigate QPI bottlenecks?
>>>>>>>
>>>>>>> Are there further performance optimizations I am missing?
>>>>>>>
>>>>>>> Server Description:
>>>>>>>
>>>>>>> The system in question is a Dell Poweredge R820, 16GB of RAM, and two
>>>>>>> Intel(R) Xeon(R) CPU E5-4610 0 @ 2.40GHz.
>>>>>>>
>>>>>>> Onboard, in a 16x PCIe slot, I have one Chelsio T-580-CR two-port 40GbE
>>>>>>> NIC, and in an 8x slot, another T-580-CR dual port.
>>>>>>>
>>>>>>> I am running FreeBSD 10.0-STABLE.
>>>>>>>
>>>>>>> BIOS tweaks:
>>>>>>>
>>>>>>> Hyperthreading (or Logical Processors) is turned off.
>>>>>>> Memory Node Interleaving is turned off, but did not appear to impact
>>>>>>> performance.
>>>>>>>
>>>>>>> /boot/loader.conf contents:
>>>>>>> #for CARP+PF testing
>>>>>>> carp_load="YES"
>>>>>>> #load cxgbe drivers.
>>>>>>> cxgbe_load="YES"
>>>>>>> #maxthreads appears to not exceed CPU.
>>>>>>> net.isr.maxthreads=12
>>>>>>> #bindthreads may be indicated when using cpuset(1) on interrupts
>>>>>>> net.isr.bindthreads=1
>>>>>>> #random guess based on googling
>>>>>>> net.isr.maxqlimit=60480
>>>>>>> net.link.ifqmaxlen=90000
>>>>>>> #discussions with cxgbe maintainer and list led me to trying this. Allows more interrupts
>>>>>>> #to be fixed to CPUs, which in some cases, improves interrupt balancing.
>>>>>>> hw.cxgbe.ntxq10g=16
>>>>>>> hw.cxgbe.nrxq10g=16
>>>>>>>
>>>>>>> /etc/sysctl.conf contents:
>>>>>>>
>>>>>>> #the following is also enabled by rc.conf gateway_enable.
>>>>>>> net.inet.ip.fastforwarding=1
>>>>>>> #recommendations from BSD router project
>>>>>>> kern.random.sys.harvest.ethernet=0
>>>>>>> kern.random.sys.harvest.point_to_point=0
>>>>>>> kern.random.sys.harvest.interrupt=0
>>>>>>> #probably should be removed, as cxgbe does not seem to affect/be affected by irq storm settings
>>>>>>> hw.intr_storm_threshold=25000000
>>>>>>> #based on Calomel.Org performance suggestions. 4x40GbE, seemed reasonable to use 100GbE settings
>>>>>>> kern.ipc.maxsockbuf=1258291200
>>>>>>> net.inet.tcp.recvbuf_max=1258291200
>>>>>>> net.inet.tcp.sendbuf_max=1258291200
>>>>>>> #attempting to play with ULE scheduler, making it serve packets versus netstat
>>>>>>> kern.sched.slice=1
>>>>>>> kern.sched.interact=1
>>>>>>>
>>>>>>> /etc/rc.conf contains:
>>>>>>>
>>>>>>> hostname="fbge1"
>>>>>>> #should remove, especially given below duplicate entry
>>>>>>> ifconfig_igb0="DHCP"
>>>>>>> sshd_enable="YES"
>>>>>>> #ntpd_enable="YES"
>>>>>>> # Set dumpdev to "AUTO" to enable crash dumps, "NO" to disable
>>>>>>> dumpdev="AUTO"
>>>>>>> # OpenBSD PF options to play with later. very bad for raw packet rates.
>>>>>>> #pf_enable="YES"
>>>>>>> #pflog_enable="YES"
>>>>>>> # enable packet forwarding
>>>>>>> # these enable forwarding and fastforwarding sysctls. inet6 does not have fastforward
>>>>>>> gateway_enable="YES"
>>>>>>> ipv6_gateway_enable="YES"
>>>>>>> # enable OpenBSD ftp-proxy
>>>>>>> # should comment out until actively playing with PF
>>>>>>> ftpproxy_enable="YES"
>>>>>>> #left in place, commented out from prior testing
>>>>>>> #ifconfig_mlxen1="inet 172.16.2.1 netmask 255.255.255.0 mtu 9000"
>>>>>>> #ifconfig_mlxen0="inet 172.16.1.1 netmask 255.255.255.0 mtu 9000"
>>>>>>> #ifconfig_mlxen3="inet 172.16.7.1 netmask 255.255.255.0 mtu 9000"
>>>>>>> #ifconfig_mlxen2="inet 172.16.8.1 netmask 255.255.255.0 mtu 9000"
>>>>>>> # -lro and -tso options added per mailing list suggestion from Bjoern A. Zeeb (bzeeb-lists at lists.zabbadoz.net)
>>>>>>> ifconfig_cxl0="inet 172.16.3.1 netmask 255.255.255.0 mtu 9000 -lro -tso up"
>>>>>>> ifconfig_cxl1="inet 172.16.4.1 netmask 255.255.255.0 mtu 9000 -lro -tso up"
>>>>>>> ifconfig_cxl2="inet 172.16.5.1 netmask 255.255.255.0 mtu 9000 -lro -tso up"
>>>>>>> ifconfig_cxl3="inet 172.16.6.1 netmask 255.255.255.0 mtu 9000 -lro -tso up"
>>>>>>> # aliases instead of reconfiguring test clients. See above commented out entries
>>>>>>> ifconfig_cxl0_alias0="172.16.7.1 netmask 255.255.255.0"
>>>>>>> ifconfig_cxl1_alias0="172.16.8.1 netmask 255.255.255.0"
>>>>>>> ifconfig_cxl2_alias0="172.16.1.1 netmask 255.255.255.0"
>>>>>>> ifconfig_cxl3_alias0="172.16.2.1 netmask 255.255.255.0"
>>>>>>> # for remote monitoring/admin of the test device
>>>>>>> ifconfig_igb0="inet 172.30.60.60 netmask 255.255.0.0"
>>>>>>>
>>>>>>> Additional configurations:
>>>>>>> cpuset-chelsio-6cpu-high
>>>>>>> # Original provided by Navdeep Parhar <nparhar@gmail.com>
>>>>>>> # takes vmstat -ai output into a list, and assigns interrupts in order to
>>>>>>> # the available CPU cores.
>>>>>>> # Modified: to assign only to the 'high CPUs', ie: on core1.
>>>>>>> # See: http://lists.freebsd.org/pipermail/freebsd-net/2014-July/039317.html
>>>>>>> #!/usr/local/bin/bash
>>>>>>> ncpu=12
>>>>>>> irqlist=$(vmstat -ia | egrep 't4nex|t5nex|cxgbc' | cut -f1 -d: | cut -c4-)
>>>>>>> i=6
>>>>>>> for irq in $irqlist; do
>>>>>>>     cpuset -l $i -x $irq
>>>>>>>     i=$((i+1))
>>>>>>>     [ $i -ge $ncpu ] && i=6
>>>>>>> done
>>>>>>>
>>>>>>> Client Description:
>>>>>>>
>>>>>>> Two Intel(R) Xeon(R) CPU E5-2670 v2 @ 2.50GHz processors
>>>>>>> 64 GB ram
>>>>>>> Mellanox Technologies MT27500 Family [ConnectX-3]
>>>>>>> Centos 6.4 with updates
>>>>>>> iperf3 installed from yum repositories: iperf3-3.0.3-3.el6.x86_64
>>>>>>>
>>>>>>> Test setup:
>>>>>>>
>>>>>>> I've found about 3 streams between Centos clients is about the best way
>>>>>>> to get the most out of them.
>>>>>>> Above certain points, the -b flag does not change results.
>>>>>>> -N is an artifact from using TCP
>>>>>>> -l is needed, as -M doesn't work for UDP.
>>>>>>>
>>>>>>> I usually use launch scripts similar to the following:
>>>>>>>
>>>>>>> for i in `seq 41 60`; do ssh loader$i "export TIME=120; export STREAMS=1; export PORT=52$i; export PKT=64; export RATE=2000m; /root/iperf-test-8port-udp" & done
>>>>>>>
>>>>>>> The scripts execute the following on each host.
>>>>>>>
>>>>>>> #!/bin/bash
>>>>>>> PORT1=$PORT
>>>>>>> PORT2=$(($PORT+1000))
>>>>>>> PORT3=$(($PORT+2000))
>>>>>>> iperf3 -c loader41-40gbe -u -b 10000m -i 0 -N -l $PKT -t$TIME -P$STREAMS -p$PORT1 &
>>>>>>> iperf3 -c loader42-40gbe -u -b 10000m -i 0 -N -l $PKT -t$TIME -P$STREAMS -p$PORT1 &
>>>>>>> iperf3 -c loader43-40gbe -u -b 10000m -i 0 -N -l $PKT -t$TIME -P$STREAMS -p$PORT1 &
>>>>>>> ... (through all clients and all three ports) ...
>>>>>>> iperf3 -c loader60-40gbe -u -b 10000m -i 0 -N -l $PKT -t$TIME -P$STREAMS -p$PORT3 &
>>>>>>>
>>>>>>>
>>>>>>> Results:
>>>>>>>
>>>>>>> Summarized, netstat -w 1 -q 240 -bd, run through:
>>>>>>> cat test4-tuning | egrep -v {'packets | input '} | awk '{ipackets+=$1}
>>>>>>> {idrops+=$3} {opackets+=$5} {odrops+=$9} END {print "input "
>>>>>>> ipackets/NR, "idrops " idrops/NR, "opackets " opackets/NR, "odrops "
>>>>>>> odrops/NR}'
>>>>>>>
>>>>>>> input 1.10662e+07 idrops 8.01783e+06 opackets 3.04516e+06 odrops 3152.4
>>>>>>>
>>>>>>> Snapshot of raw output:
>>>>>>>
>>>>>>>              input        (Total)          output
>>>>>>>    packets  errs   idrops      bytes  packets  errs      bytes colls  drops
>>>>>>>   11189148     0  7462453 1230805216  3725006     0  409750710     0    799
>>>>>>>   10527505     0  6746901 1158024978  3779096     0  415700708     0    127
>>>>>>>   10606163     0  6850760 1166676673  3751780     0  412695761     0   1535
>>>>>>>   10749324     0  7132014 1182425799  3617558     0  397930956     0   5972
>>>>>>>   10695667     0  7022717 1176521907  3669342     0  403627236     0   1461
>>>>>>>   10441173     0  6762134 1148528662  3675048     0  404255540     0   6021
>>>>>>>   10683773     0  7005635 1175215014  3676962     0  404465671     0   2606
>>>>>>>   10869859     0  7208696 1195683372  3658432     0  402427698     0    979
>>>>>>>   11948989     0  8310926 1314387881  3633773     0  399714986     0    725
>>>>>>>   12426195     0  8864415 1366877194  3562311     0  391853156     0   2762
>>>>>>>   13006059     0  9432389 1430661751  3570067     0  392706552     0   5158
>>>>>>>   12822243     0  9098871 1410443600  3715177     0  408668500     0   4064
>>>>>>>   13317864     0  9683602 1464961374  3632156     0  399536131     0   3684
>>>>>>>   13701905     0 10182562 1507207982  3523101     0  387540859     0   8690
>>>>>>>   13820227     0 10244870 1520221820  3562038     0  391823322     0   2426
>>>>>>>   14437060     0 10955483 1588073033  3480105     0  382810557     0   2619
>>>>>>>   14518471     0 11119573 1597028105  3397439     0  373717355     0   5691
>>>>>>>   14890287     0 11675003 1637926521  3199812     0  351978304     0  11007
>>>>>>>   14923610     0 11749091 1641594441  3171436     0  348857468     0   7389
>>>>>>>   14738704     0 11609730 1621254991  3117715     0  342948394     0   2597
>>>>>>>   14753975     0 11549735 1622935026  3207393     0  352812846     0   4798
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>> _______________________________________________
>>>>>>> freebsd-net@freebsd.org mailing list
>>>>>>> http://lists.freebsd.org/mailman/listinfo/freebsd-net
>>>>>>> To unsubscribe, send any mail to "freebsd-net-unsubscribe@freebsd.org"
>
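For reference, the sampling recipe from the top of this message and from the quoted 02:45 PM mail, collected into one runnable sequence. This is only a minimal sketch of what the thread describes, not something run in the thread itself: it assumes root, a kernel with symbols at /boot/kernel/kernel, and that hwpmc is not already loaded; the pmc.out file name and every command are taken verbatim from the messages above.

    # load the hwpmc(4) driver (from the quoted 02:45 PM message)
    kldload hwpmc

    # sample unhalted core cycles to a file; stop it with ctrl-C after ~5 seconds
    pmcstat -S CPU_CLK_UNHALTED_CORE -O pmc.out

    # annotate the samples against the kernel to see where the cycles go
    # (exact invocation from the top of this message)
    pmcannotate -k /boot/kernel pmc.out /boot/kernel/kernel

    # live, top(1)-style view -- this is what produced the %SAMP table above
    pmcstat -S CPU_CLK_UNHALTED_CORE -T -w 1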
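Similarly, the lock-profiling procedure Adrian spells out in the quoted 02:45 PM mail, as one sequence. Again a sketch under that message's stated assumptions: the kernel must be rebuilt with LOCK_PROFILING (which adds some overhead even when disabled), and the 10-second window is the one suggested there.

    # these sysctls only exist on a kernel built with LOCK_PROFILING
    sysctl debug.lock.prof.enable=1
    sleep 10
    sysctl debug.lock.prof.enable=0
    sysctl debug.lock.prof.stats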