From: Maxim Sobolev <sobomax@sippysoft.com>
Date: Wed, 12 Aug 2015 08:05:33 -0700
Subject: Re: Poor high-PPS performance of the 10G ixgbe(9) NIC/driver in FreeBSD 10.1
To: Luigi Rizzo
Cc: Babak Farrokhi, "Alexander V. Chernikov", Olivier Cochard-Labbé,
    freebsd@intel.com, Jev Björsell, FreeBSD Net <freebsd-net@freebsd.org>

igb0@pci0:7:0:0: class=0x020000 card=0x153315d9 chip=0x15338086 rev=0x03 hdr=0x00
    vendor   = 'Intel Corporation'
    device   = 'I210 Gigabit Network Connection'
    class    = network
    subclass = ethernet
igb1@pci0:8:0:0: class=0x020000 card=0x153315d9 chip=0x15338086 rev=0x03 hdr=0x00
    vendor   = 'Intel Corporation'
    device   = 'I210 Gigabit Network Connection'
    class    = network
    subclass = ethernet

On Wed, Aug 12, 2015 at 8:03 AM, Maxim Sobolev wrote:

> Ok, so my current settings are:
>
> hw.ix.max_interrupt_rate: 20000
> dev.ix.0.queue0.interrupt_rate: 20000
> dev.ix.0.queue1.interrupt_rate: 20000
> dev.ix.0.queue2.interrupt_rate: 20000
> dev.ix.0.queue3.interrupt_rate: 20000
> dev.ix.0.queue4.interrupt_rate: 20000
> dev.ix.0.queue5.interrupt_rate: 20000
> dev.ix.1.queue0.interrupt_rate: 20000
> dev.ix.1.queue1.interrupt_rate: 20000
> dev.ix.1.queue2.interrupt_rate: 20000
> dev.ix.1.queue3.interrupt_rate: 20000
> dev.ix.1.queue4.interrupt_rate: 20000
> dev.ix.1.queue5.interrupt_rate: 20000
> dev.ix.0.enable_aim: 0
> dev.ix.1.enable_aim: 0
> dev.ix.2.enable_aim: 0
> dev.ix.3.enable_aim: 0
> hw.ix.num_queues: 6
>
> We also happen to have an I210-based system with only 4 hardware queues;
> it would be interesting to see how it stacks up.
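For reference, a minimal sketch of making the settings above persistent, assuming the hw.ix.* names are boot-time loader tunables and the per-queue dev.ix.* entries are writable sysctls (their prefixes suggest as much; the device and queue indices are taken from the list above):

    # /boot/loader.conf -- read at boot, before the driver attaches
    hw.ix.max_interrupt_rate=20000
    hw.ix.num_queues=6

    # at runtime (e.g. from an rc script): disable AIM and pin every queue
    # to 20000 interrupts/s, which bounds coalescing latency to 1/20000 s = 50 us
    for d in 0 1; do
        sysctl dev.ix.${d}.enable_aim=0
        for q in 0 1 2 3 4 5; do
            sysctl dev.ix.${d}.queue${q}.interrupt_rate=20000
        done
    done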
> On Wed, Aug 12, 2015 at 5:23 AM, Luigi Rizzo wrote:
>
>> As I was telling Maxim, you should disable AIM because it only matches
>> the max interrupt rate to the average packet size, which is the last
>> thing you want.
>>
>> Setting the interrupt rate with sysctl (one per queue) gives you precise
>> control over the max rate (and hence the extra latency). 20k interrupts/s
>> give you 50us of latency, and the 2k slots in the queue are still enough
>> to absorb a burst of min-sized frames hitting a single queue (the OS will
>> start dropping long before that level, but that's another story).
>>
>> Cheers
>> Luigi
>>
>> On Wednesday, August 12, 2015, Babak Farrokhi wrote:
>>
>>> I ran into the same problem with almost the same hardware (Intel X520)
>>> on 10-STABLE. HT/SMT is disabled and the cards are configured with 8
>>> queues, with the same sysctl tunings as sobomax@ did. I am not using
>>> lagg, and no FLOWTABLE.
>>>
>>> I experimented with pmcstat (RESOURCE_STALLS) a while ago and here [1]
>>> [2] you can see the results, including pmc output, callchain, flamegraph
>>> and gprof output.
>>>
>>> I am experiencing a huge number of interrupts under a 200 kpps load:
>>>
>>> # sysctl dev.ix | grep interrupt_rate
>>> dev.ix.1.queue7.interrupt_rate: 125000
>>> dev.ix.1.queue6.interrupt_rate: 6329
>>> dev.ix.1.queue5.interrupt_rate: 500000
>>> dev.ix.1.queue4.interrupt_rate: 100000
>>> dev.ix.1.queue3.interrupt_rate: 50000
>>> dev.ix.1.queue2.interrupt_rate: 500000
>>> dev.ix.1.queue1.interrupt_rate: 500000
>>> dev.ix.1.queue0.interrupt_rate: 100000
>>> dev.ix.0.queue7.interrupt_rate: 500000
>>> dev.ix.0.queue6.interrupt_rate: 6097
>>> dev.ix.0.queue5.interrupt_rate: 10204
>>> dev.ix.0.queue4.interrupt_rate: 5208
>>> dev.ix.0.queue3.interrupt_rate: 5208
>>> dev.ix.0.queue2.interrupt_rate: 71428
>>> dev.ix.0.queue1.interrupt_rate: 5494
>>> dev.ix.0.queue0.interrupt_rate: 6250
>>>
>>> [1] http://farrokhi.net/~farrokhi/pmc/6/
>>> [2] http://farrokhi.net/~farrokhi/pmc/7/
>>>
>>> Regards,
>>> Babak
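One way to cross-check configured moderation values like these against what the system actually services is to diff the raw per-queue interrupt counters over a known interval; a rough sketch with vmstat(8), where the grep pattern is only a guess at how the driver names its handlers (e.g. "ix0:que 0"):

    # take two snapshots of the per-queue interrupt totals and compare
    vmstat -i | grep 'ix[01]:que'
    sleep 10
    vmstat -i | grep 'ix[01]:que'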
>>>
>>> Alexander V. Chernikov wrote:
>>> > 12.08.2015, 02:28, "Maxim Sobolev":
>>> >> Olivier, keep in mind that we are not "kernel forwarding" packets,
>>> >> but "app forwarding", i.e. the packet goes the full way
>>> >> net->kernel->recvfrom->app->sendto->kernel->net, which is why we have
>>> >> much lower PPS limits and which is why I think we are actually
>>> >> benefiting from the extra queues. Single-thread sendto() in a loop is
>>> >> CPU-bound at about 220K PPS, and while running the test I am observing
>>> >> that outbound traffic from one thread is mapped into a specific queue
>>> >> (well, a pair of queues on two separate adaptors, due to lagg load
>>> >> balancing action). And the peak performance of that test is at 7
>>> >> threads, which I believe corresponds to the number of queues. We have
>>> >> plenty of CPU cores in the box (24) with HTT/SMT disabled, and each
>>> >> queue is bound to a specific CPU. This leaves us with at least 8 CPUs
>>> >> fully capable of running our app. If you look at the CPU utilization,
>>> >> we are at about 10% when the issue hits.
>>> >
>>> > In any case, it would be great if you could provide some profiling
>>> > info, since there could be plenty of problematic places, starting from
>>> > TX ring contention to some locks inside udp or even the (in)famous
>>> > random entropy harvester... e.g. something like "pmcstat -TS
>>> > instructions -w1" might be sufficient to determine the reason.
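Expanding on that suggestion, a minimal hwpmc(4)-based sketch; the "instructions" event is just one choice (availability depends on the CPU), and the output paths are placeholders:

    kldload hwpmc                      # only if hwpmc is not compiled into the kernel
    # live, top(1)-like view driven by the "instructions" sampling event,
    # refreshed every second
    pmcstat -T -S instructions -w 1
    # or log samples for offline analysis (stop with Ctrl-C), then render
    # the callchains as a callgraph
    pmcstat -S instructions -O /tmp/pmc.out
    pmcstat -R /tmp/pmc.out -G /tmp/callgraph.txt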
>>> >
>>> >> ix0: <... 2.5.15> port 0x6020-0x603f mem
>>> >> 0xc7c00000-0xc7dfffff,0xc7e04000-0xc7e07fff irq 40 at device 0.0 on pci3
>>> >> ix0: Using MSIX interrupts with 9 vectors
>>> >> ix0: Bound queue 0 to cpu 0
>>> >> ix0: Bound queue 1 to cpu 1
>>> >> ix0: Bound queue 2 to cpu 2
>>> >> ix0: Bound queue 3 to cpu 3
>>> >> ix0: Bound queue 4 to cpu 4
>>> >> ix0: Bound queue 5 to cpu 5
>>> >> ix0: Bound queue 6 to cpu 6
>>> >> ix0: Bound queue 7 to cpu 7
>>> >> ix0: Ethernet address: 0c:c4:7a:5e:be:64
>>> >> ix0: PCI Express Bus: Speed 5.0GT/s Width x8
>>> >> 001.000008 [2705] netmap_attach success for ix0 tx 8/4096 rx 8/4096 queues/slots
>>> >> ix1: <... 2.5.15> port 0x6000-0x601f mem
>>> >> 0xc7a00000-0xc7bfffff,0xc7e00000-0xc7e03fff irq 44 at device 0.1 on pci3
>>> >> ix1: Using MSIX interrupts with 9 vectors
>>> >> ix1: Bound queue 0 to cpu 8
>>> >> ix1: Bound queue 1 to cpu 9
>>> >> ix1: Bound queue 2 to cpu 10
>>> >> ix1: Bound queue 3 to cpu 11
>>> >> ix1: Bound queue 4 to cpu 12
>>> >> ix1: Bound queue 5 to cpu 13
>>> >> ix1: Bound queue 6 to cpu 14
>>> >> ix1: Bound queue 7 to cpu 15
>>> >> ix1: Ethernet address: 0c:c4:7a:5e:be:65
>>> >> ix1: PCI Express Bus: Speed 5.0GT/s Width x8
>>> >> 001.000009 [2705] netmap_attach success for ix1 tx 8/4096 rx 8/4096 queues/slots
>>> >>
>>> >> On Tue, Aug 11, 2015 at 4:14 PM, Olivier Cochard-Labbé <olivier@cochard.me> wrote:
>>> >>
>>> >>> On Tue, Aug 11, 2015 at 11:18 PM, Maxim Sobolev <sobomax@freebsd.org> wrote:
>>> >>>
>>> >>>> Hi folks,
>>> >>>
>>> >>> Hi,
>>> >>>
>>> >>>> We've been trying to migrate some of our high-PPS systems to new
>>> >>>> hardware that has four X540-AT2 10G NICs, and observed that interrupt
>>> >>>> time goes through the roof after we cross around 200K PPS in and 200K
>>> >>>> out (two ports in LACP). The previous hardware was stable up to about
>>> >>>> 350K PPS in and 350K out. I believe the old one was equipped with the
>>> >>>> I350 and had an identical LACP configuration. The new box also has a
>>> >>>> better CPU with more cores (i.e. 24 cores vs. 16 cores before). The
>>> >>>> CPU itself is 2 x E5-2690 v3.
>>> >>>
>>> >>> 200K PPS, and even 350K PPS, are very low values indeed.
>>> >>> On an Intel Xeon L5630 (4 cores only) with one X540-AT2 (i.e. two
>>> >>> 10-Gigabit ports) I've reached about 1.8 Mpps (fastforwarding
>>> >>> enabled) [1].
>>> >>> But my setup didn't use lagg(4): can you disable the lagg
>>> >>> configuration and re-measure your performance without lagg?
>>> >>>
>>> >>> Do you let the Intel NIC driver use 8 queues per port too?
>>> >>> In my use case (forwarding smallest-size UDP packets), I obtained
>>> >>> better behaviour by limiting the NIC queues to 4 (hw.ix.num_queues or
>>> >>> hw.ixgbe.num_queues, I don't remember which) even though my system
>>> >>> had 8 cores. And this with Gigabit Intel [2] or Chelsio NICs [3].
>>> >>>
>>> >>> Don't forget to disable TSO and LRO too.
>>> >>>
>>> >>> Regards,
>>> >>>
>>> >>> Olivier
>>> >>>
>>> >>> [1] http://bsdrp.net/documentation/examples/forwarding_performance_lab_of_an_ibm_system_x3550_m3_with_10-gigabit_intel_x540-at2#graphs
>>> >>> [2] http://bsdrp.net/documentation/examples/forwarding_performance_lab_of_a_superserver_5018a-ftn4#graph1
>>> >>> [3] http://bsdrp.net/documentation/examples/forwarding_performance_lab_of_a_hp_proliant_dl360p_gen8_with_10-gigabit_with_10-gigabit_chelsio_t540-cr#reducing_nic_queues
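For completeness, a rough sketch of those last two suggestions, assuming the queue-count knob is the hw.ix.num_queues boot-time tunable mentioned above and that any lagg members would need the same interface flags:

    # /boot/loader.conf -- try fewer queues than cores for small-packet workloads
    hw.ix.num_queues=4

    # disable TSO and LRO on the physical ports
    ifconfig ix0 -tso -lro
    ifconfig ix1 -tso -lro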
>>
>> --
>> -----------------------------------------+-------------------------------
>> Prof. Luigi RIZZO, rizzo@iet.unipi.it  .  Dip. di Ing. dell'Informazione
>> http://www.iet.unipi.it/~luigi/        .  Universita` di Pisa
>> TEL    +39-050-2217533                 .  via Diotisalvi 2
>> Mobile +39-338-6809875                 .  56122 PISA (Italy)
>> -----------------------------------------+-------------------------------
>
> --
> Maksym Sobolyev
> Sippy Software, Inc.
> Internet Telephony (VoIP) Experts
> Tel (Canada): +1-778-783-0474
> Tel (Toll-Free): +1-855-747-7779
> Fax: +1-866-857-6942
> Web: http://www.sippysoft.com
> MSN: sales@sippysoft.com
> Skype: SippySoft