From: Maxim Sobolev <sobomax@sippysoft.com>
Date: Tue, 11 Aug 2015 16:28:28 -0700
Subject: Re: Poor high-PPS performance of the 10G ixgbe(9) NIC/driver in FreeBSD 10.1
To: Olivier Cochard-Labbé
Cc: FreeBSD Net, freebsd@intel.com, Jev Björsell

Olivier,

Keep in mind that we are not "kernel forwarding" packets but "app forwarding": each packet takes the full path net -> kernel -> recvfrom() -> app -> sendto() -> kernel -> net. That is why our PPS limits are much lower, and why I think we actually benefit from the extra queues. A single-threaded sendto() loop is CPU-bound at about 220K PPS, and while running that test I can see that the outbound traffic from one thread is mapped to a specific queue (well, to a pair of queues on the two separate adapters, due to lagg load balancing). Peak performance of that test is reached at 7 threads, which I believe corresponds to the number of queues. We have plenty of CPU cores in the box (24, with HTT/SMT disabled), and each queue is bound to its own CPU. That still leaves at least 8 CPUs fully available to run our app. Looking at CPU utilization, we are only at about 10% when the issue hits.
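For context, the per-thread ceiling above was measured with something along the lines of the following single-threaded sendto() loop (a minimal sketch, not our actual app; the sink address, port and payload size below are made up for illustration):

    /*
     * Minimal sketch of a single-threaded UDP blast loop, the kind of test
     * that hits the ~220K PPS per-thread ceiling described above.
     * TARGET_IP/TARGET_PORT/PKT_LEN are hypothetical values.
     */
    #include <sys/types.h>
    #include <sys/socket.h>
    #include <netinet/in.h>
    #include <arpa/inet.h>
    #include <stdio.h>
    #include <string.h>
    #include <unistd.h>

    #define TARGET_IP   "10.0.0.2"   /* hypothetical test sink */
    #define TARGET_PORT 5060
    #define PKT_LEN     172          /* small, RTP-sized payload */
    #define NPKTS       10000000

    int
    main(void)
    {
        char buf[PKT_LEN];
        struct sockaddr_in dst;
        int s, i;

        memset(buf, 0, sizeof(buf));
        memset(&dst, 0, sizeof(dst));
        dst.sin_family = AF_INET;
        dst.sin_port = htons(TARGET_PORT);
        inet_pton(AF_INET, TARGET_IP, &dst.sin_addr);

        if ((s = socket(AF_INET, SOCK_DGRAM, 0)) < 0) {
            perror("socket");
            return (1);
        }
        /*
         * All packets leave from one socket/thread; this matches the
         * observation above that one thread's outbound traffic ends up
         * on a single TX queue (pair of queues with lagg).
         */
        for (i = 0; i < NPKTS; i++) {
            if (sendto(s, buf, sizeof(buf), 0,
                (struct sockaddr *)&dst, sizeof(dst)) < 0) {
                perror("sendto");
                break;
            }
        }
        close(s);
        return (0);
    }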
ix0: port 0x6020-0x603f mem 0xc7c00000-0xc7dfffff,0xc7e04000-0xc7e07fff irq 40 at device 0.0 on pci3
ix0: Using MSIX interrupts with 9 vectors
ix0: Bound queue 0 to cpu 0
ix0: Bound queue 1 to cpu 1
ix0: Bound queue 2 to cpu 2
ix0: Bound queue 3 to cpu 3
ix0: Bound queue 4 to cpu 4
ix0: Bound queue 5 to cpu 5
ix0: Bound queue 6 to cpu 6
ix0: Bound queue 7 to cpu 7
ix0: Ethernet address: 0c:c4:7a:5e:be:64
ix0: PCI Express Bus: Speed 5.0GT/s Width x8
001.000008 [2705] netmap_attach success for ix0 tx 8/4096 rx 8/4096 queues/slots
ix1: port 0x6000-0x601f mem 0xc7a00000-0xc7bfffff,0xc7e00000-0xc7e03fff irq 44 at device 0.1 on pci3
ix1: Using MSIX interrupts with 9 vectors
ix1: Bound queue 0 to cpu 8
ix1: Bound queue 1 to cpu 9
ix1: Bound queue 2 to cpu 10
ix1: Bound queue 3 to cpu 11
ix1: Bound queue 4 to cpu 12
ix1: Bound queue 5 to cpu 13
ix1: Bound queue 6 to cpu 14
ix1: Bound queue 7 to cpu 15
ix1: Ethernet address: 0c:c4:7a:5e:be:65
ix1: PCI Express Bus: Speed 5.0GT/s Width x8
001.000009 [2705] netmap_attach success for ix1 tx 8/4096 rx 8/4096 queues/slots

On Tue, Aug 11, 2015 at 4:14 PM, Olivier Cochard-Labbé wrote:

> On Tue, Aug 11, 2015 at 11:18 PM, Maxim Sobolev wrote:
>
>> Hi folks,
>
> Hi,
>
>> We've been trying to migrate some of our high-PPS systems to new hardware
>> that has four X540-AT2 10G NICs, and observed that interrupt time goes
>> through the roof after we cross around 200K PPS in and 200K out (two ports
>> in LACP). The previous hardware was stable up to about 350K PPS in and
>> 350K out. I believe the old box was equipped with the I350 and had an
>> identical LACP configuration. The new box also has a better CPU with more
>> cores (24 vs. 16 before). The CPU itself is 2 x E5-2690 v3.
>
> 200K PPS, and even 350K PPS, are very low values indeed.
> On an Intel Xeon L5630 (4 cores only) with one X540-AT2 (that is, two
> 10-gigabit ports) I've reached about 1.8Mpps with fastforwarding
> enabled [1].
> But my setup didn't use lagg(4): can you disable the lagg configuration
> and re-measure your performance without lagg?
>
> Do you let the Intel NIC driver use 8 queues per port too?
> In my use case (forwarding smallest-size UDP packets), I got better
> behaviour by limiting the NIC queues to 4 (hw.ix.num_queues or
> hw.ixgbe.num_queues, I don't remember which) when my system had 8 cores,
> both with gigabit Intel [2] and Chelsio NICs [3].
>
> Don't forget to disable TSO and LRO too.
>
> Regards,
>
> Olivier
>
> [1] http://bsdrp.net/documentation/examples/forwarding_performance_lab_of_an_ibm_system_x3550_m3_with_10-gigabit_intel_x540-at2#graphs
> [2] http://bsdrp.net/documentation/examples/forwarding_performance_lab_of_a_superserver_5018a-ftn4#graph1
> [3] http://bsdrp.net/documentation/examples/forwarding_performance_lab_of_a_hp_proliant_dl360p_gen8_with_10-gigabit_with_10-gigabit_chelsio_t540-cr#reducing_nic_queues
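P.S. For the record, the knobs Olivier mentions would translate to roughly the following on our boxes. This is just a sketch: as he notes, the exact tunable name depends on the driver version, and the lagg teardown shown here is only for the re-test without link aggregation.

    # /boot/loader.conf (takes effect after reboot)
    hw.ix.num_queues="4"    # or hw.ixgbe.num_queues with older ixgbe(4)

    # Disable TSO and LRO on both ports at run time
    ifconfig ix0 -tso -lro
    ifconfig ix1 -tso -lro

    # For the no-lagg test: destroy the lagg and put the address on ix0 directly
    ifconfig lagg0 destroy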