From owner-freebsd-hackers@FreeBSD.ORG  Sun Mar 29 05:46:55 2015
MIME-Version: 1.0
Sender: adrian.chadd@gmail.com
In-Reply-To: <20150329003354.GK23643@zxy.spb.ru>
References: <20150328201219.GF23643@zxy.spb.ru>
 <CAJ-Vmo=wecgoVYcS14gsOnT86p=HEMdao65aXTi7jLfVVyOELg@mail.gmail.com>
 <20150328221621.GG23643@zxy.spb.ru>
 <CAJ-Vmomd6Z5Ou7cvV1Kg4m=X2907507hqKMWiz6ssZ45Pi_-Dg@mail.gmail.com>
 <20150328224634.GH23643@zxy.spb.ru>
 <CAJ-VmokwGgHGP6AjBcGbyJShBPX6dyJjjNeCBcjxLi1obaiRtQ@mail.gmail.com>
 <20150328230533.GI23643@zxy.spb.ru>
 <CAJ-VmongWE_z7Rod8-SoFmyiLqiTbHtSaAwjgAs05L_Z3jrWXA@mail.gmail.com>
 <20150328234116.GJ23643@zxy.spb.ru>
 <CAJ-VmokSHHm3kMwz=bp7VbgZwADD2_pEr27NdzUfkGq1U=x_sw@mail.gmail.com>
 <20150329003354.GK23643@zxy.spb.ru>
Date: Sat, 28 Mar 2015 22:46:54 -0700
Message-ID: <CAJ-VmonmeeTaSOpSJCJTP7yTeno1LEt-dt9wEhNHk36oY6yY7Q@mail.gmail.com>
Subject: Re: irq cpu binding
From: Adrian Chadd <adrian@freebsd.org>
To: Slawa Olhovchenkov <slw@zxy.spb.ru>
Content-Type: text/plain; charset=UTF-8
Cc: "freebsd-hackers@freebsd.org" <freebsd-hackers@freebsd.org>

On 28 March 2015 at 17:33, Slawa Olhovchenkov <slw@zxy.spb.ru> wrote:
> On Sat, Mar 28, 2015 at 04:58:53PM -0700, Adrian Chadd wrote:
>
>> Hi,
>>
>> * It turns out that fragments were being 100% handled out of order
>> (compared to non-fragments in the same stream) when doing fragment
>> reassembly, because the current system was assuming direct dispatch
>> netisr and not checking any packet contents for whether they're on the
>> wrong CPU. I checked. It's not noticeable unless you go digging, but
>> it's absolutely happening. That's why I spun a lot of cycles looking
>> at the IP fragment reassembly path and which methods get called on the
>> frames as they're reinjected.
>
> In the case of a fragmented packet, the first fragment (which may not
> arrive first) contains the L4 information and is dispatched to the
> correct bucket; the other fragments don't contain this information and
> are dispatched anywhere. As I understand it, the IP stack gathers the
> whole packet before processing. All we need is to do the processing on
> the CPU where the first fragment arrived.

I'm pretty sure that wasn't what was happening when I went digging. I
was using UDP and varying the transmit size so I had exact control
over the fragmentation.

The driver rx path does direct dispatch netisr processing, and fragments
were hashed on L3 details only, not L4. Even the first fragment is
hashed on L3 only. So fragments go to a different queue than L4
hashing would pick, and subsequent fragments come in on that same
queue. Once reassembly completed, the packet was processed up the
stack inline; it wasn't going back into netisr and getting re-checked
for the right queue.
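
To make the mismatch concrete, here is a minimal sketch of the queue
selection described above. It is an illustration only, not the kernel or
driver code: the key is an arbitrary 40-byte value, the 8-bucket mask
and the packet dictionaries are assumptions, and the Toeplitz function
is a plain software version of the hash that RSS hardware computes.

```python
import socket
import struct

# Arbitrary 40-byte key, for illustration only (not any NIC's real key).
KEY = bytes(range(1, 41))


def toeplitz_hash(key: bytes, data: bytes) -> int:
    """Software Toeplitz hash: for every set input bit n (MSB first),
    XOR in the 32-bit window of the key that starts at key bit n."""
    key_int = int.from_bytes(key, "big")
    key_bits = len(key) * 8
    result = 0
    for i, byte in enumerate(data):
        for b in range(8):
            if byte & (0x80 >> b):
                shift = key_bits - 32 - (i * 8 + b)
                result ^= (key_int >> shift) & 0xFFFFFFFF
    return result


def l3_input(src: str, dst: str) -> bytes:
    """2-tuple hash input: source and destination IPv4 address."""
    return socket.inet_aton(src) + socket.inet_aton(dst)


def l4_input(src: str, dst: str, sport: int, dport: int) -> bytes:
    """4-tuple hash input: addresses followed by the two ports."""
    return l3_input(src, dst) + struct.pack("!HH", sport, dport)


def bucket_for_packet(pkt: dict, nbuckets: int = 8) -> int:
    """Fragments are hashed on L3 only; unfragmented packets on L4.
    nbuckets is assumed to be a power of two."""
    if pkt["fragment"]:
        data = l3_input(pkt["src"], pkt["dst"])
    else:
        data = l4_input(pkt["src"], pkt["dst"], pkt["sport"], pkt["dport"])
    return toeplitz_hash(KEY, data) & (nbuckets - 1)


# One UDP flow: a small datagram, then a large one split into fragments.
flow = {"src": "192.0.2.1", "dst": "192.0.2.2", "sport": 4000, "dport": 53}
whole = dict(flow, fragment=False)
frags = [dict(flow, fragment=True) for _ in range(3)]

print("unfragmented bucket:", bucket_for_packet(whole))
print("fragment buckets:   ", [bucket_for_packet(f) for f in frags])
```

All fragments agree with each other, but nothing forces their bucket to
match the one chosen for unfragmented packets of the same flow, which is
how a reassembled datagram can end up processed on the wrong CPU.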

>> * We're going to have to modify drivers, because the way drivers
>> currently assign interrupts, pick CPUs for queues, auto-select how
>> many queues to use, etc is all completely ad hoc and not consistent. So
>
> Yes. I don't see a problem (except re-binding IRQs by cpuset).
> All the interesting drivers provide a tunable to control how many
> queues to use. I don't know how to automate this:
>
> - one 1-port card
> - one 2-port card
> - one port of 2-port card
> - two 1-port cards
> - two different cards
> ....
>
> Manual selection is acceptable here.
>
>> yeah, we're going to change the drivers and they're going to be
>> consistent and configurable. That way you can choose how you want to
>> distribute work and pin or not pin things - and it's not done ad hoc
>> differently in each driver. Even igb, ixgbe and cxgbe differ in how
>> they implement these three things.
>>
>> * For RSS, there'll be a consistent configuration for what the
>> hardware is doing with hashing, rather than it being driver dependent.
>> Again, otherwise you may end up with some NICs doing 2-tuple hashing
>> where others are doing 4-tuple hashing, and behaviour changes
>> dramatically based on what NIC you're using.
>
> What's the problem there?
> I'm not interested in how the NIC does hashing (anyway, hashing for
> direct and reverse traffic is different -- this is not Tilera).
> All I need is to distribute flows to CPUs, to balance load and reduce
> lock contention.

Right, but you assume all packets in a flow go to the same CPU, and I
discovered this wasn't the case.
That's why I went down the path with RSS to make it right.

>
>> * For applications - I'm not sure yet, but at the minimum the librss
>> API I have vaguely sketched out and coded up in a git branch lets you
>> pull out the list of buckets and which CPU it's on. I'm going to
>> extend that a bit more, but it should be enough for things like nginx
>> to say "ok, start up one nginx process per RSS bucket, and here's the
>> CPU set for it to bind to." You said it has worker groups - that's
>> great; I want that to be auto configured.
>
> For applications, the minimum is that (per socket) select/kqueue/accept
> work only for flows that arrived on the CPU matching the CPU at the
> time of select/kqueue/accept (yes, to work correctly the application
> must be pinned to this CPU).
>
> And the application doesn't need to know anything about buckets etc.
>
> After this, an arriving packet activates the IRQ handler, ithread,
> driver interrupt thread, TCP stack, select/accept, read, write,
> tcp_output -- all on the same CPU. I may be wrong, but this saves
> L2/L3 cache.
>
> Where did I misunderstand?

The other half of the network stack - the sending side - also needs to
be either on the same or nearby CPU, or you still end up with lock
contention and cache thrashing.
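
As a rough sketch of the "one worker per RSS bucket" idea quoted above,
the snippet below turns a bucket-to-CPU map into per-worker launch
commands, so that receive and send for a bucket stay on one CPU. The map
contents, the plan_workers helper and the worker command are all
hypothetical; in practice the map would come from whatever the RSS
configuration reports. The only real tool assumed is FreeBSD's
cpuset(1) with its -l cpu-list option.

```python
def plan_workers(bucket_to_cpu: dict, worker_cmd: list) -> list:
    """Build one launch entry per RSS bucket, pinned to that bucket's CPU.

    bucket_to_cpu: hypothetical map of RSS bucket id -> CPU id.
    worker_cmd:    hypothetical argv of the worker process to start.
    """
    plan = []
    for bucket_id in sorted(bucket_to_cpu):
        cpu = bucket_to_cpu[bucket_id]
        plan.append({
            "bucket": bucket_id,
            "cpu": cpu,
            # cpuset -l <cpu-list> <cmd ...> runs cmd restricted to those CPUs.
            "argv": ["cpuset", "-l", str(cpu), *worker_cmd],
        })
    return plan


# Made-up example: four buckets spread over four CPUs.
example_map = {0: 0, 1: 1, 2: 2, 3: 3}
for entry in plan_workers(example_map, ["./worker", "--bucket-aware"]):
    print(entry["bucket"], " ".join(entry["argv"]))
```

This only builds the plan; nothing is executed or pinned by the snippet
itself.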


-a