From owner-freebsd-hackers@FreeBSD.ORG Sun Mar 29 08:19:04 2015
Date: Sun, 29 Mar 2015 11:19:02 +0300
From: Slawa Olhovchenkov <slw@zxy.spb.ru>
To: Adrian Chadd
Cc: "freebsd-hackers@freebsd.org"
Subject: Re: irq cpu binding
Message-ID: <20150329081902.GN23643@zxy.spb.ru>
References: <20150328221621.GG23643@zxy.spb.ru>
 <20150328224634.GH23643@zxy.spb.ru>
 <20150328230533.GI23643@zxy.spb.ru>
 <20150328234116.GJ23643@zxy.spb.ru>
 <20150329003354.GK23643@zxy.spb.ru>
MIME-Version: 1.0
Content-Type: text/plain; charset=us-ascii
Content-Disposition: inline
User-Agent: Mutt/1.5.23 (2014-03-12)
List-Id: Technical Discussions relating to FreeBSD

On Sat, Mar 28, 2015 at 10:46:54PM -0700, Adrian Chadd wrote:

> >> * It turns out that fragments were being 100% handled out of order
> >> (compared to non-fragments in the same stream) when doing fragment
> >> reassembly, because the current system was assuming direct dispatch
> >> netisr and not checking any packet contents for whether they're on
> >> the wrong CPU. I checked. It's not noticeable unless you go digging,
> >> but it's absolutely happening. That's why I spun a lot of cycles
> >> looking at the IP fragment reassembly path and which methods get
> >> called on the frames as they're reinjected.
> >
> > In the case of a fragmented packet, the first fragment (which may not
> > arrive first) contains the L4 information and is dispatched to the
> > correct bucket; the other fragments don't contain this information
> > and are dispatched anywhere. As I understand it, the IP stack gathers
> > the whole packet before processing it. All we need is to do the
> > processing on the CPU where the first fragment arrived.
>
> I'm pretty sure that wasn't what was happening when I went digging. I
> was using UDP and varying the transmit size so I had exact control
> over the fragmentation.
>
> The driver rx path does direct dispatch netisr processing, and for
> fragments it was hashed on only L3 details, not L4. Even the first
> frame is hashed on L3 only. So it'd go to a different queue compared
> to L4 hashing, and subsequent fragments would come in on the same
> queue. Once it was completed, it was processed up inline - it wasn't
> going back into netisr and getting re-checked for the right queue.

There are two options: 1) keep this behavior, or 2) rewrite it to do
rescheduling. I think 1) is acceptable -- fragmented packets are very
rare compared to the target data rate (2 Mpps and more).
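To illustrate the hashing point (a minimal sketch only, not the in-tree
netisr/RSS code; placeholder_hash() is a stand-in for the real keyed
Toeplitz hash): a non-first fragment simply has no L4 ports to hash, so
the dispatcher can only fall back to an L3 (address-only) hash, and that
bucket is in general not the one the flow's 4-tuple hash points at.

#include <sys/types.h>
#include <stdint.h>
#include <netinet/in.h>
#include <netinet/ip.h>
#include <netinet/udp.h>

/*
 * Placeholder hash so the sketch is self-contained (FNV-1a); the real
 * RSS code uses a Toeplitz hash with a configured key.
 */
static uint32_t
placeholder_hash(const void *key, size_t len)
{
	const unsigned char *p = key;
	uint32_t h = 2166136261u;

	while (len-- > 0)
		h = (h ^ *p++) * 16777619u;
	return (h);
}

static uint32_t
rx_pick_bucket(const struct ip *ip, u_int nbuckets)
{
	uint16_t off = ntohs(ip->ip_off);

	if ((off & (IP_MF | IP_OFFMASK)) == 0 &&
	    (ip->ip_p == IPPROTO_UDP || ip->ip_p == IPPROTO_TCP)) {
		/*
		 * Unfragmented datagram: the L4 ports are present, hash
		 * the 4-tuple.  (TCP ports sit at the same offsets as
		 * UDP ports, so struct udphdr is enough here.)
		 */
		const struct udphdr *uh = (const struct udphdr *)
		    ((const char *)ip + (ip->ip_hl << 2));
		struct {
			struct in_addr src, dst;
			uint16_t sport, dport;
		} k4 = { ip->ip_src, ip->ip_dst, uh->uh_sport, uh->uh_dport };

		return (placeholder_hash(&k4, sizeof(k4)) % nbuckets);
	}

	/*
	 * Fragment (and, in the behavior Adrian describes, even the first
	 * fragment): hash on the addresses only, so the chosen bucket
	 * generally differs from the 4-tuple bucket of the same flow.
	 */
	struct {
		struct in_addr src, dst;
	} k2 = { ip->ip_src, ip->ip_dst };

	return (placeholder_hash(&k2, sizeof(k2)) % nbuckets);
}

Option 2) above would mean re-running this selection on the reassembled
datagram, once the ports are known again, and rescheduling it if the
bucket differs.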
> > What's the problem there?
> > I'm not interested in how the NIC does the hashing (anyway, the
> > hashing for forward and reverse traffic is different -- this is not
> > Tilera). All I need is to distribute flows across CPUs, to balance
> > the load and reduce lock contention.
>
> Right, but you assume all packets in a flow go to the same CPU, and I
> discovered this wasn't the case.
> That's why I went down the path with RSS to make it right.

Is that only the fragmented-packets case, or other cases too?

> >> * For applications - I'm not sure yet, but at the minimum the librss
> >> API I have vaguely sketched out and coded up in a git branch lets you
> >> pull out the list of buckets and which CPU it's on. I'm going to
> >> extend that a bit more, but it should be enough for things like nginx
> >> to say "ok, start up one nginx process per RSS bucket, and here's the
> >> CPU set for it to bind to." You said it has worker groups - that's
> >> great; I want that to be auto configured.
> >
> > For applications the minimum is that (per socket) select/kqueue/accept
> > only returns flows that arrived on the CPU the process is on at the
> > time of the select/kqueue/accept call (yes, for this to work correctly
> > the application must be pinned to that CPU).
> >
> > And the application doesn't need to know anything about buckets, etc.
> >
> > After that, an arriving packet activates the IRQ handler, the ithread,
> > the driver interrupt thread, the TCP stack, select/accept, read,
> > write, tcp_output -- all on the same CPU. I may be wrong, but this
> > should preserve the L2/L3 caches.
> >
> > Where am I misunderstanding?
>
> The other half of the network stack - the sending side - also needs to
> be either on the same or a nearby CPU, or you still end up with lock
> contention and cache thrashing.

For incoming connections this will be automatic -- sending will happen
from the CPU bound to the receiving queue.

Outgoing connections are the more complex case, yes. They need FD
transfer (with re-binding) and signaling (from kernel to application)
about the preferred CPU; the preferred CPU is the one on which the
SYN-ACK arrived. And this needs assistance from the application. But at
the moment I can't think of an application that serves a massive number
of outgoing connections.
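To make the application-side model above concrete, here is a minimal
sketch of "one worker process per RSS bucket, pinned to that bucket's
CPU". The bucket-to-CPU table is invented for the example (that is the
part librss is supposed to report); only the pinning itself uses the
standard cpuset_setaffinity(2) interface.

#include <sys/param.h>
#include <sys/cpuset.h>
#include <sys/wait.h>

#include <err.h>
#include <stdio.h>
#include <unistd.h>

static void
worker_main(int bucket, int cpu)
{
	cpuset_t mask;

	/* Pin this worker to the CPU that services its RSS bucket. */
	CPU_ZERO(&mask);
	CPU_SET(cpu, &mask);
	if (cpuset_setaffinity(CPU_LEVEL_WHICH, CPU_WHICH_PID, -1,
	    sizeof(mask), &mask) != 0)
		err(1, "cpuset_setaffinity");

	/*
	 * From here on the worker would open its listen socket and run
	 * its kqueue/accept/read/write loop; with the pinning in place,
	 * that all happens on the same CPU that handles the queue's
	 * interrupts and TCP processing.
	 */
	printf("worker for bucket %d pinned to CPU %d\n", bucket, cpu);
	pause();
}

int
main(void)
{
	/* Invented bucket -> CPU map; librss would provide the real one. */
	const int bucket_cpu[] = { 0, 1, 2, 3 };
	const int nbuckets = sizeof(bucket_cpu) / sizeof(bucket_cpu[0]);
	int i;

	for (i = 0; i < nbuckets; i++) {
		pid_t pid = fork();

		if (pid == -1)
			err(1, "fork");
		if (pid == 0) {
			worker_main(i, bucket_cpu[i]);
			_exit(0);
		}
	}
	while (wait(NULL) > 0)
		;
	return (0);
}

For incoming connections that is all that is needed; the outgoing
case described above (re-binding the FD to the CPU where the SYN-ACK
arrived) would have to sit on top of this with kernel assistance.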