From owner-freebsd-net@freebsd.org Fri Jul 22 19:23:37 2016
From: Adrian Chadd <adrian.chadd@gmail.com>
Date: Fri, 22 Jul 2016 12:23:35 -0700
Subject: Re: proposal: splitting NIC RSS up from stack RSS
To: Sepherosa Ziehau
Cc: Andrew Gallatin, FreeBSD Net
References: <306af514-70ff-f3bf-5b4f-da7ac1ec6580@cs.duke.edu>
List-Id: Networking and TCP/IP with FreeBSD

On 21 July 2016 at 18:54, Sepherosa Ziehau wrote:
> On Fri, Jul 22, 2016 at 6:39 AM, Adrian Chadd wrote:
>> hi,
>>
>> Cool! Yeah, the RSS bits thing can be removed, as it's just doing a
>> bitmask instead of a % operator to do the mapping. I think we can just
>> go to %, and if people need the extra speed of a power-of-two
>> operation, they can reintroduce it.
>
> I thought about it a while ago (the most popular E5-2560v{1,2,3} only
> has 6 cores, but E5-2560v4 has 8 cores! :). Since the raw RSS hash
> value is '& 0x1f'-ed (I believe most of the NICs use the 128-entry
> indirection table defined by MS RSS) to select an entry in the
> indirection table, simply applying '%' to the raw RSS hash value
> probably will not work properly; you will need at least
> (hash & 0x1f) % mp_ncpus. And since the indirection table's size is
> 128, you will still get somewhat uneven CPU load for a
> non-power-of-2 CPU count. If you take CPU affinity into
> consideration, the situation becomes even more complex ...

hi,

Sure.
The biggest annoyance is that a lot of the kernel infrastructure for queueing packets (netisr) and scheduling stack work (callouts) is indexed by CPU, not by "thing". If it were indexed by "thing", then we could do a two-stage work redistribution method that scales O(1):

* packets get plonked into a "thing" via some mapping table - eg, map 128 or 256 buckets to queues that do work / schedule callouts / netisr; and
* the queues aren't tied to a CPU at that point, so they can get shuffled around by changing cpumasks.

It'd be really, really nice IMHO if netisr and callouts were "thing" based rather than CPU based, so we could shift work just by changing the CPU mask - then we don't have to worry about rescheduling packets or work onto the new CPU when we want to move load around. That doesn't risk out-of-order packet handling behaviour, and it means we can (in theory!) serve a given RSS bucket from more than one CPU, for things like TCP processing.

Trouble is, this is somewhat contentious. I could do the netisr change without upsetting people, but the callout code honestly makes me want to set everything (in sys/kern) on fire and start again. After all of the current issues with the callout subsystem, I'd rather see hps finish his work and land it in head, complete with more sensible lock semantics, before I look at breaking callouts out to not be per-CPU and instead let subsystems create their own worker pools for them. I'm sure NFS and CAM would like that kind of thing too.

Since people have asked me about this in the past: the side effect of supporting dynamic hash mapping (even in software) is that for any given flow, once you change the hash mapping you will have some of that flow's packets in the old queue and some in the new queue.
For things like stack TCP/UDP using pcbgroups, that can range from merely slow to (eventually, once the global list goes away) plainly not delivering packets to the right pcb/socket - which is okay for some workloads and not for others.

That may be a fun project to work on once the general stack / driver tidy-ups are done, but I'm going to resist doing it myself for a while, because it'll introduce the uncertainty above, and the resulting out-of-order behaviour will likely generate more problem reports than I want to handle. (Read: since I'm doing this for free, I'm not going to do anything risky, as I'm not getting paid to wade through the repercussions right now.)

FWIW, we hit this same problem in ye olde past with squid and WCCP's hash-based system. Squid's WCCP implementation was simple and static. The commercial solutions (read: cisco, etc) handled the cache set / hash traffic map changing by having the caches redirect traffic to the /old/ cache whenever the hash or cache set changed. Squid didn't do this out of the box, so if the cache topology changed it would send traffic to the wrong box and existing connections would break.

-adrian