Date:      Fri, 17 Jan 2014 15:49:05 -0800
From:      Adrian Chadd <adrian@freebsd.org>
To:        "Alexander V. Chernikov" <melifaro@freebsd.org>
Cc:        "freebsd-net@freebsd.org" <freebsd-net@freebsd.org>
Subject:   Re: ECMP hash keys?
Message-ID:  <CAJ-Vmom2kK_=GOtTLuZ%2BoyUbOSP_p0JhRvnmH1h2KD1hGZ78UQ@mail.gmail.com>
In-Reply-To: <52D996FD.6090901@FreeBSD.org>
References:  <52D5138B.8050100@fsn.hu> <CA%2BP_MZFQU4%2B05Pk5cZ4NMZujD9vXDrV=mehN7_vz1OZ6r2-f1Q@mail.gmail.com> <52D6525D.50102@FreeBSD.org> <CAJ-VmomP-JaVopS0aneeV82OFtM1Pvb=qKn__mn=ooDXOdgmQw@mail.gmail.com> <52D84DB0.4050607@FreeBSD.org> <CAJ-Vmom7ui1_vKZnp3PLfmEdF62Eheph2Nj2t38_mUA%2B2WMEZA@mail.gmail.com> <52D996FD.6090901@FreeBSD.org>

On 17 January 2014 12:47, Alexander V. Chernikov <melifaro@freebsd.org> wrote:
> On 17.01.2014 02:08, Adrian Chadd wrote:
>> The reason you need to make sure that you end up with hashes for both
>> src,dst and dst,src being equivalent is to ensure that when you create
>> an outbound socket, you know up front which path the receive path is
>> going to come back on. Right now we don't mark new connections -
>> inbound or outbound - with a flowid until we've received some data on
>> it.
> Well, this seems reasonable.
>
> However, how do you plan to interact with hardware RSS?

Well, if it's doing Toeplitz in hardware, we'll just use that.
DragonFly BSD does this. They program the RSS registers on startup to
map parts of the RSS space to CPUs as required.

But if it isn't, we will have to do our own Toeplitz hashing in software.
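
For reference, software Toeplitz is small. Below is a rough sketch of
the bit-serial form, not code from FreeBSD or DragonFly BSD; the
function name and argument layout are my own assumptions. It XORs a
sliding 32-bit window of the key into the result for every set bit of
the input, walking the input most-significant bit first.

```c
#include <stddef.h>
#include <stdint.h>

/*
 * Software Toeplitz hash sketch (hypothetical helper).
 * key:  RSS secret key; hardware commonly uses 40 bytes.
 * data: input tuple in network byte order, e.g. for TCP/IPv4:
 *       src addr, dst addr, src port, dst port (12 bytes).
 * Key bits beyond keylen are treated as zero.
 */
static uint32_t
toeplitz_hash(const uint8_t *key, size_t keylen,
    const uint8_t *data, size_t datalen)
{
	uint32_t hash = 0;
	uint32_t window = 0;
	size_t i;
	int b;

	/* Start with the first 32 bits of the key. */
	for (i = 0; i < 4; i++)
		window = (window << 8) | (i < keylen ? key[i] : 0);

	for (i = 0; i < datalen; i++) {
		for (b = 7; b >= 0; b--) {
			/* A set input bit contributes the current window. */
			if (data[i] & (1u << b))
				hash ^= window;
			/* Slide the window left by one key bit. */
			window <<= 1;
			if (i + 4 < keylen && (key[i + 4] & (1u << b)))
				window |= 1;
		}
	}
	return (hash);
}
```

One property worth noting for the src,dst versus dst,src question: if
the key is a repeating 16-bit pattern, this hash comes out the same
when the addresses and ports are swapped, because the swapped fields
sit at offsets that differ by a multiple of 16 bits.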

I thought the majority of NICs these days do the Toeplitz calculation
in hardware anyway.

> For example, currently Intel used to set flowid to cpu number (which can
> be reasonable in some cases). Afair 82599 advanced RX descriptor
> contains original value that can be extracted, but we can't change cpu
> on which packet arrives on (well, we can reprogram indirection table, but..)

Well, that's the point, right?

> I can't see any easy way to accomplish custom SW RSS:
>
> We can possibly have 1-2-4 ingress HW queues per NIC, ignore driver
> flowid, re-calculate with modified Toeplitz or similar and push to other
> ncpu-1 netisr queues. That can work, but requires custom setup
> (especially for lagg scenarios) and works well for small subset of
> workloads.

Well, lagg is the same but different.

I.e., we still choose the outbound TX queue on _a_ NIC based on the
CPU/netisr derived from flowid.

But the outbound NIC has to be chosen a different way or you end up
with sub-optimal TX queue selection. Scott found this @ Netflix and
this is why lagg now doesn't use the low bits of the flowid when it
chooses which port to send _out_ on.
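
To illustrate the idea (a sketch only, not the lagg code; the function
names and the shift of 16 are assumptions for the example): if both
the port and the TX queue are picked from the low bits of flowid, then
with 2 ports and 4 queues every flow that lands on port 0 has an even
flowid and can only ever hit queues 0 and 2. Taking the port from
higher bits decouples the two choices.

```c
#include <stdint.h>

/* Hypothetical shift; skips the low bits used for queue selection. */
#define FLOWID_SHIFT 16

/* TX queue on the chosen NIC: derived from the low bits of flowid. */
static unsigned int
pick_tx_queue(uint32_t flowid, unsigned int nqueues)
{
	return (flowid % nqueues);
}

/* lagg member port: derived from the bits above FLOWID_SHIFT. */
static unsigned int
pick_lagg_port(uint32_t flowid, unsigned int nports)
{
	return ((flowid >> FLOWID_SHIFT) % nports);
}
```

With this split, flowid 0x00010000 goes out port 1 on queue 0, and
flowid 0x00000001 goes out port 0 on queue 1, so a port is no longer
tied to a subset of its queues.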

> It seems also guessing ingress flowid is not very much different between
> symmetric and asymmetric hashing approaches.

I think the problem here is that flowid has been a mostly-opaque value
for way too long.

I like the DragonFly BSD approach - they added a hashid, not flowid,
and the netisr path checks to see if the driver has stamped it with a
hardware Toeplitz hashid or not. If not, it does its own hashing and
punts the frame to the correct netisr RX queue on the right CPU.
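
Roughly, that dispatch decision looks like the sketch below. This is
my paraphrase, not DragonFly BSD's code: struct pkt, its fields and
netisr_pick_cpu() are hypothetical, and sw_hash() is an FNV-1a
stand-in used only to keep the example short. A real stack would need
the software hash to be Toeplitz with the same key the NIC uses, so
that hardware-hashed and software-hashed packets of one flow land on
the same CPU.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical per-packet metadata. */
struct pkt {
	int      hash_valid;	/* set by the driver if hardware hashed it */
	uint32_t hash;		/* the hashid */
	uint8_t  tuple[12];	/* src addr, dst addr, src port, dst port */
};

/* Stand-in software hash (FNV-1a), NOT Toeplitz. */
static uint32_t
sw_hash(const uint8_t *data, size_t len)
{
	uint32_t h = 2166136261u;
	size_t i;

	for (i = 0; i < len; i++) {
		h ^= data[i];
		h *= 16777619u;
	}
	return (h);
}

/* Pick the netisr CPU, hashing in software only if the driver did not. */
static unsigned int
netisr_pick_cpu(struct pkt *p, unsigned int ncpus)
{
	if (!p->hash_valid) {
		p->hash = sw_hash(p->tuple, sizeof(p->tuple));
		p->hash_valid = 1;
	}
	return (p->hash % ncpus);
}
```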

For routing it may not matter as much; we could just short-circuit
that so it runs on the current CPU all the way to transmit.

For NAT, it may be worthwhile keeping the per-flow state local on a
given CPU to exploit various cache/lock coherencies.

I guess the fallout from all of this is that I'd rather we had
better-specified things like "what is flowid", "how can we specify
affinity", etc, so we can use it if we want, and not use it if we
don't. Right now we have a "kind of but not quite done" way of doing
affinity, enough to mostly not break TCP/UDP flow ordering, but not
enough to really exploit affinity.



-a


