Skip site navigation (1)Skip section navigation (2)
Date:      Fri, 5 Jan 2018 16:30:00 -0700
From:      Scott Long <scottl@samsco.org>
To:        Eugene Grosbein <eugen@grosbein.net>
Cc:        Matt Joras <mjoras@freebsd.org>, Steven Hartland <steven@multiplay.co.uk>, src-committers <src-committers@freebsd.org>, svn-src-all@freebsd.org, svn-src-head@freebsd.org, scottl@freebsd.org
Subject:   Re: svn commit: r327559 - in head: . sys/net
Message-ID:  <F8706640-0C0C-42F2-92C4-D8CEA18606EF@samsco.org>
In-Reply-To: <5A4FC1E4.9060301@grosbein.net>
References:  <201801042005.w04K5liB049411@repo.freebsd.org> <5A4E9397.9000308@grosbein.net> <f133b587-1f7e-4594-31d1-974775ad55be@freebsd.org> <5A4EDC62.50508@grosbein.net> <c42bfb9e-e583-aca3-bf0d-4d92c0153d2f@multiplay.co.uk> <5A4F824C.1060405@grosbein.net> <97d173fb-4f47-609d-8319-07282a283473@multiplay.co.uk> <CADdTf%2Bi47-X1HVb=JBFi53wOgbX1x1euHr53-YyqNDDFNF9PLw@mail.gmail.com> <5A4FB6BC.6090506@grosbein.net> <CADdTf%2BjJVn_X6UNywnis-KXq6Dua7B8F=2kuND_N6bJ1ZH0D_A@mail.gmail.com> <5A4FC1E4.9060301@grosbein.net>

next in thread | previous in thread | raw e-mail | index | archive | help


> On Jan 5, 2018, at 11:20 AM, Eugene Grosbein <eugen@grosbein.net> =
wrote:
>=20
> CC'ng scottl@ as author of the change in question.
>=20
> 06.01.2018 0:39, Matt Joras wrote:
>=20
>>>> For what it's worth, this was the conclusion I came to, and at =
Isilon
>>>> we've made the same change being discussed here. For the case of
>>>> drivers that end up using a queue index for the flowid, you end up
>>>> with pathological behavior on the lagg; the flowid ends up getting
>>>> right shifted by (default) 16. So in the case of e.g. two bxe(4)
>>>> interfaces with 4 queues, you always end up choosing the interface =
in
>>>> the lagg at index 0.
>=20
> Then why does if_lagg shifts 16 bits by default? Is seems senseless.
> This was introduced with r260070 by scottl:
>=20

At the time, we were using cxgbe interfaces which inserted a reasonable =
RSS
hash into the flowid field.  The shift was done to expose different bits =
to the
index/modulo scheme used by the LACP module vs the cxgbe module.  In
hindsight, I should not have set a default value of 16, I should have =
left it at
zero so that default behavior would be preserved.

>> Multi-queue NIC drivers and multi-port lagg tend to use the same =
lower
>> bits of the flowid as each other, resulting in a poor distribution of
>> packets among queues in certain cases.  Work around this by adding a
>> set of sysctls for controlling a bit-shift on the flowid when doing
>> multi-port aggrigation in lagg and lacp.  By default, lagg/lacp will
>> now use bits 16 and higher instead of 0 and higher.
>>=20
>> Reviewed by:    max
>> Obtained from:  Netflix
>> MFC after:      3 days
>=20
> This commit message does not point to an example of NIC driver that =
would set
> non-zero bits 16 and higher for flowid so that shift result would be =
non-zero.
> Do we really have such a driver?
>=20

Yes.

> Anyway, this lagg's default seems to be very driver-centric.
>=20
> For example, Intel driver family also do not use such high bits for =
flowid
> just like mentioned bxe(4).
>=20

scottl@moe:~/svn/head/sys/dev % grep -r iri_flowid *
bnxt/bnxt_txrx.c:		ri->iri_flowid =3D =
le32toh(rcp->rss_hash);
bnxt/bnxt_txrx.c:		ri->iri_flowid =3D =
le32toh(tpas->low.rss_hash);
e1000/em_txrx.c:	ri->iri_flowid =3D =
le32toh(rxd->wb.lower.hi_dword.rss);
e1000/igb_txrx.c:	ri->iri_flowid =3D
ixgbe/ix_txrx.c:	ri->iri_flowid =3D =
le32toh(rxd->wb.lower.hi_dword.rss);

The number of drivers that set m_pkhhdr.flowid directly to an RSS hash =
looks
to be:

cxgb
cxgbe
mlx4
mlx5
qlnx
qlxgbe
qlxge
vmxnet3

Maybe the hardware doesn=E2=80=99t do a great job with generating a =
useful RSS hash,
but that=E2=80=99s tangential to this conversation.

> We should change flowid_shift default to 0 for if_lagg(4), shouldn't =
we?
>=20

In the short term, yes.  What I see is that it=E2=80=99s too expensive =
to do a hash calculation
on every TX packet in LACP (for anything resembling line rate), and =
flowid is unreliable
when a connection is initiated without an RX packet triggering it.  I =
don=E2=80=99t understand
the consequences of the TX initiation problem well enough to offer a =
solution.  For the
problem of flowid being used inconsistently by drivers (i.e. not =
populating all 32 bits
or using a weak RSS), that=E2=80=99s really a driver problem.

What I=E2=80=99d recommend is that the LACP code check for =
M_HASHTYPE_NONE and
M_HASHTYPE_OPAQUE and calculate a new hash if either are set =
(effectively
ignoring the flowid).  It=E2=80=99s then up to the drivers to set the =
correct hash type that
corresponds with what they=E2=80=99re putting into the flowid field.  An =
opaque type would
mean that the value is useful to the driver but should not be considered =
useful
anywhere else.  You=E2=80=99ll get more correct and less surprising =
behavior from that, at
the penalty of every TX packet requiring a hash calculation for =
hardware/drivers
that are crummy.

Scott






Want to link to this message? Use this URL: <https://mail-archive.FreeBSD.org/cgi/mid.cgi?F8706640-0C0C-42F2-92C4-D8CEA18606EF>