Date: Fri, 5 Jan 2018 16:30:00 -0700 From: Scott Long <scottl@samsco.org> To: Eugene Grosbein <eugen@grosbein.net> Cc: Matt Joras <mjoras@freebsd.org>, Steven Hartland <steven@multiplay.co.uk>, src-committers <src-committers@freebsd.org>, svn-src-all@freebsd.org, svn-src-head@freebsd.org, scottl@freebsd.org Subject: Re: svn commit: r327559 - in head: . sys/net Message-ID: <F8706640-0C0C-42F2-92C4-D8CEA18606EF@samsco.org> In-Reply-To: <5A4FC1E4.9060301@grosbein.net> References: <201801042005.w04K5liB049411@repo.freebsd.org> <5A4E9397.9000308@grosbein.net> <f133b587-1f7e-4594-31d1-974775ad55be@freebsd.org> <5A4EDC62.50508@grosbein.net> <c42bfb9e-e583-aca3-bf0d-4d92c0153d2f@multiplay.co.uk> <5A4F824C.1060405@grosbein.net> <97d173fb-4f47-609d-8319-07282a283473@multiplay.co.uk> <CADdTf%2Bi47-X1HVb=JBFi53wOgbX1x1euHr53-YyqNDDFNF9PLw@mail.gmail.com> <5A4FB6BC.6090506@grosbein.net> <CADdTf%2BjJVn_X6UNywnis-KXq6Dua7B8F=2kuND_N6bJ1ZH0D_A@mail.gmail.com> <5A4FC1E4.9060301@grosbein.net>
next in thread | previous in thread | raw e-mail | index | archive | help
> On Jan 5, 2018, at 11:20 AM, Eugene Grosbein <eugen@grosbein.net> = wrote: >=20 > CC'ng scottl@ as author of the change in question. >=20 > 06.01.2018 0:39, Matt Joras wrote: >=20 >>>> For what it's worth, this was the conclusion I came to, and at = Isilon >>>> we've made the same change being discussed here. For the case of >>>> drivers that end up using a queue index for the flowid, you end up >>>> with pathological behavior on the lagg; the flowid ends up getting >>>> right shifted by (default) 16. So in the case of e.g. two bxe(4) >>>> interfaces with 4 queues, you always end up choosing the interface = in >>>> the lagg at index 0. >=20 > Then why does if_lagg shifts 16 bits by default? Is seems senseless. > This was introduced with r260070 by scottl: >=20 At the time, we were using cxgbe interfaces which inserted a reasonable = RSS hash into the flowid field. The shift was done to expose different bits = to the index/modulo scheme used by the LACP module vs the cxgbe module. In hindsight, I should not have set a default value of 16, I should have = left it at zero so that default behavior would be preserved. >> Multi-queue NIC drivers and multi-port lagg tend to use the same = lower >> bits of the flowid as each other, resulting in a poor distribution of >> packets among queues in certain cases. Work around this by adding a >> set of sysctls for controlling a bit-shift on the flowid when doing >> multi-port aggrigation in lagg and lacp. By default, lagg/lacp will >> now use bits 16 and higher instead of 0 and higher. >>=20 >> Reviewed by: max >> Obtained from: Netflix >> MFC after: 3 days >=20 > This commit message does not point to an example of NIC driver that = would set > non-zero bits 16 and higher for flowid so that shift result would be = non-zero. > Do we really have such a driver? >=20 Yes. > Anyway, this lagg's default seems to be very driver-centric. >=20 > For example, Intel driver family also do not use such high bits for = flowid > just like mentioned bxe(4). >=20 scottl@moe:~/svn/head/sys/dev % grep -r iri_flowid * bnxt/bnxt_txrx.c: ri->iri_flowid =3D = le32toh(rcp->rss_hash); bnxt/bnxt_txrx.c: ri->iri_flowid =3D = le32toh(tpas->low.rss_hash); e1000/em_txrx.c: ri->iri_flowid =3D = le32toh(rxd->wb.lower.hi_dword.rss); e1000/igb_txrx.c: ri->iri_flowid =3D ixgbe/ix_txrx.c: ri->iri_flowid =3D = le32toh(rxd->wb.lower.hi_dword.rss); The number of drivers that set m_pkhhdr.flowid directly to an RSS hash = looks to be: cxgb cxgbe mlx4 mlx5 qlnx qlxgbe qlxge vmxnet3 Maybe the hardware doesn=E2=80=99t do a great job with generating a = useful RSS hash, but that=E2=80=99s tangential to this conversation. > We should change flowid_shift default to 0 for if_lagg(4), shouldn't = we? >=20 In the short term, yes. What I see is that it=E2=80=99s too expensive = to do a hash calculation on every TX packet in LACP (for anything resembling line rate), and = flowid is unreliable when a connection is initiated without an RX packet triggering it. I = don=E2=80=99t understand the consequences of the TX initiation problem well enough to offer a = solution. For the problem of flowid being used inconsistently by drivers (i.e. not = populating all 32 bits or using a weak RSS), that=E2=80=99s really a driver problem. What I=E2=80=99d recommend is that the LACP code check for = M_HASHTYPE_NONE and M_HASHTYPE_OPAQUE and calculate a new hash if either are set = (effectively ignoring the flowid). It=E2=80=99s then up to the drivers to set the = correct hash type that corresponds with what they=E2=80=99re putting into the flowid field. An = opaque type would mean that the value is useful to the driver but should not be considered = useful anywhere else. You=E2=80=99ll get more correct and less surprising = behavior from that, at the penalty of every TX packet requiring a hash calculation for = hardware/drivers that are crummy. Scott
Want to link to this message? Use this URL: <https://mail-archive.FreeBSD.org/cgi/mid.cgi?F8706640-0C0C-42F2-92C4-D8CEA18606EF>