Date: Sat, 6 Jan 2018 00:39:28 +0000
From: Steven Hartland <steven@multiplay.co.uk>
To: Scott Long <scottl@samsco.org>, Eugene Grosbein <eugen@grosbein.net>
Cc: Matt Joras <mjoras@freebsd.org>, src-committers <src-committers@freebsd.org>,
    svn-src-all@freebsd.org, svn-src-head@freebsd.org, scottl@freebsd.org
Subject: Re: svn commit: r327559 - in head: . sys/net
Message-ID: <8e86a984-b9f8-1f45-f4f6-61b4470afae1@multiplay.co.uk>
In-Reply-To: <F8706640-0C0C-42F2-92C4-D8CEA18606EF@samsco.org>
References: <201801042005.w04K5liB049411@repo.freebsd.org>
    <5A4E9397.9000308@grosbein.net>
    <f133b587-1f7e-4594-31d1-974775ad55be@freebsd.org>
    <5A4EDC62.50508@grosbein.net>
    <c42bfb9e-e583-aca3-bf0d-4d92c0153d2f@multiplay.co.uk>
    <5A4F824C.1060405@grosbein.net>
    <97d173fb-4f47-609d-8319-07282a283473@multiplay.co.uk>
    <CADdTf+i47-X1HVb=JBFi53wOgbX1x1euHr53-YyqNDDFNF9PLw@mail.gmail.com>
    <5A4FB6BC.6090506@grosbein.net>
    <CADdTf+jJVn_X6UNywnis-KXq6Dua7B8F=2kuND_N6bJ1ZH0D_A@mail.gmail.com>
    <5A4FC1E4.9060301@grosbein.net>
    <F8706640-0C0C-42F2-92C4-D8CEA18606EF@samsco.org>
On 05/01/2018 23:30, Scott Long wrote:
>
>> On Jan 5, 2018, at 11:20 AM, Eugene Grosbein <eugen@grosbein.net> wrote:
>>
>> CC'ing scottl@ as author of the change in question.
>>
>> 06.01.2018 0:39, Matt Joras wrote:
>>
>>>>> For what it's worth, this was the conclusion I came to, and at Isilon
>>>>> we've made the same change being discussed here. For the case of
>>>>> drivers that end up using a queue index for the flowid, you end up
>>>>> with pathological behavior on the lagg; the flowid ends up getting
>>>>> right shifted by (default) 16. So in the case of e.g. two bxe(4)
>>>>> interfaces with 4 queues, you always end up choosing the interface in
>>>>> the lagg at index 0.
>>
>> Then why does if_lagg shift 16 bits by default? It seems senseless.
>> This was introduced with r260070 by scottl:
>>
> At the time, we were using cxgbe interfaces which inserted a reasonable RSS
> hash into the flowid field. The shift was done to expose different bits to the
> index/modulo scheme used by the LACP module vs the cxgbe module. In
> hindsight, I should not have set a default value of 16, I should have left it at
> zero so that default behavior would be preserved.
>
>>> Multi-queue NIC drivers and multi-port lagg tend to use the same lower
>>> bits of the flowid as each other, resulting in a poor distribution of
>>> packets among queues in certain cases. Work around this by adding a
>>> set of sysctls for controlling a bit-shift on the flowid when doing
>>> multi-port aggrigation in lagg and lacp. By default, lagg/lacp will
>>> now use bits 16 and higher instead of 0 and higher.
>>>
>>> Reviewed by: max
>>> Obtained from: Netflix
>>> MFC after: 3 days
>>
>> This commit message does not point to an example of a NIC driver that would set
>> non-zero bits 16 and higher in the flowid, so that the shift result would be non-zero.
>> Do we really have such a driver?
>>
> Yes.
>
>> Anyway, this lagg default seems to be very driver-centric.
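To make the pathological case concrete, here is a minimal userland sketch (illustrative only, not the actual if_lagg/LACP code; the function name and signature are invented) of the port-selection arithmetic being discussed: lagg effectively picks an egress port as (flowid >> flowid_shift) % nports, so a driver that stores only a small RX queue index in the flowid always selects port 0 under the default shift of 16:

```c
#include <stdint.h>

/*
 * Illustrative stand-in for lagg/LACP port selection: shift the flowid
 * by flowid_shift, then index into the ports modulo nports.  If a
 * driver such as bxe(4) puts only its RX queue index (0..3) in the
 * flowid, a 16-bit shift zeroes the value and every flow lands on
 * port 0 of the lagg.
 */
static uint32_t
lagg_port_index(uint32_t flowid, int shift, uint32_t nports)
{
	return ((flowid >> shift) % nports);
}
/*
 * e.g. queue-index flowids 0..3 with shift 16 on a 2-port lagg all map
 * to port 0; with shift 0 the low bits spread flows across both ports.
 */
```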
>>
>> For example, the Intel driver family also does not use such high bits for flowid,
>> just like the mentioned bxe(4).
>>
> scottl@moe:~/svn/head/sys/dev % grep -r iri_flowid *
> bnxt/bnxt_txrx.c: ri->iri_flowid = le32toh(rcp->rss_hash);
> bnxt/bnxt_txrx.c: ri->iri_flowid = le32toh(tpas->low.rss_hash);
> e1000/em_txrx.c: ri->iri_flowid = le32toh(rxd->wb.lower.hi_dword.rss);
> e1000/igb_txrx.c: ri->iri_flowid =
> ixgbe/ix_txrx.c: ri->iri_flowid = le32toh(rxd->wb.lower.hi_dword.rss);
>
> The number of drivers that set m_pkthdr.flowid directly to an RSS hash looks
> to be:
>
> cxgb
> cxgbe
> mlx4
> mlx5
> qlnx
> qlxgbe
> qlxge
> vmxnet3
>
> Maybe the hardware doesn't do a great job of generating a useful RSS hash,
> but that's tangential to this conversation.
>
>> We should change the flowid_shift default to 0 for if_lagg(4), shouldn't we?
>>
> In the short term, yes. What I see is that it's too expensive to do a hash calculation
> on every TX packet in LACP (for anything resembling line rate), and flowid is unreliable
> when a connection is initiated without an RX packet triggering it. I don't understand
> the consequences of the TX initiation problem well enough to offer a solution. For the
> problem of flowid being used inconsistently by drivers (i.e. not populating all 32 bits
> or using a weak RSS), that's really a driver problem.
>
> What I'd recommend is that the LACP code check for M_HASHTYPE_NONE and
> M_HASHTYPE_OPAQUE and calculate a new hash if either is set (effectively
> ignoring the flowid). It's then up to the drivers to set the correct hash type that
> corresponds with what they're putting into the flowid field. An opaque type would
> mean that the value is useful to the driver but should not be considered useful
> anywhere else. You'll get more correct and less surprising behavior from that, at
> the penalty of every TX packet requiring a hash calculation for hardware/drivers
> that are crummy.
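Scott's suggested policy can be sketched as follows. This is a hedged userland illustration, not the actual sys/net/ieee8023ad_lacp.c code: the struct, the constant values, and the fallback hash are simplified stand-ins for the real definitions in sys/sys/mbuf.h, which differ in detail.

```c
#include <stddef.h>
#include <stdint.h>

/* Simplified stand-ins for the kernel hash-type constants; the real
 * values and the full set live in sys/sys/mbuf.h. */
#define M_HASHTYPE_NONE		0
#define M_HASHTYPE_RSS_TCP_IPV4	2
#define M_HASHTYPE_OPAQUE	63

/* Toy mbuf carrying just the fields this sketch needs. */
struct mbuf_sketch {
	uint32_t flowid;	/* driver-supplied hash, if any */
	uint8_t  hashtype;	/* what kind of value flowid holds */
	uint8_t  hdr[12];	/* stand-in for the packet headers */
};

/* Placeholder software hash (FNV-1a over the header bytes), standing
 * in for whatever hash LACP would compute on the TX path. */
static uint32_t
sw_hash(const struct mbuf_sketch *m)
{
	uint32_t h = 2166136261u;

	for (size_t i = 0; i < sizeof(m->hdr); i++) {
		h ^= m->hdr[i];
		h *= 16777619u;
	}
	return (h);
}

/* The proposed policy: trust the flowid only when the driver
 * advertised a meaningful hash type.  If the type is NONE or OPAQUE,
 * ignore the flowid and fall back to a software hash, so values that
 * are only meaningful to the driver never steer port selection. */
static uint32_t
lacp_select_hash(const struct mbuf_sketch *m)
{
	if (m->hashtype == M_HASHTYPE_NONE ||
	    m->hashtype == M_HASHTYPE_OPAQUE)
		return (sw_hash(m));
	return (m->flowid);
}
```

The cost Scott notes shows up in the first branch: every packet from a driver that sets NONE or OPAQUE pays for a software hash on TX.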
Mixing the hash methods causes problems with out-of-order packets even just for the
first packet, and using a hash which is not what's configured by lagghash is confusing
at best, so that could be fixed to say "flowid" if that's what's going to happen, or
perhaps updated to the hash type that flowid represents if that's possible.

LACP already checks for M_HASHTYPE_NONE if use_flowid is enabled and manually
calculates a hash, which is what leads to the first-packet port selection inconsistency.

It's not clear what all the implications of the first-packet port inconsistency are; it
will likely depend on a large number of factors including network hardware, layout and
destination host + config. But when we've seen this hit the 3rd and 4th packets of a
stream, it leads to the destination sending a RST, dropping the stream on the floor for
2% of all streams on our test box with 2 x ixgbe interfaces.

Questions:
1. Is it possible to determine the hash method used by the HW and use that for all
   first packets?
2. Is it possible to significantly improve the performance of the CPU hashing?
3. Is it possible to tell from the mbuf that it's part of a connection instigated from
   the current host?

    Regards
    Steve
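The reordering failure described above can be shown with a toy model. This is illustrative only (nothing here is kernel code; the per-port fixed delays are an assumption of the model): the first packets of a stream are software-hashed to one lagg port while later packets carry a hardware flowid mapping to the other port, and if the second port has lower latency a later packet overtakes an earlier one, which is the kind of out-of-order arrival that can provoke an RST from the destination.

```c
/*
 * Toy model of the first-packet port inconsistency: each packet i is
 * sent at tick i on one of two ports, and each port adds a fixed
 * transit delay (an assumption of this model).  Returns 1 if any
 * packet arrives before its predecessor, i.e. the stream reorders.
 */
#define NPKTS 5

static int
stream_reordered(const int port_of_pkt[NPKTS], const int delay[2])
{
	int arrival[NPKTS];

	for (int i = 0; i < NPKTS; i++)
		arrival[i] = i + delay[port_of_pkt[i]];
	for (int i = 1; i < NPKTS; i++)
		if (arrival[i] < arrival[i - 1])
			return (1);
	return (0);
}
```

With the first two packets software-hashed to a slower port 0 and the rest flowid-steered to port 1, packet 3 overtakes packet 2; keeping the whole stream on one port cannot reorder in this model, which is why consistent port selection for a stream matters.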