From owner-svn-src-all@freebsd.org Fri Jan 5 23:30:06 2018 Return-Path: Delivered-To: svn-src-all@mailman.ysv.freebsd.org Received: from mx1.freebsd.org (mx1.freebsd.org [IPv6:2001:1900:2254:206a::19:1]) by mailman.ysv.freebsd.org (Postfix) with ESMTP id 85381EA6C40; Fri, 5 Jan 2018 23:30:06 +0000 (UTC) (envelope-from scottl@samsco.org) Received: from out2-smtp.messagingengine.com (out2-smtp.messagingengine.com [66.111.4.26]) (using TLSv1.2 with cipher ECDHE-RSA-AES256-GCM-SHA384 (256/256 bits)) (Client did not present a certificate) by mx1.freebsd.org (Postfix) with ESMTPS id 4906665705; Fri, 5 Jan 2018 23:30:05 +0000 (UTC) (envelope-from scottl@samsco.org) Received: from compute6.internal (compute6.nyi.internal [10.202.2.46]) by mailout.nyi.internal (Postfix) with ESMTP id 7F63121F7D; Fri, 5 Jan 2018 18:30:04 -0500 (EST) Received: from frontend2 ([10.202.2.161]) by compute6.internal (MEProxy); Fri, 05 Jan 2018 18:30:04 -0500 DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=samsco.org; h=cc :content-transfer-encoding:content-type:date:from:in-reply-to :message-id:mime-version:references:subject:to:x-me-sender :x-me-sender:x-sasl-enc; s=fm2; bh=NYvNEetVYIXvZxcxJhCY/6HK1yIF1 S+0Yi3ujabkLyc=; b=F1xbOgeKhwL1woJWSltPKHiJSQllreZU+gTgctjH++GYH FSkwqe0IzX6ev61M5z6n7/KPmKq8diaPwdzpMz/CTp5VmhyE0LxNCIO2Tl0a2b7r gJWp6dBNkbN3GjWHcdwGeowyjrSFgeovuWktuf0VJVAcm2FUxy0SeWrS16p3DDme B8iLIgMdo+XcBf6G+ApOPfE0/5ufY2Mssb2CLj6bVLYedkzjZygf+PphPH9fKJcS CQYkLlZ2vVDkrkmr1KwYPKvdGIG/maOACsANKXjm5PVhq8m7nHbawkfiXU81FdA5 BBHRBX1KHWrrS/pMyw1QbNr25h0GNHg3Hb0kKbgvg== DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d= messagingengine.com; h=cc:content-transfer-encoding:content-type :date:from:in-reply-to:message-id:mime-version:references :subject:to:x-me-sender:x-me-sender:x-sasl-enc; s=fm1; bh=NYvNEe tVYIXvZxcxJhCY/6HK1yIF1S+0Yi3ujabkLyc=; b=YZu9sSjHDqxEWrMXHZEHD5 Df1QOtHVGp9qM2BlUAKn4JN5NUdqBF4bb8hZM8ZrCGSlFjdHKmd0CXPmohs3l/HK Y7tLxFKeq8In6MzJxEkK//8H59kpt6VGMFgyBnDgz4rkXyFlhFPOQmEzBVQSe/aS tvCP0TCB3fORMN9RlzuQ3uHkjCrfry4aGzGYQWhwH+Oq0YaaV7U5fM5n1/YtjfJ7 MFgsa6UTcb+MWKjukfstMFR4jGw3DnGSf5Fu4Vbk4/izTHLmlkA4dVs7b2ie4sQD ecHs4bqqTE8+nGfCud39sjzQCPToKoWu3SPFLbjOR+3tPicHspTtPkxUN5SUjvKA == X-ME-Sender: Received: from [192.168.0.106] (unknown [161.97.249.191]) by mail.messagingengine.com (Postfix) with ESMTPA id AF98524599; Fri, 5 Jan 2018 18:30:03 -0500 (EST) Content-Type: text/plain; charset=utf-8 Mime-Version: 1.0 (Mac OS X Mail 11.1 \(3445.4.7\)) Subject: Re: svn commit: r327559 - in head: . sys/net From: Scott Long In-Reply-To: <5A4FC1E4.9060301@grosbein.net> Date: Fri, 5 Jan 2018 16:30:00 -0700 Cc: Matt Joras , Steven Hartland , src-committers , svn-src-all@freebsd.org, svn-src-head@freebsd.org, scottl@freebsd.org Content-Transfer-Encoding: quoted-printable Message-Id: References: <201801042005.w04K5liB049411@repo.freebsd.org> <5A4E9397.9000308@grosbein.net> <5A4EDC62.50508@grosbein.net> <5A4F824C.1060405@grosbein.net> <97d173fb-4f47-609d-8319-07282a283473@multiplay.co.uk> <5A4FB6BC.6090506@grosbein.net> <5A4FC1E4.9060301@grosbein.net> To: Eugene Grosbein X-Mailer: Apple Mail (2.3445.4.7) X-BeenThere: svn-src-all@freebsd.org X-Mailman-Version: 2.1.25 Precedence: list List-Id: "SVN commit messages for the entire src tree \(except for " user" and " projects" \)" List-Unsubscribe: , List-Archive: List-Post: List-Help: List-Subscribe: , X-List-Received-Date: Fri, 05 Jan 2018 23:30:06 -0000 > On Jan 5, 2018, at 11:20 AM, Eugene Grosbein = wrote: >=20 > CC'ng scottl@ as author of the change in question. >=20 > 06.01.2018 0:39, Matt Joras wrote: >=20 >>>> For what it's worth, this was the conclusion I came to, and at = Isilon >>>> we've made the same change being discussed here. For the case of >>>> drivers that end up using a queue index for the flowid, you end up >>>> with pathological behavior on the lagg; the flowid ends up getting >>>> right shifted by (default) 16. So in the case of e.g. two bxe(4) >>>> interfaces with 4 queues, you always end up choosing the interface = in >>>> the lagg at index 0. >=20 > Then why does if_lagg shifts 16 bits by default? Is seems senseless. > This was introduced with r260070 by scottl: >=20 At the time, we were using cxgbe interfaces which inserted a reasonable = RSS hash into the flowid field. The shift was done to expose different bits = to the index/modulo scheme used by the LACP module vs the cxgbe module. In hindsight, I should not have set a default value of 16, I should have = left it at zero so that default behavior would be preserved. >> Multi-queue NIC drivers and multi-port lagg tend to use the same = lower >> bits of the flowid as each other, resulting in a poor distribution of >> packets among queues in certain cases. Work around this by adding a >> set of sysctls for controlling a bit-shift on the flowid when doing >> multi-port aggrigation in lagg and lacp. By default, lagg/lacp will >> now use bits 16 and higher instead of 0 and higher. >>=20 >> Reviewed by: max >> Obtained from: Netflix >> MFC after: 3 days >=20 > This commit message does not point to an example of NIC driver that = would set > non-zero bits 16 and higher for flowid so that shift result would be = non-zero. > Do we really have such a driver? >=20 Yes. > Anyway, this lagg's default seems to be very driver-centric. >=20 > For example, Intel driver family also do not use such high bits for = flowid > just like mentioned bxe(4). >=20 scottl@moe:~/svn/head/sys/dev % grep -r iri_flowid * bnxt/bnxt_txrx.c: ri->iri_flowid =3D = le32toh(rcp->rss_hash); bnxt/bnxt_txrx.c: ri->iri_flowid =3D = le32toh(tpas->low.rss_hash); e1000/em_txrx.c: ri->iri_flowid =3D = le32toh(rxd->wb.lower.hi_dword.rss); e1000/igb_txrx.c: ri->iri_flowid =3D ixgbe/ix_txrx.c: ri->iri_flowid =3D = le32toh(rxd->wb.lower.hi_dword.rss); The number of drivers that set m_pkhhdr.flowid directly to an RSS hash = looks to be: cxgb cxgbe mlx4 mlx5 qlnx qlxgbe qlxge vmxnet3 Maybe the hardware doesn=E2=80=99t do a great job with generating a = useful RSS hash, but that=E2=80=99s tangential to this conversation. > We should change flowid_shift default to 0 for if_lagg(4), shouldn't = we? >=20 In the short term, yes. What I see is that it=E2=80=99s too expensive = to do a hash calculation on every TX packet in LACP (for anything resembling line rate), and = flowid is unreliable when a connection is initiated without an RX packet triggering it. I = don=E2=80=99t understand the consequences of the TX initiation problem well enough to offer a = solution. For the problem of flowid being used inconsistently by drivers (i.e. not = populating all 32 bits or using a weak RSS), that=E2=80=99s really a driver problem. What I=E2=80=99d recommend is that the LACP code check for = M_HASHTYPE_NONE and M_HASHTYPE_OPAQUE and calculate a new hash if either are set = (effectively ignoring the flowid). It=E2=80=99s then up to the drivers to set the = correct hash type that corresponds with what they=E2=80=99re putting into the flowid field. An = opaque type would mean that the value is useful to the driver but should not be considered = useful anywhere else. You=E2=80=99ll get more correct and less surprising = behavior from that, at the penalty of every TX packet requiring a hash calculation for = hardware/drivers that are crummy. Scott