Date:      Thu, 4 Jan 2018 18:55:20 -0800
From:      Adrian Chadd <adrian.chadd@gmail.com>
To:        Steven Hartland <steven@multiplay.co.uk>
Cc:        hiren panchasara <hiren@strugglingcoder.info>, Eugene Grosbein <eugen@grosbein.net>,  src-committers@freebsd.org, svn-src-all@freebsd.org, svn-src-head@freebsd.org
Subject:   Re: svn commit: r327559 - in head: . sys/net
Message-ID:  <CAJ-Vmom8jOsMnzVX276JRe5kN-OM%2BCH%2B6LKtEL_hCgh6-XD2kg@mail.gmail.com>
In-Reply-To: <63c3c450-aeaf-bdd5-5e16-414146c9bb3a@multiplay.co.uk>
References:  <201801042005.w04K5liB049411@repo.freebsd.org> <5A4E9397.9000308@grosbein.net> <f133b587-1f7e-4594-31d1-974775ad55be@freebsd.org> <20180104224214.GD18879@strugglingcoder.info> <63c3c450-aeaf-bdd5-5e16-414146c9bb3a@multiplay.co.uk>

Does it also happen when you actually enable RSS in the kernel? I ask
because I went through a whole lot of pain to assign a flowid at
connection setup time.



-a


On 4 January 2018 at 15:37, Steven Hartland <steven@multiplay.co.uk> wrote:
>
>
> On 04/01/2018 22:42, hiren panchasara wrote:
>
> On 01/04/18 at 09:52P, Steven Hartland wrote:
>
> On 04/01/2018 20:50, Eugene Grosbein wrote:
>
> 05.01.2018 3:05, Steven Hartland wrote:
>
> Author: smh
> Date: Thu Jan  4 20:05:47 2018
> New Revision: 327559
> URL: https://svnweb.freebsd.org/changeset/base/327559
>
> Log:
>    Disabled the use of flowid for lagg by default
>
>    Disabled the use of RSS hash from the network card aka flowid for
>    lagg(4) interfaces by default as it's currently incompatible with
>    the lacp and loadbalance protocols.
>
>    The incompatibility is due to the fact that the flowid isn't known
>    for the first packet of a new outbound stream which can result in
>    the hash calculation method changing and hence a stream being
>    incorrectly split across multiple interfaces during normal
>    operation.
>
>    This can be re-enabled by setting the following in loader.conf:
>    net.link.lagg.default_use_flowid="1"
>
>    Discussed with: kmacy
>    Sponsored by:	Multiplay
>
> RSS by definition applies to a received stream. What is an "outbound" stream
> in this context, why can the hash calculation method change, and what exactly
> does "a stream being incorrectly split" mean?
>
> Yes, RSS does indeed relate to the received stream, but its hash is used by
> lagg for the lacp and loadbalance protocols to decide which port of the lagg
> to "send" the packet out of. As the flowid is not known when a new "output"
> stream is instigated, the current code falls back to manual hash calculation
> to determine which port to send the initial packet from. Once a response is
> received, subsequent transmits use the flowid. This change of hash
> calculation method can result in the initial packet being sent from a
> different port than the rest of the stream; this is what I meant by
> "incorrectly split".
>
> For my understanding, is this just an issue for the first packet when we
> originate the flow? Once we have a response, and if the flowid is there, we'd
> use it, right? Or am I missing something?
>
> Initially yes, but that can cause a whole cascading set of problems. If the
> source machine sends from two different ports then the flow can traverse
> the network using different paths and hence arrive at the destination on
> different ports too, causing the corresponding issue on the other side.
>
> And with this change, we'd always go and do manual calculation even when
> we have a valid flowid (i.e. we didn't initiate a connection)?
>
> Correct, but there's potentially no easy way to correctly determine what the
> flowid, and hence the hash, should be in this case; it is likely impossible
> if the lagg consists of different interface types.
>
> In addition, if the hardware hash doesn't match the one requested via
> laggproto, then additional issues could also be triggered.
>
> Our TCP stack seems fragile during connection setup when faced with the
> out-of-order packets that this multipath behavior causes. We've seen this on
> our loadbalancers, which is what triggered the investigation. The concrete
> result is many aborted TCP connections: over 300k (~2%) on the machine I'm
> looking at.
>
> I hope there are some improvements that can be made. For example, if we can
> determine that the stream was instigated remotely then the flowid would
> always be valid, hence we could use it, assuming it matches the requested
> spec. Alternatively, we could make it clear to the user that laggproto is
> not the one they requested. I'm open to ideas.
>
>     Regards
>     Steve
>
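
For readers following the thread, the port-selection behaviour described in
the quoted message can be sketched roughly as below. This is a simplified
illustration only, not the lagg(4) source: the names select_port and
software_hash are hypothetical, CRC32 stands in for whatever software hash
the kernel computes over the packet headers, and the NIC flowid value is
invented so that it lands on the other port.

```python
import zlib


def software_hash(src, dst, sport, dport):
    """Stand-in for the manual header hash (assumption: CRC32 of the 4-tuple)."""
    key = f"{src}:{sport}-{dst}:{dport}".encode()
    return zlib.crc32(key)


def select_port(nports, flow, flowid=None, use_flowid=True):
    """Pick an output port index for one packet of a flow.

    If flowid use is enabled and the packet carries a flowid, that value
    decides the port; otherwise fall back to the software hash.
    """
    if use_flowid and flowid is not None:
        return flowid % nports
    return software_hash(*flow) % nports


flow = ("10.0.0.1", "10.0.0.2", 49152, 80)
nports = 2

# First outbound packet of a locally originated flow: no flowid is known yet.
first = select_port(nports, flow)

# Hypothetical flowid learned from the NIC once a response arrives, chosen
# here so that it maps to the other port.
nic_flowid = first + 1
later = select_port(nports, flow, flowid=nic_flowid)

# With flowid use disabled, every packet of the flow uses the software hash.
later_disabled = select_port(nports, flow, flowid=nic_flowid, use_flowid=False)

print("first packet port:", first)
print("later packet port (flowid enabled):", later)
print("later packet port (flowid disabled):", later_disabled)
```

With flowid use enabled, the first packet and the later packets of this one
flow can leave on different ports; with it disabled, they stay on the same
port, which is the effect of the new default described in the commit log.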


