From: Adrian Chadd
To: "Alexander V. Chernikov"
Cc: "freebsd-net@freebsd.org"
Date: Fri, 17 Jan 2014 15:49:05 -0800
Subject: Re: ECMP hash keys?
In-Reply-To: <52D996FD.6090901@FreeBSD.org>
References: <52D5138B.8050100@fsn.hu> <52D6525D.50102@FreeBSD.org>
 <52D84DB0.4050607@FreeBSD.org> <52D996FD.6090901@FreeBSD.org>
List-Id: Networking and TCP/IP with FreeBSD

On 17 January 2014 12:47, Alexander V. Chernikov wrote:
> On 17.01.2014 02:08, Adrian Chadd wrote:

>> The reason you need to make sure that you end up with hashes for both
>> src,dst and dst,src being equivalent is to ensure that when you create
>> an outbound socket, you know up front which path the receive path is
>> going to come back on. Right now we don't mark new connections -
>> inbound or outbound - with a flowid until we've received some data on
>> it.
> Well, this seems reasonable.
>
> However, how do you plan to interact with hardware RSS?

Well, if it's doing Toeplitz in hardware, we'll just use that.
DragonFly BSD does this. They program the RSS registers on startup to
map parts of the RSS space to CPUs as required.

But if it isn't, we will have to do our own Toeplitz hashing in
software. I thought the majority of NICs these days do the Toeplitz
calculation in hardware anyway.

> For example, currently Intel used to set flowid to cpu number (which can
> be reasonable in some cases). Afair 82599 advanced RX descriptor
> contains original value that can be extracted, but we can't change cpu
> on which packet arrives on (well, we can reprogram indirection table, but..)

Well, that's the point, right?

> I can't see any easy way to accomplish custom SW RSS:
>
> We can possibly have 1-2-4 ingress HW queues per NIC, ignore driver
> flowid, re-calculate with modified Toeplitz or similar and push to other
> ncpu-1 neisr queues. That can work, but requires custom setup
> (especially for lagg scenarios) and works well for small subset of
> workloads.

Well, lagg is the same but different. I.e., we still choose the outbound
TX queue on _a_ NIC based on the CPU/netisr derived from flowid. But
the outbound NIC has to be chosen a different way, or you end up with
sub-optimal TX queue selection. Scott found this @ Netflix, and this is
why lagg now doesn't use the low bits of the flowid when it chooses
which port to send _out_ on.

> It seems also guessing ingress flowid is not very much different between
> symmetric and asymmetric hashing approaches.

I think the problem here is that flowid has been a mostly-opaque value
for way too long. I like the DragonFly BSD approach - they added a
hashid, not flowid, and the netisr path checks to see if the driver
has stamped it with a hardware Toeplitz hashid or not. If not, it does
its own hashing and punts the frame to the correct netisr RX queue on
the right CPU.

For routing it may not matter as much; we could just short-circuit
that so it runs on the current CPU all the way to transmit. For NAT,
it may be worthwhile keeping the per-flow state local on a given CPU
to exploit various cache/lock coherencies.

I guess the fallout from all of this is that I'd rather we had
better-specified things like "what is flowid", "how can we specify
affinity", etc., so we can use it if we want, and not use it if we
don't. Right now we have a "kind of but not quite done" way of doing
affinity, enough to mostly not break TCP/UDP flow ordering, but not
enough to really exploit affinity.


-a
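
For reference, here is a minimal sketch of the "src,dst and dst,src hash to the same value" idea combined with a software Toeplitz fallback. It is illustrative only and is not FreeBSD or DragonFly BSD code. The assumptions: the key is a 16-bit pattern repeated across 40 bytes (one known way to make Toeplitz symmetric, since swapping addresses or ports moves bits by a multiple of 16), the input layout is IPv4 src, dst, src port, dst port, and the names `toeplitz_hash`, `flow_hash`, `pick_cpu` and `SYMMETRIC_KEY` are hypothetical. The modulo in `pick_cpu` is a stand-in for a real indirection table.

```python
import ipaddress

# Assumption: a 40-byte key made of one repeated 16-bit pattern, so every
# 32-bit key window repeats with period 16 bits.
SYMMETRIC_KEY = bytes([0x6D, 0x5A]) * 20


def toeplitz_hash(key: bytes, data: bytes) -> int:
    """Software Toeplitz: for every set input bit (MSB first), XOR in the
    32-bit window of the key that starts at that bit position."""
    key_bits = len(key) * 8
    if key_bits < len(data) * 8 + 32:
        raise ValueError("key too short for this input")
    key_int = int.from_bytes(key, "big")
    result = 0
    for i in range(len(data) * 8):
        if data[i // 8] & (0x80 >> (i % 8)):
            result ^= (key_int >> (key_bits - 32 - i)) & 0xFFFFFFFF
    return result


def flow_hash(src: str, dst: str, sport: int, dport: int,
              key: bytes = SYMMETRIC_KEY) -> int:
    """Hash an IPv4 4-tuple laid out as src addr, dst addr, src port, dst port."""
    data = (ipaddress.IPv4Address(src).packed
            + ipaddress.IPv4Address(dst).packed
            + sport.to_bytes(2, "big")
            + dport.to_bytes(2, "big"))
    return toeplitz_hash(key, data)


def pick_cpu(hash_value: int, ncpus: int) -> int:
    """Hypothetical dispatch step: simplified to a modulo instead of an
    indirection table."""
    return hash_value % ncpus


if __name__ == "__main__":
    fwd = flow_hash("10.0.0.1", "192.0.2.7", 12345, 80)
    rev = flow_hash("192.0.2.7", "10.0.0.1", 80, 12345)
    # Both directions of the flow land on the same CPU.
    print(fwd == rev, pick_cpu(fwd, 8) == pick_cpu(rev, 8))
```

With such a key the CPU for the return path can be computed when the outbound socket is created, before any data arrives. The trade-off is that a 16-bit-periodic key produces a hash whose upper and lower 16 bits are identical, so it spreads flows less well than a non-repeating key.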