From owner-freebsd-net@freebsd.org Fri Jul 22 19:23:37 2016
From: Adrian Chadd <adrian.chadd@gmail.com>
Date: Fri, 22 Jul 2016 12:23:35 -0700
Subject: Re: proposal: splitting NIC RSS up from stack RSS
To: Sepherosa Ziehau
Cc: Andrew Gallatin, FreeBSD Net
References: <306af514-70ff-f3bf-5b4f-da7ac1ec6580@cs.duke.edu>
List-Id: Networking and TCP/IP with FreeBSD

On 21 July 2016 at 18:54, Sepherosa Ziehau wrote:
> On Fri, Jul 22, 2016 at 6:39 AM, Adrian Chadd wrote:
>> hi,
>>
>> Cool! Yeah, the RSS bits thing can be removed, as it's just doing a
>> bitmask instead of a % operator to do the mapping. I think we can just
>> go to %, and if people need the extra speed of a power-of-two
>> operation, they can reintroduce it.
>
> I thought about it a while ago (the most popular E5-2560v{1,2,3} only
> has 6 cores, but E5-2560v4 has 8 cores! :). Since the raw RSS hash
> value is '& 0x1f'-ed (I believe most of the NICs use the 128-entry
> indirection table defined by MS RSS) to select an entry in the
> indirection table, simply applying '%' to the raw RSS hash value
> probably will not work properly; you will need at least
> (hash & 0x1f) % mp_ncpus. And since the indirection table's size is
> 128, you will still get somewhat uneven CPU load for a
> non-power-of-2 CPU count. If you take CPU affinity into
> consideration, the situation becomes even more complex ...

hi,

Sure.
The biggest annoyance is that a lot of the kernel infrastructure for queueing packets (netisr) and scheduling stack work (callouts) is indexed by CPU, not by "thing". If it were indexed by "thing", then we could do a two-stage work redistribution method that scales O(1):

* packets get plonked into a "thing" via some mapping table - eg, map 128 or 256 buckets to queues that do work / schedule callouts / netisr; and
* the queues aren't tied to a CPU at that point, so they can get shuffled around by changing cpumasks.

It'd be really, really nice IMHO if netisr and callouts were "thing" based rather than CPU based, so we could shift work just by changing the CPU mask - then we don't have to worry about rescheduling packets or work onto the new CPU when we want to move load around. That doesn't risk out-of-order packet handling behaviour, and it means we can (in theory!) serve a given RSS bucket from more than one CPU, for things like TCP processing.

Trouble is, this is somewhat contentious. I could do the netisr change without upsetting people, but the callout code honestly makes me want to set everything (in sys/kern) on fire and start again. After all of the current issues with the callout subsystem, I'd rather see hps finish his work and land it in head, complete with more sensible lock semantics, before I look at breaking callouts out to not be per-CPU and instead let subsystems create their own worker pools for them. I'm sure NFS and CAM would like that kind of thing too.

Since people have asked me about this in the past: the side effect of supporting dynamic hash mapping (even in software) is that for any given flow, once you change the hash mapping you will have some of that flow's packets in the old queue and some in the new queue.
For things like stack TCP/UDP using pcbgroups, that can range from merely slow to (eventually, once the global list goes away) plainly not delivering packets to the right pcb/socket - which is okay for some workloads and not for others.

That may be a fun project to work on once the general stack / driver tidy-ups are done, but I'm going to resist doing it myself for a while, because it'll introduce the uncertainty above, and the resulting out-of-order behaviour will likely generate more problem reports than I want to handle. (Read: since I'm doing this for free, I'm not going to do anything risky, as I'm not getting paid to wade through the repercussions right now.)

FWIW, we hit this same problem in ye olde past with squid and WCCP's hash-based system. Squid's WCCP implementation was simple and static. The commercial solutions (read: cisco, etc) handled the cache set / hash traffic map changing by having the caches redirect traffic to the /old/ cache whenever the hash or cache set changed. Squid didn't do this out of the box, so if the cache topology changed it would send traffic to the wrong box and existing connections would break.

-adrian