Date: Fri, 22 Jul 2016 12:23:35 -0700
From: Adrian Chadd <adrian@freebsd.org>
To: Sepherosa Ziehau <sepherosa@gmail.com>
Cc: Andrew Gallatin <gallatin@cs.duke.edu>, FreeBSD Net <freebsd-net@freebsd.org>
Subject: Re: proposal: splitting NIC RSS up from stack RSS
Message-ID: <CAJ-Vmomj5XjtmqbTukmxqdiF_A-Ga1jFMA5r24=CXcG0gueYsg@mail.gmail.com>
In-Reply-To: <CAMOc5cxEWqWOMPSXFe3=N5S93bs8RO-XX22QghtHd8vC5xuNjA@mail.gmail.com>
References: <CAJ-Vmo=Wj3ZuC6mnVCxonQ74nfEmH7CE=TP3xhLzWifdBxxfBQ@mail.gmail.com>
 <306af514-70ff-f3bf-5b4f-da7ac1ec6580@cs.duke.edu>
 <CAJ-VmomHYVCknVkDLF+b8Gc5wBWxkddEMY3dhvbxJihLZHyTLg@mail.gmail.com>
 <CAMOc5cxEWqWOMPSXFe3=N5S93bs8RO-XX22QghtHd8vC5xuNjA@mail.gmail.com>
On 21 July 2016 at 18:54, Sepherosa Ziehau <sepherosa@gmail.com> wrote:
> On Fri, Jul 22, 2016 at 6:39 AM, Adrian Chadd <adrian@freebsd.org> wrote:
>> hi,
>>
>> Cool! Yeah, the RSS bits thing can be removed, as it's just doing a
>> bitmask instead of a % operator to do mapping. I think we can just go
>> to %, and if people need the extra speed from a power-of-two
>> operation, they can reintroduce it.
>
> I thought about it a while ago (the most popular E5-2560v{1,2,3} only
> have 6 cores, but the E5-2560v4 has 8 cores! :). Since the raw RSS
> hash value is '& 0x1f'-ed (I believe most of the NICs use a 128-entry
> indirection table as defined by MS RSS) to select an entry in the
> indirection table, simply applying '%' to the raw RSS hash value
> probably will not work properly; you will need
> (hash & 0x1f) % mp_ncpus at least. And since the indirection table's
> size is 128, you will still get some uneven CPU workload for
> non-power-of-2 CPU counts. And if you take CPU affinity into
> consideration, the situation becomes even more complex ...

Hi,

Sure. The biggest annoyance is that a lot of the kernel infrastructure
for queueing packets (netisr) and scheduling stack work (callouts) is
indexed on CPU, not on "thing". If it were indexed on "thing", we could
do a two-stage work redistribution method that would scale O(1):

* packets get plonked into a "thing" via some mapping table - e.g., map
  128 or 256 buckets to queues that do work / schedule callouts /
  netisr; and

* the queues aren't tied to a CPU at this point, so they can get
  shuffled around by using cpumasks.

It'd be really, really nice IMHO if netisr and callouts were "thing"
based rather than "cpu" based, so we could shift work just by changing
the CPU mask - then we don't have to worry about rescheduling packets
or work onto the new CPU when we want to move load around. That doesn't
risk out-of-order packet handling behaviour, and it means we can (in
theory!) put a given RSS bucket onto more than one CPU, for things like
TCP processing.

Trouble is, this is somewhat contentious. I could do the netisr change
without upsetting people, but the callout code honestly makes me want
to set everything (in sys/kern) on fire and start again. After all of
the current issues with the callout subsystem, I'd rather see hps
finish his work and land it in head, complete with more sensible lock
semantics, before I look at breaking it out so it's not per-CPU based
but instead lets subsystems create their own worker pools for callouts.
I'm sure NFS and CAM would like this kind of thing too.

Since people have asked me about this in the past: the side effect of
supporting dynamic hash mapping (even in software) is that for any
given flow, once you change the hash mapping you will have some packets
of said flow in the old queue and some packets in the new queue. For
things like stack TCP/UDP where it's using pcbgroups, it can vary from
being slow to (eventually, when the global list goes away) plainly not
making it to the right pcb/socket, which is okay for some workloads and
not for others.

That may be a fun project to work on once the general stack / driver
tidy-ups are done, but I'm going to resist doing it myself for a while,
because it'll introduce the above uncertainties and the resulting
out-of-order behaviour will likely generate more problem reports than I
want to handle. (Read: since I'm doing this for free, I'm not going to
do anything risky, as I'm not getting paid to wade through the
repercussions right now.)
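[To make the two-stage idea above concrete, here is a minimal userland C
sketch. It is not FreeBSD code: RSS_NBUCKETS, NCPUS, bucket_cpu_map,
rss_bucket_of and rss_cpu_of are invented names for illustration. Stage
one maps the RSS hash to one of 128 buckets with a mask; stage two maps
buckets to CPUs through a small rewritable table, so "moving load" means
rewriting table entries rather than re-hashing flows onto CPUs.]

#include <stdint.h>
#include <stdio.h>

/*
 * Illustrative sketch only: 128 RSS buckets ("things"), mapped to CPUs
 * via a rewritable table. All identifiers here are made up.
 */
#define RSS_NBUCKETS	128
#define NCPUS		6	/* a non-power-of-2 CPU count, as in the thread */

static int bucket_cpu_map[RSS_NBUCKETS];

/* Stage 1: hash -> bucket; a cheap mask since RSS_NBUCKETS is a power of 2. */
static inline int
rss_bucket_of(uint32_t hash)
{
	return (hash & (RSS_NBUCKETS - 1));
}

/* Stage 2: bucket -> CPU; a table lookup, so remapping is O(1) per bucket. */
static inline int
rss_cpu_of(uint32_t hash)
{
	return (bucket_cpu_map[rss_bucket_of(hash)]);
}

int
main(void)
{
	int i;

	/* Initial spread: round-robin, equivalent to bucket % NCPUS. */
	for (i = 0; i < RSS_NBUCKETS; i++)
		bucket_cpu_map[i] = i % NCPUS;

	/*
	 * Moving load is just rewriting table entries, not touching flows:
	 * e.g. shift every bucket that was on CPU 0 over to CPU 5.
	 */
	for (i = 0; i < RSS_NBUCKETS; i++)
		if (bucket_cpu_map[i] == 0)
			bucket_cpu_map[i] = 5;

	printf("hash 0xdeadbeef -> bucket %d -> cpu %d\n",
	    rss_bucket_of(0xdeadbeef), rss_cpu_of(0xdeadbeef));
	return (0);
}

[With 128 buckets and 6 CPUs the round-robin fill gives two CPUs 22
buckets and four CPUs 21, which is the mild unevenness Sepherosa points
out for non-power-of-2 CPU counts; the point of the indirection table is
that rebalancing is a table rewrite rather than a re-hash of live
flows, though any rewrite still splits in-flight flows between the old
and new queue as described above.]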
FWIW, we had this same problem in ye olde past with squid and WCCP and
its hash-based system. Squid's WCCP implementation was simple and
static. The commercial solutions (read: cisco, etc) handled the cache
set or hash-to-traffic map changing by having the caches redirect
traffic to the /old/ cache whenever the hash or cache set changed.
Squid didn't do this out of the box, so if the cache topology changed
it would send traffic to the wrong box and the existing connections
would break.

-adrian
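[Purely as an illustration of the redirect-to-the-old-owner trick
described above - this is not squid or WCCP code, and struct
bucket_owner, owner_for and NBUCKETS are invented names - a sketch
under the assumption that each hash bucket remembers its previous
owner while a map change is draining, so existing flows keep landing on
the old box and only new flows go to the new one.]

#include <stdbool.h>
#include <stdio.h>

#define NBUCKETS 16

struct bucket_owner {
	int	old_owner;	/* owner before the map changed */
	int	new_owner;	/* owner after the map changed */
	bool	draining;	/* still redirecting existing flows? */
};

static struct bucket_owner map[NBUCKETS];

/*
 * New flows go to the new owner; existing flows are bounced back to the
 * old owner until the bucket has finished draining.
 */
static int
owner_for(int bucket, bool existing_flow)
{
	struct bucket_owner *b = &map[bucket];

	if (existing_flow && b->draining)
		return (b->old_owner);
	return (b->new_owner);
}

int
main(void)
{
	map[3] = (struct bucket_owner){ .old_owner = 1, .new_owner = 2,
	    .draining = true };

	printf("existing flow in bucket 3 -> node %d\n", owner_for(3, true));
	printf("new flow in bucket 3      -> node %d\n", owner_for(3, false));
	return (0);
}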