Date: Thu, 19 Feb 2015 07:28:14 -0800 From: Adrian Chadd <adrian@freebsd.org> To: John Baldwin <jhb@freebsd.org> Cc: "freebsd-arch@freebsd.org" <arch@freebsd.org> Subject: Re: RFC: bus_get_cpus(9) Message-ID: <CAJ-VmokFzuN2Q4ds6zV2M8rd%2B%2BbZ5M0oTaiFFBYueTekCL0s0w@mail.gmail.com> In-Reply-To: <1848011.eGOHhpCEMm@ralph.baldwin.cx> References: <1848011.eGOHhpCEMm@ralph.baldwin.cx>
next in thread | previous in thread | raw e-mail | index | archive | help
Hi, On 19 February 2015 at 06:46, John Baldwin <jhb@freebsd.org> wrote: > One of the next steps for NUMA device-awareness is a way to let drivers know > which CPUs are ideal to use for interrupts (and in particular this is targeted > at multiqueue NICs that want to create a TX/RX ring pair per CPU). However, > for modern Intel systems at least, it is usually best to use CPUs from the > physical processor package that contains the I/O hub that a device connects to > (e.g. to allow DDIO to work). > > The PoC API I came up with is a new bus method called bus_get_cpus() that > returns a requested cpuset for a given device. It accepts an enum for the > second parameter that says the type of cpuset being requested. Currently two > valus are supported: > > - LOCAL_CPUS (on x86 this returns all the CPUs in the package closest to the > device when NUMA is enabled) > - INTR_CPUS (like LOCAL_CPUS but only returns 1 SMT thread for each core) > > For a NIC driver the expectation is that the driver will call > 'bus_get_cpus(dev, INTR_CPUS, &set)' and create queues for each of the CPUs in > 'set'. (In my current patchset I have updated igb(4) to use this approach.) > > For systems that do not support NUMA (or if it is not enabled in the kernel > config), LOCAL_CPUS is mapped to 'all_cpus' by default in the 'root_bus' > driver. INTR_CPUS is also mapped to 'all_cpus' by default. > > The x86 interrupt code maintains its own set of interrupt CPUs which this > patch now exposes via INTR_CPUS in the x86 nexus driver. > > The ACPI bus driver and PCI bridge drivers use _PXM to return a suitable > LOCAL_CPUS set when _PXM exists and NUMA is enabled. They also and the global > INTR_CPUS set from the nexus driver with the per-domain set from _PXM to > generate a local INTR_CPUS set for child devices. > > The current patch can be found here: > > https://github.com/bsdjhb/freebsd/compare/bsdjhb:master...numa_bus_get_cpus > > It includes a few other fixes besides the implementation of bus_get_cpu() (and > some things have already been committed such as > taskqueue_start_threads_cpuset() and CPU_COUNT()): > > - It fixes the x86 interrupt code to exclude modern SMT threads from the > default interrupt set. (Previously only Pentium 4-era HTT threads were > excluded.) > - It has a sample conversion of igb(4) to this interface (albeit ugly using > #if's). > > Longer term I think I would like to make the INTR_CPUS thing a bit more > formal. In particular, Solaris allows you to alter the set of CPUs that > handle interrupts via prctl (or a tool named something close to that). I > think I would like to have a dedicated global cpuset for that (but not named > "2", it would be a new WHICH level). That would allow userland to use cpuset > to alter the set of CPUs that handle interrupts in case you wanted to use SMT > for example. I think if we do this that all ithreads would have their cpusets > hang off of this set instead of the root set (which would also remove some of > the recent special case handling for ithreads I believe). The one uglier part > about this is that we should probably then have a way to notify drivers that > INTR_CPUS changed so that they could try to cope gracefully. I think that's a > bit of a longer horizon thing, but for now I think bus_get_cpus() is a good > next step. > > What do other folks think? (And yes, I know it needs a manpage before it goes > in, but I'd rather get the API agreed on before polishing that.) So I'd rather something slightly more descriptive about the iteration of cpu ids and cpusets for each queue in a driver. Eg, you're iterating over the interrupt set for a CPU, but it's still up to the driver to walk the cpuset array and figure out which queue goes to which CPU. For RSS I'm going to take this stuff and have the driver call into RSS to do this. It'll provide the deviceid, and it'll return: * how many queues? * for each queue id, what is the cpuset the interrupts/taskqueues should run on. It already does this, but it currently has no idea about the underlying device topology. (Later on for rebalancing we'll want to have those cpusets returned be some top level cpusets per RSS bucket that we can change and have everything cascade "right", but that's a later thing to worry about.) For the RSS and non-RSS case, I can see situations where the admin may wish to define a mapping for queues to cpusets that aren't necessarily 1:1 mapping like you've done. For example, saying "I want 16 queues, but I have four CPUs, here's how they're mapped." Right now drivers do round-robin, but it's hard coded. If we have an iteration API like what exists for RSS then we can hide the policy config in the bus code and have it check kenv/hints at boot time for what the config should be. So I'd like to have the API a little more higher level so we can do interesting things with it. -adrian
Want to link to this message? Use this URL: <https://mail-archive.FreeBSD.org/cgi/mid.cgi?CAJ-VmokFzuN2Q4ds6zV2M8rd%2B%2BbZ5M0oTaiFFBYueTekCL0s0w>