Date: Thu, 19 Feb 2015 10:37:28 -0500
From: John Baldwin <jhb@freebsd.org>
To: Adrian Chadd <adrian@freebsd.org>
Cc: "freebsd-arch@freebsd.org" <arch@freebsd.org>
Subject: Re: RFC: bus_get_cpus(9)
Message-ID: <6147240.5Rne9DUXyM@ralph.baldwin.cx>
In-Reply-To: <CAJ-VmokFzuN2Q4ds6zV2M8rd++bZ5M0oTaiFFBYueTekCL0s0w@mail.gmail.com>
References: <1848011.eGOHhpCEMm@ralph.baldwin.cx> <CAJ-VmokFzuN2Q4ds6zV2M8rd++bZ5M0oTaiFFBYueTekCL0s0w@mail.gmail.com>
On Thursday, February 19, 2015 07:28:14 AM Adrian Chadd wrote:
> Hi,
>
> On 19 February 2015 at 06:46, John Baldwin <jhb@freebsd.org> wrote:
> > One of the next steps for NUMA device-awareness is a way to let drivers
> > know which CPUs are ideal to use for interrupts (and in particular this
> > is targeted at multiqueue NICs that want to create a TX/RX ring pair per
> > CPU).  However, for modern Intel systems at least, it is usually best to
> > use CPUs from the physical processor package that contains the I/O hub
> > that a device connects to (e.g. to allow DDIO to work).
> >
> > The PoC API I came up with is a new bus method called bus_get_cpus() that
> > returns a requested cpuset for a given device.  It accepts an enum for
> > the second parameter that says the type of cpuset being requested.
> > Currently two values are supported:
> > - LOCAL_CPUS (on x86 this returns all the CPUs in the package closest to
> >   the device when NUMA is enabled)
> > - INTR_CPUS (like LOCAL_CPUS but only returns 1 SMT thread for each core)
> >
> > For a NIC driver the expectation is that the driver will call
> > 'bus_get_cpus(dev, INTR_CPUS, &set)' and create queues for each of the
> > CPUs in 'set'.  (In my current patchset I have updated igb(4) to use this
> > approach.)
> >
> > For systems that do not support NUMA (or if it is not enabled in the
> > kernel config), LOCAL_CPUS is mapped to 'all_cpus' by default in the
> > 'root_bus' driver.  INTR_CPUS is also mapped to 'all_cpus' by default.
> >
> > The x86 interrupt code maintains its own set of interrupt CPUs which this
> > patch now exposes via INTR_CPUS in the x86 nexus driver.
> >
> > The ACPI bus driver and PCI bridge drivers use _PXM to return a suitable
> > LOCAL_CPUS set when _PXM exists and NUMA is enabled.  They also AND the
> > global INTR_CPUS set from the nexus driver with the per-domain set from
> > _PXM to generate a local INTR_CPUS set for child devices.
> >
> > The current patch can be found here:
> >
> > https://github.com/bsdjhb/freebsd/compare/bsdjhb:master...numa_bus_get_cpus
> >
> > It includes a few other fixes besides the implementation of
> > bus_get_cpus() (and some things have already been committed such as
> > taskqueue_start_threads_cpuset() and CPU_COUNT()):
> > - It fixes the x86 interrupt code to exclude modern SMT threads from
> >   the default interrupt set.  (Previously only Pentium 4-era HTT
> >   threads were excluded.)
> > - It has a sample conversion of igb(4) to this interface (albeit ugly
> >   using #if's).
> >
> > Longer term I think I would like to make the INTR_CPUS thing a bit more
> > formal.  In particular, Solaris allows you to alter the set of CPUs that
> > handle interrupts via prctl (or a tool named something close to that).  I
> > think I would like to have a dedicated global cpuset for that (but not
> > named "2", it would be a new WHICH level).  That would allow userland to
> > use cpuset to alter the set of CPUs that handle interrupts in case you
> > wanted to use SMT for example.  I think if we do this that all ithreads
> > would have their cpusets hang off of this set instead of the root set
> > (which would also remove some of the recent special case handling for
> > ithreads I believe).  The one uglier part about this is that we should
> > probably then have a way to notify drivers that INTR_CPUS changed so that
> > they could try to cope gracefully.  I think that's a bit of a longer
> > horizon thing, but for now I think bus_get_cpus() is a good next step.
> >
> > What do other folks think?  (And yes, I know it needs a manpage before
> > it goes in, but I'd rather get the API agreed on before polishing that.)
>
> So I'd rather something slightly more descriptive about the iteration
> of cpu ids and cpusets for each queue in a driver.
>
> Eg, you're iterating over the interrupt set for a CPU, but it's still
> up to the driver to walk the cpuset array and figure out which queue
> goes to which CPU.
>
> For RSS I'm going to take this stuff and have the driver call into RSS
> to do this.  It'll provide the deviceid, and it'll return:
>
> * how many queues?
> * for each queue id, what is the cpuset the interrupts/taskqueues
>   should run on.
>
> It already does this, but it currently has no idea about the
> underlying device topology.
>
> (Later on for rebalancing we'll want to have those cpusets returned be
> some top level cpusets per RSS bucket that we can change and have
> everything cascade "right", but that's a later thing to worry about.)
>
> For the RSS and non-RSS case, I can see situations where the admin may
> wish to define a mapping for queues to cpusets that isn't necessarily
> a 1:1 mapping like you've done.  For example, saying "I want 16 queues,
> but I have four CPUs, here's how they're mapped."  Right now drivers do
> round-robin, but it's hard coded.  If we have an iteration API like
> what exists for RSS then we can hide the policy config in the bus code
> and have it check kenv/hints at boot time for what the config should
> be.
>
> So I'd like to have the API a little higher level so we can do
> interesting things with it.

There's nothing preventing the RSS code from calling bus_get_cpus()
internally to populate the info it returns in its APIs.  That is, I
imagine something like:

#ifdef RSS
	queue_info = fetch_rss_info(dev);
	for (queue in queue_info) {
		create queue for CPU queue->cpu
	}
#else
	/* Use bus_get_cpus() directly and do 1:1 */
#endif

That is, I think RSS should provide a layer on top of new-bus, not be a
bus_foo API.  At some point all drivers might only have the #ifdef RSS
case and not use bus_get_cpus() directly at all, but it doesn't seem
like the RSS API is quite there yet.

--
John Baldwin