Date:      Thu, 15 Jan 2026 11:31:57 +0000
From:      David Chisnall <theraven@FreeBSD.org>
To:        Olivier Certner <olce@freebsd.org>
Cc:        Minsoo Choo <minsoochoo0122@proton.me>, freebsd-hackers <freebsd-hackers@freebsd.org>
Subject:   Re: HMP scheduling on FreeBSD
Message-ID:  <A6BBCEE7-B233-4F91-BB4A-7D91A169F09E@FreeBSD.org>
In-Reply-To: <1886427.OVFmXjEfDW@ravel>
References:  <0Ng09S3rEB0BvT9vzHqVKU7rWxoad96kjEc7U2LCwDFJKmmswXujip7qbRlo_BIhNKcI7d-2CUHdp9Dxr3-7hhafpD6uagJSFUCjtC9qRr4=@proton.me> <1886427.OVFmXjEfDW@ravel>


On 14 Jan 2026, at 22:14, Olivier Certner <olce@freebsd.org> wrote:
> 
> These are good first observations, but they can only really apply in specific circumstances.  Converting a core's capacity into run-queue length can only drive a loaded system, not a mostly idle one.  This mechanism will also increase latency for threads running on the performant cores.

There are also some fun corner cases.  For example, the first generation big.LITTLE systems typically used Cortex A53 and A57 cores.  The A57 was *much* faster, but it had four-cycle access to the L1 cache, whereas the A53 had single-cycle access.  Workloads that fitted in L1 were faster on the A53.  So this can be a core x workload (or *phase of workload*) metric.  That said, treating it as a per-core metric is probably fine unless you want to hook in performance counters and do dynamic measurement.
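
As a rough illustration of what dynamic measurement could look like, here is a sketch of a per-thread phase check.  The read_phase_sample() helper and the threshold are made up for illustration and stand in for whatever hwpmc-style counter plumbing you would actually use:

/*
 * Illustrative only: decide whether the current phase of a thread is
 * L1-resident, in which case a big core's longer L1 latency (A57 vs
 * A53) may outweigh its wider pipeline.
 */
#include <stdbool.h>
#include <stdint.h>

struct phase_sample {
	uint64_t	instructions;	/* retired instructions since last sample */
	uint64_t	l1d_misses;	/* L1 data-cache misses since last sample */
};

/* Hypothetical: read this thread's counters accumulated since the last call. */
struct phase_sample read_phase_sample(void);

static bool
phase_prefers_little_core(void)
{
	struct phase_sample s = read_phase_sample();

	if (s.instructions == 0)
		return (false);

	/* Fewer than ~1 miss per 1000 instructions: treat as L1-resident. */
	return ((s.l1d_misses * 1000) / s.instructions < 1);
}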

> There are several theoretical considerations that should be met *together*, such as fairness, latency, bias to performance or to energy (policy), affinity, cpusets (directives), etc., and...

The hot-plug aspect is also important.  The best energy efficiency comes from turning the CPU off entirely.  Power-aware schedulers want a strategy for turning cores off in a way that minimises *total* system power consumption.  This is tricky for a few reasons:

 - There’s a tradeoff between running a workload for a long time on a slow core and running it for a shorter time on a fast core.  The heuristics that ULE collects to identify I/O-bound vs CPU-bound workloads are a starting point, but you also likely need to track the typical sleeping time.  If a workload sleeps for long enough that it’s worth turning a big core off (or putting it into a deep low-power state), that wants a very different scheduling policy to one that’s sitting using 5% of a core most of the time (see the break-even sketch after this list).
 - Some systems have the ability to shut off cores and their caches independently.  This has some interesting effects: snooping another core’s cache is usually faster than going out to main memory, so sleeping a core but not its cache may improve the performance of nearby cores (by a NUMA-dependent amount), while shutting down another core’s caches may reduce the performance of nearby ones.  (Note: this usually doesn’t apply to fully inclusive caches, but most CPU vendors have been moving away from those.)
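
To make the sleep-time point from the first bullet concrete, here is a minimal break-even sketch.  None of this is ULE code; the structure names, units, and the source of expected_sleep_us (e.g. an EWMA of recent voluntary sleep times, analogous to the history ULE already keeps) are invented for illustration:

/*
 * Hypothetical break-even check: powering a core down and back up costs
 * a fixed amount of energy, so it only pays off if the core would
 * otherwise sit idle for longer than the break-even time.
 */
#include <stdbool.h>
#include <stdint.h>

struct core_power_model {
	uint64_t	idle_power_uw;	/* power burned while idle, in uW */
	uint64_t	transition_uj;	/* energy to power off and back on, in uJ */
	uint64_t	transition_us;	/* latency of power off + on, in us */
};

static bool
should_power_down(const struct core_power_model *m, uint64_t expected_sleep_us)
{
	/* Time for idle power draw to exceed the transition energy (uJ/uW = s). */
	uint64_t break_even_us = (m->transition_uj * 1000000) / m->idle_power_uw;

	/* Never power down if we cannot even complete the transition in time. */
	if (break_even_us < m->transition_us)
		break_even_us = m->transition_us;

	return (expected_sleep_us > break_even_us);
}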

Apple did a couple of things to support this kind of tuning.  The first was to add a slack parameter to kqueue timeouts.  This let the scheduler coalesce wakeups.  For example, if you have a clock that’s waking once a second to update the seconds display, and a bunch of other things that mostly sleep and also wake roughly once per second, then it’s useful to align all of the other wakeups with the clock app’s so that you can turn on a high-performance core, wake everything up, and then go back to sleep.  This is useful even on homogeneous SMP systems and would be a really good *first* step for this kind of work.
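
For reference, Apple’s interface for this is the NOTE_LEEWAY flag on EVFILT_TIMER kevents, with the leeway carried in ext[1] of a struct kevent64_s.  A minimal macOS-only sketch (I’m assuming ext[1] is interpreted in the same units as the period; check sys/event.h before relying on it):

/*
 * macOS-specific: a 1 Hz timer that tells the kernel it may fire up to
 * 100 ms late, so its wakeup can be coalesced with other timers.
 * NOTE_LEEWAY and kevent64() are Apple extensions, not FreeBSD APIs.
 */
#include <sys/types.h>
#include <sys/event.h>
#include <sys/time.h>
#include <err.h>

int
main(void)
{
	struct kevent64_s kev, ev;
	int kq = kqueue();

	if (kq < 0)
		err(1, "kqueue");

	EV_SET64(&kev, 1, EVFILT_TIMER, EV_ADD | EV_ENABLE,
	    NOTE_USECONDS | NOTE_LEEWAY,
	    1000000,	/* period: one second, in microseconds */
	    0,		/* udata */
	    0,		/* ext[0]: unused here */
	    100000);	/* ext[1]: up to 100 ms of leeway */

	if (kevent64(kq, &kev, 1, NULL, 0, 0, NULL) < 0)
		err(1, "kevent64 register");

	for (;;) {
		if (kevent64(kq, NULL, 0, &ev, 1, 0, NULL) < 0)
			err(1, "kevent64 wait");
		/* once-per-second work, coalesced with other wakeups */
	}
}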

The second was to provide explicit hints to allow threads to indicate the kinds of cores that they want to run on.
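
The user-visible form of this is, as far as I know, the thread QoS class API.  A minimal sketch (macOS-specific; it is only a hint, and the kernel remains free to put the thread anywhere):

/*
 * macOS-specific: mark the calling thread as background work so the
 * scheduler prefers to keep it on efficiency cores.
 */
#include <pthread.h>
#include <pthread/qos.h>

static void *
background_worker(void *arg)
{
	(void)arg;

	/* A hint, not a binding: the thread can still be migrated. */
	pthread_set_qos_class_self_np(QOS_CLASS_BACKGROUND, 0);

	/* ... long-running, throughput-oriented work ... */
	return (NULL);
}

The same class can, I believe, also be set at thread-creation time via pthread_attr_set_qos_class_np().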

All of which is to say that I’m not sure that starting from ULE is necessarily a good strategy, since it wasn’t designed with any of these constraints in mind.

Oh, and Apple isn’t perfect.  Their scheduler currently has a bunch of issues with systems that distribute work across threads, where the overall performance depends on the throughput of the slowest one.  For longer-running threads, they’ll interleave P and E cores, so you need to do fairly fine-grained work stealing to use their scheduler efficiently.
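
A sketch of the kind of fine-grained division that copes with this: pull-based, chunked self-scheduling rather than static partitioning (a simpler cousin of full work stealing).  Nothing here is Apple API, just plain C11 atomics, and the chunk size is arbitrary:

/*
 * Instead of giving each worker 1/Nth of the items up front (which makes
 * the job as slow as whichever worker landed on an E core), workers pull
 * small chunks from a shared counter, so fast cores naturally take more.
 */
#include <stdatomic.h>
#include <stddef.h>

#define CHUNK	64

struct job {
	_Atomic size_t	next;	/* next unclaimed item */
	size_t		total;	/* total number of items */
};

static void
worker(struct job *job, void (*process)(size_t item))
{
	for (;;) {
		size_t start = atomic_fetch_add(&job->next, CHUNK);
		if (start >= job->total)
			break;
		size_t end = start + CHUNK;
		if (end > job->total)
			end = job->total;
		for (size_t i = start; i < end; i++)
			process(i);
	}
}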

David

