Date:      Wed, 18 Jun 2025 02:13:35 GMT
From:      Olivier Certner <olce@FreeBSD.org>
To:        src-committers@FreeBSD.org, dev-commits-src-all@FreeBSD.org, dev-commits-src-main@FreeBSD.org
Subject:   git: a33225efb4bc - main - sched_ule: Sanitize CPU's use and priority computations, and ticks storage
Message-ID:  <202506180213.55I2DZDR024794@gitrepo.freebsd.org>

The branch main has been updated by olce:

URL: https://cgit.FreeBSD.org/src/commit/?id=a33225efb4bc2598e4caa1c1f7f715653e8b1fda

commit a33225efb4bc2598e4caa1c1f7f715653e8b1fda
Author:     Olivier Certner <olce@FreeBSD.org>
AuthorDate: 2024-06-25 16:12:24 +0000
Commit:     Olivier Certner <olce@FreeBSD.org>
CommitDate: 2025-06-18 02:09:38 +0000

    sched_ule: Sanitize CPU's use and priority computations, and ticks storage
    
    Computation of %CPU in sched_pctcpu() was overly complicated, wrong in
    the case of a non-maximal window (a 10-second span; this is always the
    case in practice, as the window oscillates between 10 and 11 seconds
    for continuously running processes) and performed unshifted for the
    first part, essentially losing precision (up to 9% for SCHED_TICK_SECS
    being 10), and with an ineffective shift for the second part.
    Preserve maximum precision by shifting only by the amount required to
    reach FSHIFT before dividing.  Apply classical rounding to nearest
    instead of rounding down.
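    
    A minimal standalone sketch of this conversion and rounding (C; the
    names, and the FSHIFT value of 11 matching stock FreeBSD, are
    illustrative, not the kernel code itself):
    
        #include <assert.h>
        
        #define FSHIFT     11   /* Fixed-point shift of the result. */
        #define TICK_SHIFT 10   /* Shift of the stored tick counts. */
        
        /*
         * Fixed-point fraction run/len in FSHIFT precision, rounded to
         * nearest.  'run' is in shifted ticks, 'len' in plain ticks.
         */
        static unsigned
        pct_fixed(unsigned run, unsigned len)
        {
            /* With FSHIFT >= TICK_SHIFT, reaching FSHIFT precision is a
             * left shift by the difference; no precision is lost. */
            unsigned scaled = run << (FSHIFT - TICK_SHIFT);
        
            /* Adding len / 2 before dividing rounds to nearest. */
            return ((scaled + len / 2) / len);
        }
        
        int
        main(void)
        {
            /* A thread that ran the whole window scores 1 << FSHIFT,
             * i.e., 100%. */
            assert(pct_fixed(11000u << TICK_SHIFT, 11000u) ==
                1u << FSHIFT);
            return (0);
        }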
    
    To generally avoid wraparound problems with the tick fields in 'struct
    td_sched' (as already happened once in sched_pctcpu_update()), make
    them all unsigned, and ensure 'ticks' is always converted to 'u_int'.
    While here, fix SCHED_AFFINITY().
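    
    A standalone sketch of the wraparound-safe comparison now used by
    SCHED_AFFINITY() (illustrative names, not the kernel macro itself):
    
        #include <assert.h>
        
        /* Did the thread last run less than 'window' ticks ago? */
        static int
        ran_recently(unsigned now, unsigned last, unsigned window)
        {
            /*
             * Unsigned subtraction is defined to wrap, so 'now - last'
             * is the elapsed tick count modulo 2^32, and the comparison
             * stays correct across a rollover of the tick counter.
             */
            return (now - last < window);
        }
        
        int
        main(void)
        {
            /* 'now' wrapped past 0; only 21 ticks actually elapsed. */
            assert(ran_recently(5u, 0xfffffff0u, 100u));
            return (0);
        }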
    
    Rewrite sched_pctcpu_update() while keeping the existing formulas:
    - Close the hole in the cliff case where, in theory, 'ts_ticks' could
      become greater than the window size if a running thread had not been
      accounted for in too long (today, this cannot happen because of
      sched_clock()).
    - Make the decay ratio explicit and configurable (SCHED_CPU_DECAY_NUMER,
      SCHED_CPU_DECAY_DENOM).  Set it to the current value (10/11), which
      produces a 95% attenuation after ~32s (cross-checked in the sketch
      following this list).  This makes experimenting with other values
      easy.  Apply the ratio on shifted ticks for better precision,
      independently of the chosen value for SCHED_TICK_MAX/SCHED_TICK_SECS.
    - Remove redundant SCHED_TICK_TARG.  Compute SCHED_TICK_MAX from
      SCHED_TICK_SECS, the latter now really specifying the maximum size of
      the %CPU estimation window.
    - Ensure it is immune to a varying 'hz' (which today can't happen), so
      that after the update SCHED_TICK_RUN_SHIFTED(ts), once unshifted, is
      mathematically guaranteed to be lower than SCHED_TICK_LENGTH(ts).
    - Thoroughly explain the current formula, and mention its main drawback
      (it is completely dependent on the frequency of calls to
      sched_pctcpu_update(), a dependency that currently manifests itself
      for sleeping threads).
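    
    The ~32s attenuation figure can be cross-checked with a trivial
    standalone computation (not kernel code): with a decay ratio of 10/11
    applied every 1s period, the residual weight of old ticks first drops
    below 5% after 32 periods.
    
        #include <stdio.h>
        
        int
        main(void)
        {
            double w = 1.0;     /* Residual weight of old ticks. */
            int n = 0;          /* Number of 1s decay periods. */
        
            while (w >= 0.05) {
                w *= 10.0 / 11.0;   /* DECAY_NUMER / DECAY_DENOM */
                n++;
            }
            /* Prints: 95% attenuation after 32 periods. */
            printf("95%% attenuation after %d periods\n", n);
            return (0);
        }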
    
    Rework sched_priority():
    - Ensure 'p_nice' is read only once, to be immune to a concurrent
      change.
    - Clearly show that the computed priority is the sum of 3 components.
      Make them all positive by offsetting the starting priority and by
      offsetting the nice value in SCHED_PRI_NICE().
    - Compute the priority offset derived from the %CPU with rounding to
      nearest (a worked example follows this list).
    - Produce much more informative KASSERT() output, with details about
      the priority computation.
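    
    As a worked instance of the new sum (with the stock PRIO_MIN of -20):
    a thread at nice 0 gets a nice offset of SCHED_PRI_NICE(0) =
    (0 - (-20)) * 5 / 4 = 25 levels; one that ran for the whole window
    (run == len << SCHED_TICK_SHIFT) gets the maximum %CPU offset of
    SCHED_PRI_CPU_RANGE - 1; the resulting priority is then
    PRI_MIN_BATCH + (SCHED_PRI_CPU_RANGE - 1) + 25.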
    
    MFC after:      1 month
    Event:          Kitchener-Waterloo Hackathon 202506
    Sponsored by:   The FreeBSD Foundation
    Differential Revision:  https://reviews.freebsd.org/D46567
---
 sys/kern/sched_ule.c | 192 +++++++++++++++++++++++++++++++--------------------
 1 file changed, 116 insertions(+), 76 deletions(-)

diff --git a/sys/kern/sched_ule.c b/sys/kern/sched_ule.c
index 087f62c2ee2f..0002424fa547 100644
--- a/sys/kern/sched_ule.c
+++ b/sys/kern/sched_ule.c
@@ -91,13 +91,14 @@ dtrace_vtime_switch_func_t	dtrace_vtime_switch_func;
 struct td_sched {
 	short		ts_flags;	/* TSF_* flags. */
 	int		ts_cpu;		/* CPU we are on, or were last on. */
-	int		ts_rltick;	/* Real last tick, for affinity. */
-	int		ts_slice;	/* Ticks of slice remaining. */
+	u_int		ts_rltick;	/* Real last tick, for affinity. */
+	u_int		ts_slice;	/* Ticks of slice remaining. */
+	u_int		ts_ftick;	/* %CPU window's first tick. */
+	u_int		ts_ltick;	/* %CPU window's last tick. */
+	/* All tick counts below are stored shifted by SCHED_TICK_SHIFT. */
 	u_int		ts_slptime;	/* Number of ticks we vol. slept */
 	u_int		ts_runtime;	/* Number of ticks we were running */
-	int		ts_ltick;	/* Last tick that we were running on */
-	int		ts_ftick;	/* First tick that we were running on */
-	int		ts_ticks;	/* Tick count */
+	u_int		ts_ticks;	/* pctcpu window's running tick count. */
 #ifdef KTR
 	char		ts_name[TS_NAME_LEN];
 #endif
@@ -137,20 +138,14 @@ _Static_assert(sizeof(struct thread) + sizeof(struct td_sched) <=
  * and end of the MIN to MAX timeshare range are only reachable with negative
  * or positive nice respectively.
  *
- * PRI_NRESV:	Number of priority levels reserved to account for nice values.
+ * CPU_RANGE:	Length of range for priorities computed from CPU use.
+ * NICE:	Priority offset due to the nice value.
  *              5/4 is to preserve historical nice effect on computation ratios.
- * PRI_RANGE:	Length of range for priorities computed from CPU use.
- * PRI_TICKS:	Compute a priority in PRI_RANGE from the ticks count and total.
- * PRI_NICE:	Determines the part of the priority inherited from nice.
+ * NRESV:	Number of priority levels reserved to account for nice values.
  */
-#define	SCHED_PRI_NRESV		((PRIO_MAX - PRIO_MIN) * 5 / 4)
-#define	SCHED_PRI_MIN		(PRI_MIN_BATCH + (SCHED_PRI_NRESV / 2))
-#define	SCHED_PRI_MAX		(PRI_MAX_BATCH - (SCHED_PRI_NRESV / 2))
-#define	SCHED_PRI_RANGE		(SCHED_PRI_MAX - SCHED_PRI_MIN + 1)
-#define	SCHED_PRI_TICKS(ts)						\
-    (SCHED_TICK_HZ((ts)) /						\
-    (roundup(SCHED_TICK_TOTAL((ts)), SCHED_PRI_RANGE) / SCHED_PRI_RANGE))
-#define	SCHED_PRI_NICE(nice)	((nice) * 5 / 4)
+#define	SCHED_PRI_CPU_RANGE	(PRI_BATCH_RANGE - SCHED_PRI_NRESV)
+#define	SCHED_PRI_NICE(nice)	(((nice) - PRIO_MIN) * 5 / 4)
+#define	SCHED_PRI_NRESV		SCHED_PRI_NICE(PRIO_MAX)
 
 /*
  * Runqueue indices for the implemented scheduling policies' priority bounds.
@@ -180,19 +175,26 @@ _Static_assert(RQ_TS_POL_MAX != RQ_ID_POL_MIN,
 /*
  * Cpu percentage computation macros and defines.
  *
- * SCHED_TICK_SECS:	Number of seconds to average the cpu usage across.
- * SCHED_TICK_TARG:	Number of hz ticks to average the cpu usage across.
- * SCHED_TICK_MAX:	Maximum number of ticks before scaling back.
+ * SCHED_TICK_SECS:	Max number of seconds to average the cpu usage across.
+ *   Must be at most 20 to avoid overflow in sched_pctcpu()'s current formula.
+ * SCHED_TICK_MAX:	Max number of hz ticks matching SCHED_TICK_SECS.
  * SCHED_TICK_SHIFT:	Shift factor to avoid rounding away results.
- * SCHED_TICK_HZ:	Compute the number of hz ticks for a given ticks count.
- * SCHED_TICK_TOTAL:	Gives the amount of time we've been recording ticks.
- */
-#define	SCHED_TICK_SECS		10
-#define	SCHED_TICK_TARG		(hz * SCHED_TICK_SECS)
-#define	SCHED_TICK_MAX		(SCHED_TICK_TARG + hz)
-#define	SCHED_TICK_SHIFT	10
-#define	SCHED_TICK_HZ(ts)	((ts)->ts_ticks >> SCHED_TICK_SHIFT)
-#define	SCHED_TICK_TOTAL(ts)	(max((ts)->ts_ltick - (ts)->ts_ftick, hz))
+ * SCHED_TICK_RUN_SHIFTED: Number of shifted ticks running in last window.
+ * SCHED_TICK_LENGTH:	Length of last window in unshifted ticks, or 1 if empty.
+ * SCHED_CPU_DECAY_NUMER: Numerator of %CPU decay factor.
+ * SCHED_CPU_DECAY_DENOM: Denominator of %CPU decay factor.
+ */
+#define	SCHED_TICK_SECS			11
+#define	SCHED_TICK_MAX(hz)		((hz) * SCHED_TICK_SECS)
+#define	SCHED_TICK_SHIFT		10
+#define	SCHED_TICK_RUN_SHIFTED(ts)	((ts)->ts_ticks)
+#define	SCHED_TICK_LENGTH(ts)		(max((ts)->ts_ltick - (ts)->ts_ftick, 1))
+#define	SCHED_CPU_DECAY_NUMER		10
+#define	SCHED_CPU_DECAY_DENOM		11
+_Static_assert(SCHED_CPU_DECAY_NUMER >= 0 && SCHED_CPU_DECAY_DENOM > 0 &&
+    SCHED_CPU_DECAY_NUMER <= SCHED_CPU_DECAY_DENOM,
+    "Inconsistent values for SCHED_CPU_DECAY_NUMER and/or "
+    "SCHED_CPU_DECAY_DENOM");
 
 /*
  * These determine the interactivity of a process.  Interactivity differs from
@@ -308,7 +310,11 @@ struct tdq {
 struct cpu_group __read_mostly *cpu_top;		/* CPU topology */
 
 #define	SCHED_AFFINITY_DEFAULT	(max(1, hz / 1000))
-#define	SCHED_AFFINITY(ts, t)	((ts)->ts_rltick > ticks - ((t) * affinity))
+/*
+ * This inequality has to be written with a positive difference of ticks to
+ * correctly handle wraparound.
+ */
+#define	SCHED_AFFINITY(ts, t)	((u_int)ticks - (ts)->ts_rltick < (t) * affinity)
 
 /*
  * Run-time tunables.
@@ -665,7 +671,7 @@ tdq_load_rem(struct tdq *tdq, struct thread *td)
  * aside from curthread and target no more than sched_slice latency but
  * no less than sched_slice_min runtime.
  */
-static inline int
+static inline u_int
 tdq_slice(struct tdq *tdq)
 {
 	int load;
@@ -1757,9 +1763,12 @@ static void
 sched_priority(struct thread *td)
 {
 	u_int pri, score;
+	int nice;
 
 	if (PRI_BASE(td->td_pri_class) != PRI_TIMESHARE)
 		return;
+
+	nice = td->td_proc->p_nice;
 	/*
 	 * If the score is interactive we place the thread in the realtime
 	 * queue with a priority that is less than kernel and interrupt
@@ -1773,7 +1782,7 @@ sched_priority(struct thread *td)
 	 * score.  Negative nice values make it easier for a thread to be
 	 * considered interactive.
 	 */
-	score = imax(0, sched_interact_score(td) + td->td_proc->p_nice);
+	score = imax(0, sched_interact_score(td) + nice);
 	if (score < sched_interact) {
 		pri = PRI_MIN_INTERACT;
 		pri += (PRI_MAX_INTERACT - PRI_MIN_INTERACT + 1) * score /
@@ -1782,17 +1791,26 @@ sched_priority(struct thread *td)
 		    ("sched_priority: invalid interactive priority %u score %u",
 		    pri, score));
 	} else {
-		pri = SCHED_PRI_MIN;
-		if (td_get_sched(td)->ts_ticks)
-			pri += min(SCHED_PRI_TICKS(td_get_sched(td)),
-			    SCHED_PRI_RANGE - 1);
-		pri += SCHED_PRI_NICE(td->td_proc->p_nice);
+		const struct td_sched *const ts = td_get_sched(td);
+		const u_int run = SCHED_TICK_RUN_SHIFTED(ts);
+		const u_int run_unshifted __unused = (run +
+		    (1 << SCHED_TICK_SHIFT) / 2) >> SCHED_TICK_SHIFT;
+		const u_int len = SCHED_TICK_LENGTH(ts);
+		const u_int nice_pri_off = SCHED_PRI_NICE(nice);
+		const u_int cpu_pri_off = (((SCHED_PRI_CPU_RANGE - 1) *
+		    run + len / 2) / len + (1 << SCHED_TICK_SHIFT) / 2) >>
+		    SCHED_TICK_SHIFT;
+
+		MPASS(cpu_pri_off < SCHED_PRI_CPU_RANGE);
+		pri = PRI_MIN_BATCH + cpu_pri_off + nice_pri_off;
 		KASSERT(pri >= PRI_MIN_BATCH && pri <= PRI_MAX_BATCH,
-		    ("sched_priority: invalid priority %u: nice %d, "
-		    "ticks %d ftick %d ltick %d tick pri %d",
-		    pri, td->td_proc->p_nice, td_get_sched(td)->ts_ticks,
-		    td_get_sched(td)->ts_ftick, td_get_sched(td)->ts_ltick,
-		    SCHED_PRI_TICKS(td_get_sched(td))));
+		    ("sched_priority: Invalid computed priority %u: "
+		    "Should be between %u and %u (PRI_MIN_BATCH: %u; "
+		    "Window size (ticks): %u, runtime (shifted ticks): %u, "
+		    "(unshifted ticks): %u => CPU pri off: %u; "
+		    "Nice: %d => nice pri off: %u)",
+		    pri, PRI_MIN_BATCH, PRI_MAX_BATCH, PRI_MIN_BATCH,
+		    len, run, run_unshifted, cpu_pri_off, nice, nice_pri_off));
 	}
 	sched_user_prio(td, pri);
 
@@ -1877,8 +1895,8 @@ schedinit(void)
 	 * Set up the scheduler specific parts of thread0.
 	 */
 	ts0 = td_get_sched(&thread0);
-	ts0->ts_ltick = ticks;
-	ts0->ts_ftick = ticks;
+	ts0->ts_ftick = (u_int)ticks;
+	ts0->ts_ltick = ts0->ts_ftick;
 	ts0->ts_slice = 0;
 	ts0->ts_cpu = curcpu;	/* set valid CPU number */
 }
@@ -1917,30 +1935,56 @@ sched_rr_interval(void)
 }
 
 /*
- * Update the percent cpu tracking information when it is requested or
- * the total history exceeds the maximum.  We keep a sliding history of
- * tick counts that slowly decays.  This is less precise than the 4BSD
- * mechanism since it happens with less regular and frequent events.
+ * Update the percent cpu tracking information when it is requested or the total
+ * history exceeds the maximum.  We keep a sliding history of tick counts that
+ * slowly decays, for running threads (see comments below for more details).
+ * This is less precise than the 4BSD mechanism since it happens with less
+ * regular and frequent events.
  */
 static void
 sched_pctcpu_update(struct td_sched *ts, int run)
 {
-	int t = ticks;
+	const u_int t = (u_int)ticks;
+	u_int t_max = SCHED_TICK_MAX((u_int)hz);
+	u_int t_tgt = ((t_max << SCHED_TICK_SHIFT) * SCHED_CPU_DECAY_NUMER /
+	    SCHED_CPU_DECAY_DENOM) >> SCHED_TICK_SHIFT;
+	const u_int lu_span = t - ts->ts_ltick;
 
-	/*
-	 * The signed difference may be negative if the thread hasn't run for
-	 * over half of the ticks rollover period.
-	 */
-	if ((u_int)(t - ts->ts_ltick) >= SCHED_TICK_TARG) {
-		ts->ts_ticks = 0;
-		ts->ts_ftick = t - SCHED_TICK_TARG;
-	} else if (t - ts->ts_ftick >= SCHED_TICK_MAX) {
-		ts->ts_ticks = (ts->ts_ticks / (ts->ts_ltick - ts->ts_ftick)) *
-		    (ts->ts_ltick - (t - SCHED_TICK_TARG));
-		ts->ts_ftick = t - SCHED_TICK_TARG;
+	if (lu_span >= t_tgt) {
+		/*
+		 * Forget all previous ticks if we are more than t_tgt
+		 * (currently, 10s) apart from the last update.  Don't account
+		 * for more than 't_tgt' ticks when running.
+		 */
+		ts->ts_ticks = run ? (t_tgt << SCHED_TICK_SHIFT) : 0;
+		ts->ts_ftick = t - t_tgt;
+		ts->ts_ltick = t;
+		return;
+	}
+
+	if (t - ts->ts_ftick >= t_max) {
+		/*
+		 * First reduce the existing ticks to proportionally occupy only
+		 * what's left of the target window given 'lu_span' will occupy
+		 * the rest.  Since sched_clock() is called frequently on
+		 * running threads, these threads have a small 'lu_span', and
+		 * the next formula basically becomes an exponential decay with
+		 * ratio r = SCHED_CPU_DECAY_NUMER / SCHED_CPU_DECAY_DENOM
+		 * (currently, 10/11) and period 1s.  However, a sleeping thread
+		 * will see its accounted ticks drop linearly with a high slope
+		 * with respect to 'lu_span', approaching 0 as 'lu_span'
+		 * approaches 't_tgt' (so, continuously with respect to the
+		 * previous case).  This rescaling is completely dependent on
+		 * the frequency of calls and the span since last update passed
+		 * at each call.
+		 */
+		ts->ts_ticks = SCHED_TICK_RUN_SHIFTED(ts) /
+		    SCHED_TICK_LENGTH(ts) * (t_tgt - lu_span);
+		ts->ts_ftick = t - t_tgt;
 	}
+
 	if (run)
-		ts->ts_ticks += (t - ts->ts_ltick) << SCHED_TICK_SHIFT;
+		ts->ts_ticks += lu_span << SCHED_TICK_SHIFT;
 	ts->ts_ltick = t;
 }
 
@@ -2300,9 +2344,9 @@ sched_switch(struct thread *td, int flags)
 #ifdef SMP
 	pickcpu = (td->td_flags & TDF_PICKCPU) != 0;
 	if (pickcpu)
-		ts->ts_rltick = ticks - affinity * MAX_CACHE_LEVELS;
+		ts->ts_rltick = (u_int)ticks - affinity * MAX_CACHE_LEVELS;
 	else
-		ts->ts_rltick = ticks;
+		ts->ts_rltick = (u_int)ticks;
 #endif
 	td->td_lastcpu = td->td_oncpu;
 	preempted = (td->td_flags & TDF_SLICEEND) == 0 &&
@@ -2655,7 +2699,7 @@ SCHED_STAT_DEFINE(ithread_preemptions,
  * Return time slice for a given thread.  For ithreads this is
  * sched_slice.  For other threads it is tdq_slice(tdq).
  */
-static inline int
+static inline u_int
 td_slice(struct thread *td, struct tdq *tdq)
 {
 	if (PRI_BASE(td->td_pri_class) == PRI_ITHD)
@@ -2940,22 +2984,18 @@ sched_rem(struct thread *td)
 fixpt_t
 sched_pctcpu(struct thread *td)
 {
-	fixpt_t pctcpu;
 	struct td_sched *ts;
-
-	pctcpu = 0;
-	ts = td_get_sched(td);
+	u_int len;
+	fixpt_t pctcpu;
 
 	THREAD_LOCK_ASSERT(td, MA_OWNED);
+	ts = td_get_sched(td);
 	sched_pctcpu_update(ts, TD_IS_RUNNING(td));
-	if (ts->ts_ticks) {
-		int rtick;
-
-		/* How many rtick per second ? */
-		rtick = min(SCHED_TICK_HZ(ts) / SCHED_TICK_SECS, hz);
-		pctcpu = (FSCALE * ((FSCALE * rtick)/hz)) >> FSHIFT;
-	}
-
+	len = SCHED_TICK_LENGTH(ts);
+	pctcpu = ((FSHIFT >= SCHED_TICK_SHIFT ? /* Resolved at compile-time. */
+	    (SCHED_TICK_RUN_SHIFTED(ts) << (FSHIFT - SCHED_TICK_SHIFT)) :
+	    (SCHED_TICK_RUN_SHIFTED(ts) >> (SCHED_TICK_SHIFT - FSHIFT))) +
+	    len / 2) / len;
 	return (pctcpu);
 }
 


