Date:      Wed, 18 Jun 2025 02:13:34 GMT
From:      Olivier Certner <olce@FreeBSD.org>
To:        src-committers@FreeBSD.org, dev-commits-src-all@FreeBSD.org, dev-commits-src-main@FreeBSD.org
Subject:   git: 6792f3411f6d - main - sched_ule: Recover previous nice and anti-starvation behaviors
Message-ID:  <202506180213.55I2DYiL024750@gitrepo.freebsd.org>

The branch main has been updated by olce:

URL: https://cgit.FreeBSD.org/src/commit/?id=6792f3411f6d99e1698589835adbf6b7b51c7c74

commit 6792f3411f6d99e1698589835adbf6b7b51c7c74
Author:     Olivier Certner <olce@FreeBSD.org>
AuthorDate: 2024-06-17 19:34:14 +0000
Commit:     Olivier Certner <olce@FreeBSD.org>
CommitDate: 2025-06-18 02:09:37 +0000

    sched_ule: Recover previous nice and anti-starvation behaviors
    
    The justification for this change is to avoid disturbing ULE's behavior
    too much at this time.  We acknowledge, however, that the effect of
    "nice" values is extremely weak, and we will most probably change it
    going forward.
    
    This tuning mostly recovers ULE's behavior prior to the switch to
    a single 256-queue runqueue and the widening of the timesharing
    priority levels' range.
    
    After this change, in a series of tests involving two long-running
    processes with varying nice values competing for the same CPU, we
    observe that the used-CPU-time ratios of the highest-priority process
    change by at most 1.15% and on average by 0.46% (absolute differences).
    In relative terms, they change by at most 2% and on average by 0.78%.
    
    In order to preserve these ratios, as the number of priority levels
    allotted to timesharing has been raised from 136 to 168 (with the
    subsets dedicated to either interactive or batch threads scaled
    accordingly), we keep the ratio of levels reserved for nice values to
    those reserved for CPU usage by applying a factor of 5/4 (which is
    close to 168/136).
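
    For illustration (assuming the stock PRIO_MIN/PRIO_MAX of -20/20 from
    <sys/resource.h>), the scaled definitions in the diff below evaluate
    to:

        SCHED_PRI_NRESV     = (20 - -20) * 5 / 4 = 50   (previously 40)
        SCHED_PRI_NICE(-20) = -20 * 5 / 4        = -25  (previously -20)
        SCHED_PRI_NICE(+20) =  20 * 5 / 4        =  25  (previously +20)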
    
    Time-based advance of the timesharing circular queue's head is ULE's
    main fairness and anti-starvation mechanism.  The higher number of
    queues subject to the timesharing scheduling policy is now compensated
    for by a greater increment of the head offset per tick.  There are now
    109 queue levels dedicated to the timesharing scheduling policy (in
    contrast with the 168 levels allotted to timesharing as a whole, which
    include the former but also those dedicated to threads considered
    interactive), whereas there previously were 64 (priorities spread over
    a single, separate runqueue), so we advance the circular queue's head
    7/4 times faster (a ratio close to 109/64).
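
    Concretely, the head is advanced by 2 levels on three ticks out of four
    and by 1 on the fourth, i.e., 2 + 2 + 2 + 1 = 7 levels every 4 ticks,
    or 7/4 per tick on average, versus 1 level per tick previously (and
    7/4 = 1.75 is close to 109/64 ~= 1.70).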
    
    While here, take 'cnt', the number of elapsed ticks, into account when
    advancing the circular queue's head.  This fix depends on the other
    code changes here, which enable incrementing the head by more than one.
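
    As a minimal standalone sketch of that head-advance arithmetic (with
    hypothetical names: 'head', 'ticks' and 'NLEVELS' stand in for
    tdq_ts_off, tdq_ts_ticks and RQ_TS_POL_MODULO, and the synchronization
    with the dequeue offset is omitted), the update in the diff below can
    be summarized as:

        static unsigned int head;       /* insertion offset (tdq_ts_off) */
        static unsigned int ticks;      /* leftover ticks (tdq_ts_ticks) */
        #define NLEVELS 109             /* timesharing policy queue levels */

        static void
        advance_head(int cnt)
        {
                /*
                 * Advance by 2 per tick, minus 1 per 4 accumulated ticks,
                 * i.e. 7/4 per tick on average, however 'cnt' batches them.
                 */
                ticks += cnt;
                head = (head + 2 * cnt - ticks / 4) % NLEVELS;
                ticks %= 4;
        }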
    
    MFC after:      1 month
    Event:          Kitchener-Waterloo Hackathon 202506
    Sponsored by:   The FreeBSD Foundation
    Differential Revision:  https://reviews.freebsd.org/D46566
---
 sys/kern/sched_ule.c | 76 +++++++++++++++++++++++++++++++++++++++-------------
 1 file changed, 58 insertions(+), 18 deletions(-)

diff --git a/sys/kern/sched_ule.c b/sys/kern/sched_ule.c
index cffeb97d1a59..087f62c2ee2f 100644
--- a/sys/kern/sched_ule.c
+++ b/sys/kern/sched_ule.c
@@ -137,20 +137,20 @@ _Static_assert(sizeof(struct thread) + sizeof(struct td_sched) <=
  * and end of the MIN to MAX timeshare range are only reachable with negative
  * or positive nice respectively.
  *
- * PRI_RANGE:	Priority range for utilization dependent priorities.
- * PRI_NRESV:	Number of nice values.
+ * PRI_NRESV:	Number of priority levels reserved to account for nice values.
+ *              5/4 is to preserve historical nice effect on computation ratios.
+ * PRI_RANGE:	Length of range for priorities computed from CPU use.
  * PRI_TICKS:	Compute a priority in PRI_RANGE from the ticks count and total.
  * PRI_NICE:	Determines the part of the priority inherited from nice.
  */
-#define	SCHED_PRI_NRESV		(PRIO_MAX - PRIO_MIN)
-#define	SCHED_PRI_NHALF		(SCHED_PRI_NRESV / 2)
-#define	SCHED_PRI_MIN		(PRI_MIN_BATCH + SCHED_PRI_NHALF)
-#define	SCHED_PRI_MAX		(PRI_MAX_BATCH - SCHED_PRI_NHALF)
+#define	SCHED_PRI_NRESV		((PRIO_MAX - PRIO_MIN) * 5 / 4)
+#define	SCHED_PRI_MIN		(PRI_MIN_BATCH + (SCHED_PRI_NRESV / 2))
+#define	SCHED_PRI_MAX		(PRI_MAX_BATCH - (SCHED_PRI_NRESV / 2))
 #define	SCHED_PRI_RANGE		(SCHED_PRI_MAX - SCHED_PRI_MIN + 1)
 #define	SCHED_PRI_TICKS(ts)						\
     (SCHED_TICK_HZ((ts)) /						\
     (roundup(SCHED_TICK_TOTAL((ts)), SCHED_PRI_RANGE) / SCHED_PRI_RANGE))
-#define	SCHED_PRI_NICE(nice)	(nice)
+#define	SCHED_PRI_NICE(nice)	((nice) * 5 / 4)
 
 /*
  * Runqueue indices for the implemented scheduling policies' priority bounds.
@@ -279,6 +279,11 @@ struct tdq {
 	u_char		tdq_owepreempt;	/* (f) Remote preemption pending. */
 	u_char		tdq_ts_off;	/* (t) TS insertion offset. */
 	u_char		tdq_ts_deq_off;	/* (t) TS dequeue offset. */
+	/*
+	 * (t) Number of (stathz) ticks since last offset incrementation
+	 * correction.
+	 */
+	u_char		tdq_ts_ticks;
 	int		tdq_id;		/* (c) cpuid. */
 	struct runq	tdq_runq;	/* (t) Run queue. */
 	char		tdq_name[TDQ_NAME_LEN];
@@ -362,6 +367,7 @@ static void tdq_setup(struct tdq *, int i);
 static void tdq_load_add(struct tdq *, struct thread *);
 static void tdq_load_rem(struct tdq *, struct thread *);
 static inline void tdq_runq_add(struct tdq *, struct thread *, int);
+static inline void tdq_advance_ts_deq_off(struct tdq *, bool);
 static inline void tdq_runq_rem(struct tdq *, struct thread *);
 static inline int sched_shouldpreempt(int, int, int);
 static void tdq_print(int cpu);
@@ -557,6 +563,33 @@ tdq_runq_add(struct tdq *tdq, struct thread *td, int flags)
 		runq_add(&tdq->tdq_runq, td, flags);
 }
 
+/*
+ * Advance the timesharing dequeue offset to the next non-empty queue or the
+ * insertion offset, whichever is closer.
+ *
+ * If 'deq_queue_known_empty' is true, then the queue where timesharing threads
+ * are currently removed for execution (pointed to by 'tdq_ts_deq_off') is
+ * assumed empty.  Otherwise, this condition is checked for.
+ */
+static inline void
+tdq_advance_ts_deq_off(struct tdq *tdq, bool deq_queue_known_empty)
+{
+	/*
+	 * We chose a simple iterative algorithm since the difference between
+	 * offsets is small in practice (see sched_clock()).
+	 */
+	while (tdq->tdq_ts_deq_off != tdq->tdq_ts_off) {
+		if (deq_queue_known_empty)
+			deq_queue_known_empty = false;
+		else if (!runq_is_queue_empty(&tdq->tdq_runq,
+		    tdq->tdq_ts_deq_off + RQ_TS_POL_MIN))
+			break;
+
+		tdq->tdq_ts_deq_off = (tdq->tdq_ts_deq_off + 1) %
+		    RQ_TS_POL_MODULO;
+	}
+}
+
 /*
  * Remove a thread from a run-queue.  This typically happens when a thread
  * is selected to run.  Running threads are not on the queue and the
@@ -579,14 +612,13 @@ tdq_runq_rem(struct tdq *tdq, struct thread *td)
 	/*
 	 * If thread has a batch priority and the queue from which it was
 	 * removed is now empty, advance the batch's queue removal index if it
-	 * lags with respect to the batch's queue insertion index.
+	 * lags with respect to the batch's queue insertion index, so that we
+	 * may eventually be able to advance the latter in sched_clock().
 	 */
-	if (queue_empty && PRI_MIN_BATCH <= td->td_priority &&
-	    td->td_priority <= PRI_MAX_BATCH &&
-	    tdq->tdq_ts_off != tdq->tdq_ts_deq_off &&
+	if (PRI_MIN_BATCH <= td->td_priority &&
+	    td->td_priority <= PRI_MAX_BATCH && queue_empty &&
 	    tdq->tdq_ts_deq_off + RQ_TS_POL_MIN == td->td_rqindex)
-		tdq->tdq_ts_deq_off = (tdq->tdq_ts_deq_off + 1) %
-		    RQ_TS_POL_MODULO;
+		tdq_advance_ts_deq_off(tdq, true);
 }
 
 /*
@@ -2664,13 +2696,21 @@ sched_clock(struct thread *td, int cnt)
 
 	/*
 	 * Advance the insert offset once for each tick to ensure that all
-	 * threads get a chance to run.
+	 * threads get a chance to run.  In order not to change too much ULE's
+	 * anti-starvation and "nice" behaviors after the switch to a single
+	 * 256-queue runqueue, since the queue insert offset is incremented by
+	 * 1 at every tick (provided the system is not too loaded) and there are
+	 * now 109 distinct levels for the timesharing selection policy instead
+	 * of 64 before (separate runqueue), we apply a factor 7/4 when
+	 * increasing the insert offset, by incrementing it by 2 instead of
+	 * 1 except for one in four ticks.
 	 */
 	if (tdq->tdq_ts_off == tdq->tdq_ts_deq_off) {
-		tdq->tdq_ts_off = (tdq->tdq_ts_off + 1) % RQ_TS_POL_MODULO;
-		if (runq_is_queue_empty(&tdq->tdq_runq,
-		    tdq->tdq_ts_deq_off + RQ_TS_POL_MIN))
-			tdq->tdq_ts_deq_off = tdq->tdq_ts_off;
+		tdq->tdq_ts_ticks += cnt;
+		tdq->tdq_ts_off = (tdq->tdq_ts_off + 2 * cnt -
+		    tdq-> tdq_ts_ticks / 4) % RQ_TS_POL_MODULO;
+		tdq->tdq_ts_ticks %= 4;
+		tdq_advance_ts_deq_off(tdq, false);
 	}
 	ts = td_get_sched(td);
 	sched_pctcpu_update(ts, 1);


