Date:      Wed, 18 Jun 2025 02:13:23 GMT
From:      Olivier Certner <olce@FreeBSD.org>
To:        src-committers@FreeBSD.org, dev-commits-src-all@FreeBSD.org, dev-commits-src-main@FreeBSD.org
Subject:   git: baecdea10eb5 - main - sched_ule: Use a single runqueue per CPU
Message-ID:  <202506180213.55I2DN1W024402@gitrepo.freebsd.org>

The branch main has been updated by olce:

URL: https://cgit.FreeBSD.org/src/commit/?id=baecdea10eb50ac20c2ed1fcd4171d3a609e90b5

commit baecdea10eb50ac20c2ed1fcd4171d3a609e90b5
Author:     Olivier Certner <olce@FreeBSD.org>
AuthorDate: 2024-05-14 10:45:20 +0000
Commit:     Olivier Certner <olce@FreeBSD.org>
CommitDate: 2025-06-18 02:08:01 +0000

    sched_ule: Use a single runqueue per CPU
    
    Previously, ULE would use 3 separate runqueues per CPU to store threads,
    one for each of its selection policies, which are realtime, timesharing
    and idle.  They would be examined in this order, and the first thread
    found would be the one selected.
    
    This choice indeed appears to be the easiest evolution from the single
    runqueue used by sched_4bsd (4BSD): It allows sharing most of the same
    runqueue code, which currently defines 64 levels per runqueue, while
    tripling the total number of levels.  However, it has several important
    drawbacks:
    
    1. The number of levels is the same for each selection policy.  64 is
    unnecessarily large for the idle policy (only 32 distinct levels would
    be necessary, given the 32 levels of our RTP_PRIO_IDLE and their future
    aliases in the to-be-introduced SCHED_IDLE POSIX scheduling policy) and
    unnecessarily restrictive both for the realtime policy (which should
    include 32 distinct levels for PRI_REALTIME, given our implementation
    of SCHED_RR/SCHED_FIFO, leaving at most 32 levels for ULE's interactive
    processes, whereas the current implementation provisions 48, perhaps to
    account for the spreading problem described in the next point) and the
    timesharing one (88 distinct levels currently provisioned).
    
    2. A runqueue has only 64 distinct levels, and maps priorities in the
    range [0;255] to a queue index simply by dividing them by 4.
    Priorities mapped to the same level are treated exactly the same from
    a scheduling perspective, which is generally both unexpected and
    incorrect.  ULE's code tries to compensate for this aliasing in the
    timesharing selection policy by spreading the 88 levels over 256, even
    though these 256 levels ultimately collapse back to only 64 distinct
    ones.  This scaling is unfortunately not performed for the other
    policies, breaking the expectations mentioned in the previous point
    about distinct priority levels.
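
    A minimal illustration of this aliasing (constants are spelled out for
    clarity and the macro name is illustrative, not taken verbatim from
    sys/sys/runq.h):

        /* 256 priorities are folded onto 64 queues by dividing by 4. */
        #define NQS              64              /* number of queues */
        #define PRI_TO_IDX(pri)  ((pri) / (256 / NQS))

        /*
         * PRI_TO_IDX(120) == PRI_TO_IDX(121) == PRI_TO_IDX(122) ==
         * PRI_TO_IDX(123) == 30, so these four priorities become
         * indistinguishable once their threads are enqueued.
         */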
    
    With this change, only a single runqueue is now used to store all
    threads, regardless of the scheduling policy ULE applies to them (going
    back to what 4BSD has always been doing).  ULE's 3 selection policies
    are assigned non-overlapping ranges of levels, and helper functions have
    been created to select or steal a thread in these distinct ranges,
    preserving the "circular" queue mechanism for the timesharing selection
    policy that (tries to) prevent starvation in the face of permanent
    dynamic priority adjustments.
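
    For reference, here is a simplified sketch of the insertion-index
    computation performed by tdq_runq_add() in the diff below (the helper
    name and its standalone form are illustrative, and the special case of
    the queue still being drained is omitted):

        /*
         * Map a batch priority to an absolute queue index inside the
         * timesharing range, rotated by the per-CPU insertion offset
         * 'ts_off' so that newly enqueued threads are placed relative to
         * the queue currently being drained.
         */
        static inline int
        ts_queue_idx(u_char pri, u_char ts_off)
        {
                int idx;

                idx = (RQ_PRI_TO_QUEUE_IDX(pri) - RQ_TS_POL_MIN + ts_off) %
                    RQ_TS_POL_MODULO;
                return (RQ_TS_POL_MIN + idx);
        }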
    
    This change makes it possible to choose any arbitrary partition of the
    runqueue levels between selection policies.  It is a prerequisite to
    the increase to 256 levels per runqueue, which will make it possible to
    dispense with all the drawbacks listed above.
    
    MFC after:      1 month
    Event:          Kitchener-Waterloo Hackathon 202506
    Sponsored by:   The FreeBSD Foundation
    Differential Revision:  https://reviews.freebsd.org/D45389
---
 sys/kern/kern_switch.c |  11 --
 sys/kern/sched_ule.c   | 305 ++++++++++++++++++++++++++++++-------------------
 sys/sys/runq.h         |   1 -
 3 files changed, 187 insertions(+), 130 deletions(-)

diff --git a/sys/kern/kern_switch.c b/sys/kern/kern_switch.c
index 9a05f3c90a80..264925ab954a 100644
--- a/sys/kern/kern_switch.c
+++ b/sys/kern/kern_switch.c
@@ -652,14 +652,3 @@ runq_choose_fuzz(struct runq *rq, int fuzz)
 	CTR0(KTR_RUNQ, "runq_choose_fuzz: idlethread");
 	return (NULL);
 }
-
-struct thread *
-runq_choose_from(struct runq *const rq, int from_idx)
-{
-	struct thread *td;
-
-	td = runq_first_thread_range(rq, from_idx, RQ_NQS - 1);
-	if (td != NULL || from_idx == 0)
-		return (td);
-	return (runq_first_thread_range(rq, 0, from_idx - 1));
-}
diff --git a/sys/kern/sched_ule.c b/sys/kern/sched_ule.c
index 1b780f192352..cffeb97d1a59 100644
--- a/sys/kern/sched_ule.c
+++ b/sys/kern/sched_ule.c
@@ -88,10 +88,9 @@ dtrace_vtime_switch_func_t	dtrace_vtime_switch_func;
  * Thread scheduler specific section.  All fields are protected
  * by the thread lock.
  */
-struct td_sched {	
-	struct runq	*ts_runq;	/* Run-queue we're queued on. */
+struct td_sched {
 	short		ts_flags;	/* TSF_* flags. */
-	int		ts_cpu;		/* CPU that we have affinity for. */
+	int		ts_cpu;		/* CPU we are on, or were last on. */
 	int		ts_rltick;	/* Real last tick, for affinity. */
 	int		ts_slice;	/* Ticks of slice remaining. */
 	u_int		ts_slptime;	/* Number of ticks we vol. slept */
@@ -131,23 +130,6 @@ _Static_assert(sizeof(struct thread) + sizeof(struct td_sched) <=
 #define	PRI_MIN_BATCH		(PRI_MIN_TIMESHARE + PRI_INTERACT_RANGE)
 #define	PRI_MAX_BATCH		PRI_MAX_TIMESHARE
 
-/*
- * Cpu percentage computation macros and defines.
- *
- * SCHED_TICK_SECS:	Number of seconds to average the cpu usage across.
- * SCHED_TICK_TARG:	Number of hz ticks to average the cpu usage across.
- * SCHED_TICK_MAX:	Maximum number of ticks before scaling back.
- * SCHED_TICK_SHIFT:	Shift factor to avoid rounding away results.
- * SCHED_TICK_HZ:	Compute the number of hz ticks for a given ticks count.
- * SCHED_TICK_TOTAL:	Gives the amount of time we've been recording ticks.
- */
-#define	SCHED_TICK_SECS		10
-#define	SCHED_TICK_TARG		(hz * SCHED_TICK_SECS)
-#define	SCHED_TICK_MAX		(SCHED_TICK_TARG + hz)
-#define	SCHED_TICK_SHIFT	10
-#define	SCHED_TICK_HZ(ts)	((ts)->ts_ticks >> SCHED_TICK_SHIFT)
-#define	SCHED_TICK_TOTAL(ts)	(max((ts)->ts_ltick - (ts)->ts_ftick, hz))
-
 /*
  * These macros determine priorities for non-interactive threads.  They are
  * assigned a priority based on their recent cpu utilization as expressed
@@ -170,6 +152,48 @@ _Static_assert(sizeof(struct thread) + sizeof(struct td_sched) <=
     (roundup(SCHED_TICK_TOTAL((ts)), SCHED_PRI_RANGE) / SCHED_PRI_RANGE))
 #define	SCHED_PRI_NICE(nice)	(nice)
 
+/*
+ * Runqueue indices for the implemented scheduling policies' priority bounds.
+ *
+ * In ULE's implementation, realtime policy covers the ITHD, REALTIME and
+ * INTERACT (see above) ranges, timesharing the BATCH range (see above), and
+ * idle policy the IDLE range.
+ *
+ * Priorities from these ranges must not be assigned to the same runqueue's
+ * queue.
+ */
+#define	RQ_RT_POL_MIN		(RQ_PRI_TO_QUEUE_IDX(PRI_MIN_ITHD))
+#define	RQ_RT_POL_MAX		(RQ_PRI_TO_QUEUE_IDX(PRI_MAX_INTERACT))
+#define	RQ_TS_POL_MIN		(RQ_PRI_TO_QUEUE_IDX(PRI_MIN_BATCH))
+#define	RQ_TS_POL_MAX		(RQ_PRI_TO_QUEUE_IDX(PRI_MAX_BATCH))
+#define	RQ_ID_POL_MIN		(RQ_PRI_TO_QUEUE_IDX(PRI_MIN_IDLE))
+#define	RQ_ID_POL_MAX		(RQ_PRI_TO_QUEUE_IDX(PRI_MAX_IDLE))
+
+_Static_assert(RQ_RT_POL_MAX != RQ_TS_POL_MIN,
+    "ULE's realtime and timeshare policies' runqueue ranges overlap");
+_Static_assert(RQ_TS_POL_MAX != RQ_ID_POL_MIN,
+    "ULE's timeshare and idle policies' runqueue ranges overlap");
+
+/* Helper to treat the timeshare range as a circular group of queues. */
+#define RQ_TS_POL_MODULO	(RQ_TS_POL_MAX - RQ_TS_POL_MIN + 1)
+
+/*
+ * Cpu percentage computation macros and defines.
+ *
+ * SCHED_TICK_SECS:	Number of seconds to average the cpu usage across.
+ * SCHED_TICK_TARG:	Number of hz ticks to average the cpu usage across.
+ * SCHED_TICK_MAX:	Maximum number of ticks before scaling back.
+ * SCHED_TICK_SHIFT:	Shift factor to avoid rounding away results.
+ * SCHED_TICK_HZ:	Compute the number of hz ticks for a given ticks count.
+ * SCHED_TICK_TOTAL:	Gives the amount of time we've been recording ticks.
+ */
+#define	SCHED_TICK_SECS		10
+#define	SCHED_TICK_TARG		(hz * SCHED_TICK_SECS)
+#define	SCHED_TICK_MAX		(SCHED_TICK_TARG + hz)
+#define	SCHED_TICK_SHIFT	10
+#define	SCHED_TICK_HZ(ts)	((ts)->ts_ticks >> SCHED_TICK_SHIFT)
+#define	SCHED_TICK_TOTAL(ts)	(max((ts)->ts_ltick - (ts)->ts_ftick, hz))
+
 /*
  * These determine the interactivity of a process.  Interactivity differs from
  * cpu utilization in that it expresses the voluntary time slept vs time ran
@@ -253,12 +277,10 @@ struct tdq {
 	short		tdq_oldswitchcnt; /* (l) Switches last tick. */
 	u_char		tdq_lowpri;	/* (ts) Lowest priority thread. */
 	u_char		tdq_owepreempt;	/* (f) Remote preemption pending. */
-	u_char		tdq_idx;	/* (t) Current insert index. */
-	u_char		tdq_ridx;	/* (t) Current removal index. */
+	u_char		tdq_ts_off;	/* (t) TS insertion offset. */
+	u_char		tdq_ts_deq_off;	/* (t) TS dequeue offset. */
 	int		tdq_id;		/* (c) cpuid. */
-	struct runq	tdq_realtime;	/* (t) real-time run queue. */
-	struct runq	tdq_timeshare;	/* (t) timeshare run queue. */
-	struct runq	tdq_idle;	/* (t) Queue of IDLE threads. */
+	struct runq	tdq_runq;	/* (t) Run queue. */
 	char		tdq_name[TDQ_NAME_LEN];
 #ifdef KTR
 	char		tdq_loadname[TDQ_LOADNAME_LEN];
@@ -330,12 +352,17 @@ static void sched_interact_fork(struct thread *);
 static void sched_pctcpu_update(struct td_sched *, int);
 
 /* Operations on per processor queues */
+static inline struct thread *runq_choose_realtime(struct runq *const rq);
+static inline struct thread *runq_choose_timeshare(struct runq *const rq,
+    int off);
+static inline struct thread *runq_choose_idle(struct runq *const rq);
 static struct thread *tdq_choose(struct tdq *);
+
 static void tdq_setup(struct tdq *, int i);
 static void tdq_load_add(struct tdq *, struct thread *);
 static void tdq_load_rem(struct tdq *, struct thread *);
-static __inline void tdq_runq_add(struct tdq *, struct thread *, int);
-static __inline void tdq_runq_rem(struct tdq *, struct thread *);
+static inline void tdq_runq_add(struct tdq *, struct thread *, int);
+static inline void tdq_runq_rem(struct tdq *, struct thread *);
 static inline int sched_shouldpreempt(int, int, int);
 static void tdq_print(int cpu);
 static void runq_print(struct runq *rq);
@@ -344,8 +371,19 @@ static int tdq_add(struct tdq *, struct thread *, int);
 static int tdq_move(struct tdq *, struct tdq *);
 static int tdq_idled(struct tdq *);
 static void tdq_notify(struct tdq *, int lowpri);
+
+static bool runq_steal_pred(const int idx, struct rq_queue *const q,
+    void *const data);
+static inline struct thread *runq_steal_range(struct runq *const rq,
+    const int lvl_min, const int lvl_max, int cpu);
+static inline struct thread *runq_steal_realtime(struct runq *const rq,
+    int cpu);
+static inline struct thread *runq_steal_timeshare(struct runq *const rq,
+    int cpu, int off);
+static inline struct thread *runq_steal_idle(struct runq *const rq,
+    int cpu);
 static struct thread *tdq_steal(struct tdq *, int);
-static struct thread *runq_steal(struct runq *, int);
+
 static int sched_pickcpu(struct thread *, int);
 static void sched_balance(void);
 static bool sched_balance_pair(struct tdq *, struct tdq *);
@@ -420,21 +458,17 @@ tdq_print(int cpu)
 	tdq = TDQ_CPU(cpu);
 
 	printf("tdq %d:\n", TDQ_ID(tdq));
-	printf("\tlock            %p\n", TDQ_LOCKPTR(tdq));
-	printf("\tLock name:      %s\n", tdq->tdq_name);
-	printf("\tload:           %d\n", tdq->tdq_load);
-	printf("\tswitch cnt:     %d\n", tdq->tdq_switchcnt);
-	printf("\told switch cnt: %d\n", tdq->tdq_oldswitchcnt);
-	printf("\ttimeshare idx:  %d\n", tdq->tdq_idx);
-	printf("\ttimeshare ridx: %d\n", tdq->tdq_ridx);
+	printf("\tlock               %p\n", TDQ_LOCKPTR(tdq));
+	printf("\tLock name:         %s\n", tdq->tdq_name);
+	printf("\tload:              %d\n", tdq->tdq_load);
+	printf("\tswitch cnt:        %d\n", tdq->tdq_switchcnt);
+	printf("\told switch cnt:    %d\n", tdq->tdq_oldswitchcnt);
+	printf("\tTS insert offset:  %d\n", tdq->tdq_ts_off);
+	printf("\tTS dequeue offset: %d\n", tdq->tdq_ts_deq_off);
 	printf("\tload transferable: %d\n", tdq->tdq_transferable);
 	printf("\tlowest priority:   %d\n", tdq->tdq_lowpri);
-	printf("\trealtime runq:\n");
-	runq_print(&tdq->tdq_realtime);
-	printf("\ttimeshare runq:\n");
-	runq_print(&tdq->tdq_timeshare);
-	printf("\tidle runq:\n");
-	runq_print(&tdq->tdq_idle);
+	printf("\trunq:\n");
+	runq_print(&tdq->tdq_runq);
 }
 
 static inline int
@@ -475,11 +509,11 @@ sched_shouldpreempt(int pri, int cpri, int remote)
  * date with what is actually on the run-queue.  Selects the correct
  * queue position for timeshare threads.
  */
-static __inline void
+static inline void
 tdq_runq_add(struct tdq *tdq, struct thread *td, int flags)
 {
 	struct td_sched *ts;
-	u_char pri;
+	u_char pri, idx;
 
 	TDQ_LOCK_ASSERT(tdq, MA_OWNED);
 	THREAD_LOCK_BLOCKED_ASSERT(td, MA_OWNED);
@@ -491,34 +525,36 @@ tdq_runq_add(struct tdq *tdq, struct thread *td, int flags)
 		tdq->tdq_transferable++;
 		ts->ts_flags |= TSF_XFERABLE;
 	}
-	if (pri < PRI_MIN_BATCH) {
-		ts->ts_runq = &tdq->tdq_realtime;
-	} else if (pri <= PRI_MAX_BATCH) {
-		ts->ts_runq = &tdq->tdq_timeshare;
-		KASSERT(pri <= PRI_MAX_BATCH && pri >= PRI_MIN_BATCH,
-			("Invalid priority %d on timeshare runq", pri));
+	if (PRI_MIN_BATCH <= pri && pri <= PRI_MAX_BATCH) {
 		/*
-		 * This queue contains only priorities between MIN and MAX
-		 * batch.  Use the whole queue to represent these values.
+		 * The queues allocated to the batch range are not used as
+		 * a simple array but as a "circular" one where the insertion
+		 * index (derived from 'pri') is offset by 'tdq_ts_off'. 'idx'
+		 * is first set to the offset of the wanted queue in the TS'
+		 * selection policy range.
 		 */
-		if ((flags & (SRQ_BORROWING|SRQ_PREEMPTED)) == 0) {
-			pri = RQ_NQS * (pri - PRI_MIN_BATCH) / PRI_BATCH_RANGE;
-			pri = (pri + tdq->tdq_idx) % RQ_NQS;
+		if ((flags & (SRQ_BORROWING|SRQ_PREEMPTED)) != 0)
+			/* Current queue from which processes are being run. */
+			idx = tdq->tdq_ts_deq_off;
+		else {
+			idx = (RQ_PRI_TO_QUEUE_IDX(pri) - RQ_TS_POL_MIN +
+			    tdq->tdq_ts_off) % RQ_TS_POL_MODULO;
 			/*
-			 * This effectively shortens the queue by one so we
-			 * can have a one slot difference between idx and
-			 * ridx while we wait for threads to drain.
+			 * We avoid enqueuing low priority threads in the queue
+			 * that we are still draining, effectively shortening
+			 * the runqueue by one queue.
 			 */
-			if (tdq->tdq_ridx != tdq->tdq_idx &&
-			    pri == tdq->tdq_ridx)
-				pri = (unsigned char)(pri - 1) % RQ_NQS;
-		} else
-			pri = tdq->tdq_ridx;
-		runq_add_idx(ts->ts_runq, td, pri, flags);
-		return;
+			if (tdq->tdq_ts_deq_off != tdq->tdq_ts_off &&
+			    idx == tdq->tdq_ts_deq_off)
+				/* Ensure the dividend is positive. */
+				idx = (idx - 1 + RQ_TS_POL_MODULO) %
+				    RQ_TS_POL_MODULO;
+		}
+		/* Absolute queue index. */
+		idx += RQ_TS_POL_MIN;
+		runq_add_idx(&tdq->tdq_runq, td, idx, flags);
 	} else
-		ts->ts_runq = &tdq->tdq_idle;
-	runq_add(ts->ts_runq, td, flags);
+		runq_add(&tdq->tdq_runq, td, flags);
 }
 
 /*
@@ -526,7 +562,7 @@ tdq_runq_add(struct tdq *tdq, struct thread *td, int flags)
  * is selected to run.  Running threads are not on the queue and the
  * transferable count does not reflect them.
  */
-static __inline void
+static inline void
 tdq_runq_rem(struct tdq *tdq, struct thread *td)
 {
 	struct td_sched *ts;
@@ -535,13 +571,11 @@ tdq_runq_rem(struct tdq *tdq, struct thread *td)
 	ts = td_get_sched(td);
 	TDQ_LOCK_ASSERT(tdq, MA_OWNED);
 	THREAD_LOCK_BLOCKED_ASSERT(td, MA_OWNED);
-	KASSERT(ts->ts_runq != NULL,
-	    ("tdq_runq_remove: thread %p null ts_runq", td));
 	if (ts->ts_flags & TSF_XFERABLE) {
 		tdq->tdq_transferable--;
 		ts->ts_flags &= ~TSF_XFERABLE;
 	}
-	queue_empty = runq_remove(ts->ts_runq, td);
+	queue_empty = runq_remove(&tdq->tdq_runq, td);
 	/*
 	 * If thread has a batch priority and the queue from which it was
 	 * removed is now empty, advance the batch's queue removal index if it
@@ -549,8 +583,10 @@ tdq_runq_rem(struct tdq *tdq, struct thread *td)
 	 */
 	if (queue_empty && PRI_MIN_BATCH <= td->td_priority &&
 	    td->td_priority <= PRI_MAX_BATCH &&
-	    tdq->tdq_idx != tdq->tdq_ridx && tdq->tdq_ridx == td->td_rqindex)
-		tdq->tdq_ridx = (tdq->tdq_ridx + 1) % RQ_NQS;
+	    tdq->tdq_ts_off != tdq->tdq_ts_deq_off &&
+	    tdq->tdq_ts_deq_off + RQ_TS_POL_MIN == td->td_rqindex)
+		tdq->tdq_ts_deq_off = (tdq->tdq_ts_deq_off + 1) %
+		    RQ_TS_POL_MODULO;
 }
 
 /*
@@ -1205,11 +1241,11 @@ runq_steal_pred(const int idx, struct rq_queue *const q, void *const data)
 }
 
 /*
- * Steals load from a timeshare queue.  Honors the rotating queue head
- * index.
+ * Steals load contained in queues with indices in the specified range.
  */
 static inline struct thread *
-runq_steal_from(struct runq *const rq, int cpu, int start_idx)
+runq_steal_range(struct runq *const rq, const int lvl_min, const int lvl_max,
+    int cpu)
 {
 	struct runq_steal_pred_data data = {
 		.td = NULL,
@@ -1217,46 +1253,50 @@ runq_steal_from(struct runq *const rq, int cpu, int start_idx)
 	};
 	int idx;
 
-	idx = runq_findq(rq, start_idx, RQ_NQS - 1, &runq_steal_pred, &data);
-	if (idx != -1)
-		goto found;
-
-	MPASS(data.td == NULL);
-	if (start_idx != 0) {
-		idx = runq_findq(rq, 0, start_idx - 1, &runq_steal_pred, &data);
-		if (idx != -1)
-			goto found;
+	idx = runq_findq(rq, lvl_min, lvl_max, &runq_steal_pred, &data);
+	if (idx != -1) {
+		MPASS(data.td != NULL);
+		return (data.td);
 	}
 
-	MPASS(idx == -1 && data.td == NULL);
+	MPASS(data.td == NULL);
 	return (NULL);
-found:
-	MPASS(data.td != NULL);
-	return (data.td);
+}
+
+static inline struct thread *
+runq_steal_realtime(struct runq *const rq, int cpu)
+{
+
+	return (runq_steal_range(rq, RQ_RT_POL_MIN, RQ_RT_POL_MAX, cpu));
 }
 
 /*
- * Steals load from a standard linear queue.
+ * Steals load from a timeshare queue.  Honors the rotating queue head
+ * index.
  */
-static struct thread *
-runq_steal(struct runq *rq, int cpu)
+static inline struct thread *
+runq_steal_timeshare(struct runq *const rq, int cpu, int off)
 {
-	struct runq_steal_pred_data data = {
-		.td = NULL,
-		.cpu = cpu,
-	};
-	int idx;
+	struct thread *td;
 
-	idx = runq_findq(rq, 0, RQ_NQS - 1, &runq_steal_pred, &data);
-	if (idx != -1) {
-		MPASS(data.td != NULL);
-		return (data.td);
-	}
+	MPASS(0 <= off && off < RQ_TS_POL_MODULO);
 
-	MPASS(data.td == NULL);
-	return (NULL);
+	td = runq_steal_range(rq, RQ_TS_POL_MIN + off, RQ_TS_POL_MAX, cpu);
+	if (td != NULL || off == 0)
+		return (td);
+
+	td = runq_steal_range(rq, RQ_TS_POL_MIN, RQ_TS_POL_MIN + off - 1, cpu);
+	return (td);
+}
+
+static inline struct thread *
+runq_steal_idle(struct runq *const rq, int cpu)
+{
+
+	return (runq_steal_range(rq, RQ_ID_POL_MIN, RQ_ID_POL_MAX, cpu));
 }
 
+
 /*
  * Attempt to steal a thread in priority order from a thread queue.
  */
@@ -1266,12 +1306,13 @@ tdq_steal(struct tdq *tdq, int cpu)
 	struct thread *td;
 
 	TDQ_LOCK_ASSERT(tdq, MA_OWNED);
-	if ((td = runq_steal(&tdq->tdq_realtime, cpu)) != NULL)
+	td = runq_steal_realtime(&tdq->tdq_runq, cpu);
+	if (td != NULL)
 		return (td);
-	if ((td = runq_steal_from(&tdq->tdq_timeshare,
-	    cpu, tdq->tdq_ridx)) != NULL)
+	td = runq_steal_timeshare(&tdq->tdq_runq, cpu, tdq->tdq_ts_deq_off);
+	if (td != NULL)
 		return (td);
-	return (runq_steal(&tdq->tdq_idle, cpu));
+	return (runq_steal_idle(&tdq->tdq_runq, cpu));
 }
 
 /*
@@ -1453,6 +1494,35 @@ llc:
 }
 #endif
 
+static inline struct thread *
+runq_choose_realtime(struct runq *const rq)
+{
+
+	return (runq_first_thread_range(rq, RQ_RT_POL_MIN, RQ_RT_POL_MAX));
+}
+
+static struct thread *
+runq_choose_timeshare(struct runq *const rq, int off)
+{
+	struct thread *td;
+
+	MPASS(0 <= off && off < RQ_TS_POL_MODULO);
+
+	td = runq_first_thread_range(rq, RQ_TS_POL_MIN + off, RQ_TS_POL_MAX);
+	if (td != NULL || off == 0)
+		return (td);
+
+	td = runq_first_thread_range(rq, RQ_TS_POL_MIN, RQ_TS_POL_MIN + off - 1);
+	return (td);
+}
+
+static inline struct thread *
+runq_choose_idle(struct runq *const rq)
+{
+
+	return (runq_first_thread_range(rq, RQ_ID_POL_MIN, RQ_ID_POL_MAX));
+}
+
 /*
  * Pick the highest priority task we have and return it.
  */
@@ -1462,17 +1532,17 @@ tdq_choose(struct tdq *tdq)
 	struct thread *td;
 
 	TDQ_LOCK_ASSERT(tdq, MA_OWNED);
-	td = runq_choose(&tdq->tdq_realtime);
+	td = runq_choose_realtime(&tdq->tdq_runq);
 	if (td != NULL)
 		return (td);
-	td = runq_choose_from(&tdq->tdq_timeshare, tdq->tdq_ridx);
+	td = runq_choose_timeshare(&tdq->tdq_runq, tdq->tdq_ts_deq_off);
 	if (td != NULL) {
 		KASSERT(td->td_priority >= PRI_MIN_BATCH,
 		    ("tdq_choose: Invalid priority on timeshare queue %d",
 		    td->td_priority));
 		return (td);
 	}
-	td = runq_choose(&tdq->tdq_idle);
+	td = runq_choose_idle(&tdq->tdq_runq);
 	if (td != NULL) {
 		KASSERT(td->td_priority >= PRI_MIN_IDLE,
 		    ("tdq_choose: Invalid priority on idle queue %d",
@@ -1492,9 +1562,7 @@ tdq_setup(struct tdq *tdq, int id)
 
 	if (bootverbose)
 		printf("ULE: setup cpu %d\n", id);
-	runq_init(&tdq->tdq_realtime);
-	runq_init(&tdq->tdq_timeshare);
-	runq_init(&tdq->tdq_idle);
+	runq_init(&tdq->tdq_runq);
 	tdq->tdq_id = id;
 	snprintf(tdq->tdq_name, sizeof(tdq->tdq_name),
 	    "sched lock %d", (int)TDQ_ID(tdq));
@@ -2595,13 +2663,14 @@ sched_clock(struct thread *td, int cnt)
 	tdq->tdq_switchcnt = tdq->tdq_load;
 
 	/*
-	 * Advance the insert index once for each tick to ensure that all
+	 * Advance the insert offset once for each tick to ensure that all
 	 * threads get a chance to run.
 	 */
-	if (tdq->tdq_idx == tdq->tdq_ridx) {
-		tdq->tdq_idx = (tdq->tdq_idx + 1) % RQ_NQS;
-		if (runq_is_queue_empty(&tdq->tdq_timeshare, tdq->tdq_ridx))
-			tdq->tdq_ridx = tdq->tdq_idx;
+	if (tdq->tdq_ts_off == tdq->tdq_ts_deq_off) {
+		tdq->tdq_ts_off = (tdq->tdq_ts_off + 1) % RQ_TS_POL_MODULO;
+		if (runq_is_queue_empty(&tdq->tdq_runq,
+		    tdq->tdq_ts_deq_off + RQ_TS_POL_MIN))
+			tdq->tdq_ts_deq_off = tdq->tdq_ts_off;
 	}
 	ts = td_get_sched(td);
 	sched_pctcpu_update(ts, 1);
diff --git a/sys/sys/runq.h b/sys/sys/runq.h
index 0a7f70fcfa16..5f415740099f 100644
--- a/sys/sys/runq.h
+++ b/sys/sys/runq.h
@@ -117,7 +117,6 @@ struct thread	*runq_first_thread_range(struct runq *const rq,
 bool		 runq_not_empty(struct runq *);
 struct thread	*runq_choose(struct runq *);
 struct thread	*runq_choose_fuzz(struct runq *, int _fuzz);
-struct thread	*runq_choose_from(struct runq *, int _idx);
 #endif /* _KERNEL */
 
 #endif


