Date: Thu, 18 Jul 2002 16:31:57 -0700
From: Luigi Rizzo <rizzo@icir.org>
To: arch@FreeBSD.ORG
Subject: NEW SCHEDULER available (was Re: proposed changes to kern_switch.c and kern_synch.c)
Message-ID: <20020718163157.A27017@iguana.icir.org>
In-Reply-To: <20020716235216.B6785@iguana.icir.org>; from rizzo@icir.org on Tue, Jul 16, 2002 at 11:52:16PM -0700
References: <20020716235216.B6785@iguana.icir.org>
[crossposted to -stable as relevant there]
Hi,
as promised, a first version of the Proportional Share scheduler
that we developed is available at
http://info.iet.unipi.it/~luigi/ps_sched.20020719a.diff
The patch is for a recent -STABLE (I think any version from 4.4
should work); the only three files modified are kern_synch.c,
kern_switch.c and proc.h, plus a one-line change to kern_exit.c.
I have tested it a little on a diskless system, and it seems
to survive running a full X session with the usual set of xterm,
netscape etc. while I renice processes and even switch back and
forth between schedulers. But do not trust it yet on a
production system!
The sysctl variable kern.scheduler controls the scheduler in use:
kern.scheduler=1 (default) selects the Proportional Share scheduler;
kern.scheduler=0 selects the Feedback Priority ("classic BSD") one.
You can switch between the two schedulers at any time by changing
the value of the sysctl variable.
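If you want to flip the scheduler programmatically, the standard
sysctl(3) interface works. Here is a minimal, untested sketch (the
variable is a plain int, as described above; setting it needs root):

#include <sys/types.h>
#include <sys/sysctl.h>
#include <stdio.h>
#include <stdlib.h>

int
main(int argc, char *argv[])
{
    int cur, new;
    size_t len = sizeof(cur);

    /* read the currently active scheduler */
    if (sysctlbyname("kern.scheduler", &cur, &len, NULL, 0) == -1) {
        perror("sysctlbyname");
        exit(1);
    }
    printf("scheduler: %s\n",
        cur == 1 ? "Proportional Share" : "Feedback Priority");
    /* optionally switch, e.g. "./a.out 0" to go back to BSD */
    if (argc > 1) {
        new = atoi(argv[1]);
        if (sysctlbyname("kern.scheduler", NULL, NULL,
            &new, sizeof(new)) == -1) {
            perror("sysctlbyname(set)");
            exit(1);
        }
    }
    return (0);
}

From the shell, "sysctl -w kern.scheduler=0" does the same thing.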
To change the "weight" associated to a process, use the "nice" and
"renice" commands. As usual, positive values (+1..+20) mean the
process will get a smaller share of the CPU, negative values (-1..-20)
will get a larger share. The actual weights (which control the
relative fraction of CPU cycles given to each process) are assigned
through a lookup table which maps the value to the range
1 ... 100 ... 1000 (min, normal, max weight).
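To give an idea of the shape of such a mapping, here is an
illustrative function (this is NOT the actual table from the patch;
the patch uses a precomputed integer lookup table, since there is
no floating point in the kernel, and its exact values may differ):

#include <math.h>

/*
 * Hypothetical nice-to-weight mapping, for illustration only.
 * Geometric interpolation: nice -20 -> 1000, 0 -> 100, +20 -> 1.
 * Compile with -lm.
 */
static int
nice_to_weight(int nice)        /* nice in -20..+20 */
{
    double w;

    if (nice < 0)               /* -20..-1: scale up towards 1000 */
        w = 100.0 * pow(10.0, -nice / 20.0);
    else                        /* 0..+20: scale down towards 1 */
        w = 100.0 * pow(10.0, -nice / 10.0);
    return ((int)(w + 0.5));
}

A process at nice -20 would thus get roughly 10x the CPU share of a
nice 0 process, everything else being equal.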
The "old" scheduler (Feedback Priority) should be as robust and
stable as always, whereas there are still a few things to cleanup
in the Proportional Share scheduler, namely:
* I have not looked in detail at the SMP case, so do not run
it on an SMP kernel;
* given that there are no priorities, processes woken up by a
tsleep() are not guaranteed to run before all processes
with p_priority >= PUSER; however, they are given a shorter
deadline, so they are likely to run first (see the sketch
after this list);
* RTPRI and IDLEPRI processes do not yet get any special treatment
(they all end up together with normal processes; this is trivial
to fix, I just haven't had time to look at it).
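To clarify the second point above, the idea is roughly as follows
(a generic proportional-share sketch, not the actual code from the
patch):

#include <sys/types.h>

/*
 * Illustrative only: each runnable process carries a virtual
 * "deadline" inversely proportional to its weight, and the
 * scheduler always picks the earliest deadline. The global
 * virtual clock advances as processes consume CPU.
 */
struct ps_data {
    u_int64_t   deadline;       /* virtual finish time */
    int         weight;         /* 1..1000, from nice */
};

static u_int64_t virtual_time;  /* global virtual clock */

static void
ps_enqueue(struct ps_data *sd, int just_woken)
{
    u_int64_t quantum = 1 << 20;    /* service per round, v.t. units */

    /*
     * A freshly woken process gets a shorter deadline than one
     * that consumed its full quantum, so it usually (but not
     * provably always) runs before CPU-bound peers.
     */
    sd->deadline = virtual_time +
        (just_woken ? quantum / 2 : quantum) / sd->weight;
}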
Have fun, and if you use it please let me know of any bugs and,
possibly, send suggestions for adapting it to -current.
cheers
luigi
On Tue, Jul 16, 2002 at 11:52:16PM -0700, Luigi Rizzo wrote:
> Hi,
> we have implemented a weight-based process scheduler
> for FreeBSD-stable, and are trying to port it to -current (in the
> description below replace "process" with "thread/kse" as appropriate
> for the -current case).
>
> In order to make this work, it is convenient to have all
> scheduler-specific functions and data structures in a
> single file (kern_switch*.c), and generic support in
> another one (kern_synch.c). I believe this was also the intent
> of the original BSD design in partitioning the code between the
> two files. However, in the current code there are some
> scheduler-specific functions (see below) that are misplaced.
>
> So I would like to make the following, purely cosmetic, change,
> identified as step #1 below, both for consistency with what I
> believe to be the original design, and to simplify further work
> in this area. These changes would go into both -current and
> -stable; I repeat, they are only cosmetic changes.
>
> Comments? I already spoke to julian, who has no objections.
>
> For those interested, a patch for -current is attached, and the one
> for -stable is very similar (for the record, in -stable we have a
> fully functional weight-based process scheduler, where you still
> control the weights using "nice", and you can switch between the
> old and the new scheduler at any time using a sysctl variable).
>
> --- Re. the multi-scheduler architecture ---
>
> The general idea is to make the process/thread/kse scheduler
> a replaceable piece of the kernel, requiring no modifications
> to the "struct proc", and with the ability to switch from
> one scheduler to another at runtime (both for testing
> purposes and for whatever need may arise).
>
>
> The way to achieve this would be the following:
>
> 1. identify all scheduler-specific functions, put all of them
> in one file (e.g. kern_switch.c for the default scheduler),
> and define function pointers for all of them;
>
> 2. use one field in "struct proc" as a link to whatever
> scheduler-specific data structures are necessary.
> In -stable, we have the p_pad3 field that can be used
> for this purpose, much like the if_index field in struct ifnet.
>
> 3. implement a function which, under control of a sysctl call,
> activates a new scheduler (by initialising all the function
> pointers to appropriate values) and moves all processes from
> the old scheduler to the new one; a sketch of the resulting
> indirection follows this list.
>
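> A minimal sketch of the indirection that steps #1-#3 would
> introduce (hypothetical names, not actual code from our patch):
>
> struct thread;
>
> /*
>  * One of these per scheduler; the sysctl handler of step #3
>  * swaps the active pointer and migrates all processes.
>  */
> struct sched_ops {
> 	void	(*so_setrunqueue)(struct thread *td);
> 	void	(*so_remrunqueue)(struct thread *td);
> 	struct thread *(*so_choosethread)(void);
> 	void	(*so_schedclock)(struct thread *td);	/* per-tick hook */
> };
>
> static struct sched_ops *active_sched;	/* set via the sysctl */
>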
> Step #1 is basically a cosmetic change, requiring mostly a move of
> some functions from kern_synch.c to kern_switch.c. These are
>
> roundrobin();
>
> curpriority_cmp();
>
> resetpriority();
>
> parts of schedcpu() and schedclock();
>
> cheers
> luigi
> ----------------------------------
>
> Index: kern_switch.c
> ===================================================================
> RCS file: /home/ncvs/src/sys/kern/kern_switch.c,v
> retrieving revision 1.33
> diff -u -r1.33 kern_switch.c
> --- kern_switch.c 14 Jul 2002 03:43:33 -0000 1.33
> +++ kern_switch.c 16 Jul 2002 22:15:06 -0000
> @@ -97,6 +97,7 @@
> #include <sys/mutex.h>
> #include <sys/proc.h>
> #include <sys/queue.h>
> +#include <sys/resourcevar.h>
> #include <machine/critical.h>
>
> CTASSERT((RQB_BPW * RQB_LEN) == RQ_NQS);
> @@ -107,6 +108,9 @@
> static struct runq runq;
> SYSINIT(runq, SI_SUB_RUN_QUEUE, SI_ORDER_FIRST, runq_init, &runq)
>
> +static struct callout roundrobin_callout;
> +
> +static void roundrobin(void *arg);
> static void runq_readjust(struct runq *rq, struct kse *ke);
> /************************************************************************
> * Functions that manipulate runnability from a thread perspective. *
> @@ -442,9 +446,13 @@
> {
> int i;
>
> + callout_init(&roundrobin_callout, 0);
> +
> bzero(rq, sizeof *rq);
> for (i = 0; i < RQ_NQS; i++)
> TAILQ_INIT(&rq->rq_queues[i]);
> +
> + roundrobin(NULL);
> }
>
> /*
> @@ -719,3 +727,207 @@
> }
> #endif
>
> +/*
> + * -- formerly in kern_synch.c
> + */
> +
> +int curpriority_cmp(struct thread *td);
> +
> +int
> +curpriority_cmp(struct thread *td)
> +{
> + return (td->td_priority - curthread->td_priority);
> +}
> +
> +/*
> + * Force switch among equal priority processes every 100ms.
> + * We don't actually need to force a context switch of the current process.
> + * The act of firing the event triggers a context switch to softclock() and
> + * then switching back out again which is equivalent to a preemption, thus
> + * no further work is needed on the local CPU.
> + */
> +/* ARGSUSED */
> +static void
> +roundrobin(arg)
> + void *arg;
> +{
> +
> +#ifdef SMP
> + mtx_lock_spin(&sched_lock);
> + forward_roundrobin();
> + mtx_unlock_spin(&sched_lock);
> +#endif
> +
> + callout_reset(&roundrobin_callout, sched_quantum, roundrobin, NULL);
> +}
> +
> +/* calculations for digital decay to forget 90% of usage in 5*loadav sec */
> +#define loadfactor(loadav) (2 * (loadav))
> +#define decay_cpu(loadfac, cpu) (((loadfac) * (cpu)) / ((loadfac) + FSCALE))
> +extern fixpt_t ccpu;
> +#define CCPU_SHIFT 11 /* XXX dup from kern_synch.c! */
> +
> +/*
> + * Recompute process priorities, every hz ticks.
> + * MP-safe, called without the Giant mutex.
> + */
> +void schedcpu1(struct ksegrp *kg);
> +/* ARGSUSED */
> +void
> +schedcpu1(struct ksegrp *kg)
> +{
> + register fixpt_t loadfac = loadfactor(averunnable.ldavg[0]);
> + struct thread *td;
> + struct kse *ke;
> + int realstathz;
> + int awake;
> +
> + realstathz = stathz ? stathz : hz;
> +
> + awake = 0;
> + FOREACH_KSE_IN_GROUP(kg, ke) {
> + /*
> + * Increment time in/out of memory and sleep
> + * time (if sleeping). We ignore overflow;
> + * with 16-bit int's (remember them?)
> + * overflow takes 45 days.
> + */
> + /* XXXKSE **WRONG***/
> + /*
> + * the kse slptimes are not touched in wakeup
> + * because the thread may not HAVE a KSE
> + */
> + if ((ke->ke_state == KES_ONRUNQ) ||
> + ((ke->ke_state == KES_THREAD) &&
> + (ke->ke_thread->td_state == TDS_RUNNING))) {
> + ke->ke_slptime++;
> + } else {
> + ke->ke_slptime = 0;
> + awake = 1;
> + }
> +
> + /*
> + * pctcpu is only for ps?
> + * Do it per kse.. and add them up at the end?
> + * XXXKSE
> + */
> + ke->ke_pctcpu = (ke->ke_pctcpu * ccpu) >> FSHIFT;
> + /*
> + * If the kse has been idle the entire second,
> + * stop recalculating its priority until
> + * it wakes up.
> + */
> + if (ke->ke_slptime > 1) {
> + continue;
> + }
> +
> +#if (FSHIFT >= CCPU_SHIFT)
> + ke->ke_pctcpu += (realstathz == 100) ?
> + ((fixpt_t) ke->ke_cpticks) <<
> + (FSHIFT - CCPU_SHIFT) :
> + 100 * (((fixpt_t) ke->ke_cpticks) <<
> + (FSHIFT - CCPU_SHIFT)) / realstathz;
> +#else
> + ke->ke_pctcpu += ((FSCALE - ccpu) *
> + (ke->ke_cpticks * FSCALE / realstathz)) >>
> + FSHIFT;
> +#endif
> + ke->ke_cpticks = 0;
> + } /* end of kse loop */
> + if (awake == 0) {
> + kg->kg_slptime++;
> + } else {
> + kg->kg_slptime = 0;
> + }
> + kg->kg_estcpu = decay_cpu(loadfac, kg->kg_estcpu);
> + resetpriority(kg);
> + FOREACH_THREAD_IN_GROUP(kg, td) {
> + int changedqueue;
> + if (td->td_priority >= PUSER) {
> + /*
> + * Only change the priority
> + * of threads that are still at their
> + * user priority.
> + * XXXKSE This is problematic
> + * as we may need to re-order
> + * the threads on the KSEG list.
> + */
> + changedqueue =
> + ((td->td_priority / RQ_PPQ) !=
> + (kg->kg_user_pri / RQ_PPQ));
> +
> + td->td_priority = kg->kg_user_pri;
> + if (changedqueue &&
> + td->td_state == TDS_RUNQ) {
> + /* this could be optimised */
> + remrunqueue(td);
> + td->td_priority =
> + kg->kg_user_pri;
> + setrunqueue(td);
> + } else {
> + td->td_priority = kg->kg_user_pri;
> + }
> + }
> + }
> +}
> +
> +/*
> + * Compute the priority of a process when running in user mode.
> + * Arrange to reschedule if the resulting priority is better
> + * than that of the current process.
> + */
> +void
> +resetpriority(kg)
> + register struct ksegrp *kg;
> +{
> + register unsigned int newpriority;
> + struct thread *td;
> +
> + mtx_lock_spin(&sched_lock);
> + if (kg->kg_pri_class == PRI_TIMESHARE) {
> + newpriority = PUSER + kg->kg_estcpu / INVERSE_ESTCPU_WEIGHT +
> + NICE_WEIGHT * (kg->kg_nice - PRIO_MIN);
> + newpriority = min(max(newpriority, PRI_MIN_TIMESHARE),
> + PRI_MAX_TIMESHARE);
> + kg->kg_user_pri = newpriority;
> + }
> + FOREACH_THREAD_IN_GROUP(kg, td) {
> + maybe_resched(td); /* XXXKSE silly */
> + }
> + mtx_unlock_spin(&sched_lock);
> +}
> +
> +#if 0
> +/*
> + * We adjust the priority of the current process. The priority of
> + * a process gets worse as it accumulates CPU time. The cpu usage
> + * estimator (p_estcpu) is increased here. resetpriority() will
> + * compute a different priority each time p_estcpu increases by
> + * INVERSE_ESTCPU_WEIGHT
> + * (until MAXPRI is reached). The cpu usage estimator ramps up
> + * quite quickly when the process is running (linearly), and decays
> + * away exponentially, at a rate which is proportionally slower when
> + * the system is busy. The basic principle is that the system will
> + * 90% forget that the process used a lot of CPU time in 5 * loadav
> + * seconds. This causes the system to favor processes which haven't
> + * run much recently, and to round-robin among other processes.
> + */
> +void
> +schedclock(td)
> + struct thread *td;
> +{
> + struct kse *ke;
> + struct ksegrp *kg;
> +
> + KASSERT((td != NULL), ("schedlock: null thread pointer"));
> + ke = td->td_kse;
> + kg = td->td_ksegrp;
> + ke->ke_cpticks++;
> + kg->kg_estcpu = ESTCPULIM(kg->kg_estcpu + 1);
> + if ((kg->kg_estcpu % INVERSE_ESTCPU_WEIGHT) == 0) {
> + resetpriority(kg);
> + if (td->td_priority >= PUSER)
> + td->td_priority = kg->kg_user_pri;
> + }
> +}
> +#endif
> Index: kern_synch.c
> ===================================================================
> RCS file: /home/ncvs/src/sys/kern/kern_synch.c,v
> retrieving revision 1.185
> diff -u -r1.185 kern_synch.c
> --- kern_synch.c 14 Jul 2002 03:43:33 -0000 1.185
> +++ kern_synch.c 16 Jul 2002 22:16:46 -0000
> @@ -67,6 +67,8 @@
>
> #include <machine/cpu.h>
>
> +void schedcpu1(struct ksegrp *kg); /* XXX in kern_switch.c */
> +
> static void sched_setup(void *dummy);
> SYSINIT(sched_setup, SI_SUB_KICK_SCHEDULER, SI_ORDER_FIRST, sched_setup, NULL)
>
> @@ -76,7 +78,6 @@
>
> static struct callout loadav_callout;
> static struct callout schedcpu_callout;
> -static struct callout roundrobin_callout;
>
> struct loadavg averunnable =
> { {0, 0, 0}, FSCALE }; /* load average, of runnable procs */
> @@ -92,7 +93,6 @@
>
> static void endtsleep(void *);
> static void loadav(void *arg);
> -static void roundrobin(void *arg);
> static void schedcpu(void *arg);
>
> static int
> @@ -135,28 +135,6 @@
> }
>
> /*
> - * Force switch among equal priority processes every 100ms.
> - * We don't actually need to force a context switch of the current process.
> - * The act of firing the event triggers a context switch to softclock() and
> - * then switching back out again which is equivalent to a preemption, thus
> - * no further work is needed on the local CPU.
> - */
> -/* ARGSUSED */
> -static void
> -roundrobin(arg)
> - void *arg;
> -{
> -
> -#ifdef SMP
> - mtx_lock_spin(&sched_lock);
> - forward_roundrobin();
> - mtx_unlock_spin(&sched_lock);
> -#endif
> -
> - callout_reset(&roundrobin_callout, sched_quantum, roundrobin, NULL);
> -}
> -
> -/*
> * Constants for digital decay and forget:
> * 90% of (p_estcpu) usage in 5 * loadav time
> * 95% of (p_pctcpu) usage in 60 seconds (load insensitive)
> @@ -225,7 +203,7 @@
> #define decay_cpu(loadfac, cpu) (((loadfac) * (cpu)) / ((loadfac) + FSCALE))
>
> /* decay 95% of `p_pctcpu' in 60 seconds; see CCPU_SHIFT before changing */
> -static fixpt_t ccpu = 0.95122942450071400909 * FSCALE; /* exp(-1/20) */
> +fixpt_t ccpu = 0.95122942450071400909 * FSCALE; /* exp(-1/20) */
> SYSCTL_INT(_kern, OID_AUTO, ccpu, CTLFLAG_RD, &ccpu, 0, "");
>
> /* kernel uses `FSCALE', userland (SHOULD) use kern.fscale */
> @@ -255,105 +233,15 @@
> schedcpu(arg)
> void *arg;
> {
> - register fixpt_t loadfac = loadfactor(averunnable.ldavg[0]);
> - struct thread *td;
> struct proc *p;
> - struct kse *ke;
> struct ksegrp *kg;
> - int realstathz;
> - int awake;
>
> - realstathz = stathz ? stathz : hz;
> sx_slock(&allproc_lock);
> FOREACH_PROC_IN_SYSTEM(p) {
> mtx_lock_spin(&sched_lock);
> p->p_swtime++;
> FOREACH_KSEGRP_IN_PROC(p, kg) {
> - awake = 0;
> - FOREACH_KSE_IN_GROUP(kg, ke) {
> - /*
> - * Increment time in/out of memory and sleep
> - * time (if sleeping). We ignore overflow;
> - * with 16-bit int's (remember them?)
> - * overflow takes 45 days.
> - */
> - /* XXXKSE **WRONG***/
> - /*
> - * the kse slptimes are not touched in wakeup
> - * because the thread may not HAVE a KSE
> - */
> - if ((ke->ke_state == KES_ONRUNQ) ||
> - ((ke->ke_state == KES_THREAD) &&
> - (ke->ke_thread->td_state == TDS_RUNNING))) {
> - ke->ke_slptime++;
> - } else {
> - ke->ke_slptime = 0;
> - awake = 1;
> - }
> -
> - /*
> - * pctcpu is only for ps?
> - * Do it per kse.. and add them up at the end?
> - * XXXKSE
> - */
> - ke->ke_pctcpu = (ke->ke_pctcpu * ccpu) >> FSHIFT;
> - /*
> - * If the kse has been idle the entire second,
> - * stop recalculating its priority until
> - * it wakes up.
> - */
> - if (ke->ke_slptime > 1) {
> - continue;
> - }
> -
> -#if (FSHIFT >= CCPU_SHIFT)
> - ke->ke_pctcpu += (realstathz == 100) ?
> - ((fixpt_t) ke->ke_cpticks) <<
> - (FSHIFT - CCPU_SHIFT) :
> - 100 * (((fixpt_t) ke->ke_cpticks) <<
> - (FSHIFT - CCPU_SHIFT)) / realstathz;
> -#else
> - ke->ke_pctcpu += ((FSCALE - ccpu) *
> - (ke->ke_cpticks * FSCALE / realstathz)) >>
> - FSHIFT;
> -#endif
> - ke->ke_cpticks = 0;
> - } /* end of kse loop */
> - if (awake == 0) {
> - kg->kg_slptime++;
> - } else {
> - kg->kg_slptime = 0;
> - }
> - kg->kg_estcpu = decay_cpu(loadfac, kg->kg_estcpu);
> - resetpriority(kg);
> - FOREACH_THREAD_IN_GROUP(kg, td) {
> - int changedqueue;
> - if (td->td_priority >= PUSER) {
> - /*
> - * Only change the priority
> - * of threads that are still at their
> - * user priority.
> - * XXXKSE This is problematic
> - * as we may need to re-order
> - * the threads on the KSEG list.
> - */
> - changedqueue =
> - ((td->td_priority / RQ_PPQ) !=
> - (kg->kg_user_pri / RQ_PPQ));
> -
> - td->td_priority = kg->kg_user_pri;
> - if (changedqueue &&
> - td->td_state == TDS_RUNQ) {
> - /* this could be optimised */
> - remrunqueue(td);
> - td->td_priority =
> - kg->kg_user_pri;
> - setrunqueue(td);
> - } else {
> - td->td_priority = kg->kg_user_pri;
> - }
> - }
> - }
> + schedcpu1(kg);
> } /* end of ksegrp loop */
> mtx_unlock_spin(&sched_lock);
> } /* end of process loop */
> @@ -944,32 +832,6 @@
> }
>
> /*
> - * Compute the priority of a process when running in user mode.
> - * Arrange to reschedule if the resulting priority is better
> - * than that of the current process.
> - */
> -void
> -resetpriority(kg)
> - register struct ksegrp *kg;
> -{
> - register unsigned int newpriority;
> - struct thread *td;
> -
> - mtx_lock_spin(&sched_lock);
> - if (kg->kg_pri_class == PRI_TIMESHARE) {
> - newpriority = PUSER + kg->kg_estcpu / INVERSE_ESTCPU_WEIGHT +
> - NICE_WEIGHT * (kg->kg_nice - PRIO_MIN);
> - newpriority = min(max(newpriority, PRI_MIN_TIMESHARE),
> - PRI_MAX_TIMESHARE);
> - kg->kg_user_pri = newpriority;
> - }
> - FOREACH_THREAD_IN_GROUP(kg, td) {
> - maybe_resched(td); /* XXXKSE silly */
> - }
> - mtx_unlock_spin(&sched_lock);
> -}
> -
> -/*
> * Compute a tenex style load average of a quantity on
> * 1, 5 and 15 minute intervals.
> * XXXKSE Needs complete rewrite when correct info is available.
> @@ -1022,11 +884,9 @@
> {
>
> callout_init(&schedcpu_callout, 1);
> - callout_init(&roundrobin_callout, 0);
> callout_init(&loadav_callout, 0);
>
> /* Kick off timeout driven events by calling first time. */
> - roundrobin(NULL);
> schedcpu(NULL);
> loadav(NULL);
> }
>
> ----- End forwarded message -----
>
To Unsubscribe: send mail to majordomo@FreeBSD.org
with "unsubscribe freebsd-arch" in the body of the message
