From owner-freebsd-arch  Thu Jul 18 16:32:41 2002
Date: Thu, 18 Jul 2002 16:31:57 -0700
From: Luigi Rizzo
To: arch@FreeBSD.ORG
Subject: NEW SCHEDULER available (was Re: proposed changes to kern_switch.c and kern_synch.c)
Message-ID: <20020718163157.A27017@iguana.icir.org>
References: <20020716235216.B6785@iguana.icir.org>
In-Reply-To: <20020716235216.B6785@iguana.icir.org>; from rizzo@icir.org on Tue, Jul 16, 2002 at 11:52:16PM -0700

[crossposted to -stable as relevant there]

Hi,

as promised, a first version of the Proportional Share scheduler that
we developed is available at

        http://info.iet.unipi.it/~luigi/ps_sched.20020719a.diff

The patch is against a recent -STABLE (I think any version from 4.4
should work; the only three files modified are kern_synch.c,
kern_switch.c and proc.h, plus a one-line change to kern_exit.c).

I have tested it a little bit on a diskless system, and it seems to
survive running a full X session with the usual set of xterm, netscape
etc. while I do a "renice" of the processes and even switch back and
forth between schedulers. But do not trust this yet for a production
system!

The sysctl variable kern.scheduler controls the scheduler in use:

        kern.scheduler=1 (default) selects Proportional Share
        kern.scheduler=0 selects Feedback Priority ("classic BSD")

You can switch between the two schedulers at any time by changing the
value of this sysctl variable.

To change the "weight" associated with a process, use the "nice" and
"renice" commands. As usual, positive values (+1..+20) mean the process
gets a smaller share of the CPU, and negative values (-1..-20) a larger
share. The actual weights (which control the relative fraction of CPU
cycles given to each process) are assigned through a lookup table which
maps the nice value to the range 1 ... 100 ... 1000 (min, normal, max
weight).

The "old" scheduler (Feedback Priority) should be as robust and stable
as always, whereas there are still a few things to clean up in the
Proportional Share scheduler, namely:

 * I have not looked in detail at the SMP case, so do not run it on an
   SMP kernel;

 * given that there are no priorities, processes woken up by a tsleep()
   are not guaranteed to run before all processes with p_priority >=
   PUSER; however, they are given a shorter deadline, so they are
   likely to run first;

 * RTPRI and IDLEPRI processes do not yet have any special treatment
   (they all end up together with normal processes; this is trivial to
   fix, I just haven't had time to look at that).

Have fun, and if you use it please let me know of any bugs, and send
suggestions for adapting it to -current.

        cheers
        luigi
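As a rough illustration only (the table values and names below are made
up for this example, not taken from the patch), the nice-to-weight
mapping described above boils down to a small lookup table plus a
clamping helper:

        /*
         * Hypothetical sketch: map nice values (-20..+20) to scheduler
         * weights in the range 1..100..1000 (min, normal, max).  The
         * actual table in ps_sched.*.diff may use different values.
         */
        #define NICE_MIN        (-20)
        #define NICE_MAX        20

        static const int nice_to_weight_tab[NICE_MAX - NICE_MIN + 1] = {
                /* nice -20 .. -1: larger CPU shares */
                1000, 900, 800, 700, 600, 500, 450, 400, 350, 300,
                 280, 260, 240, 220, 200, 180, 160, 140, 120, 110,
                /* nice 0: the normal weight */
                 100,
                /* nice +1 .. +20: smaller CPU shares */
                  90,  80,  70,  60,  50,  45,  40,  35,  30,  25,
                  20,  17,  14,  12,  10,   8,   6,   4,   2,   1
        };

        static int
        nice_to_weight(int nice)
        {
                /* clamp to the valid nice range, then look up the weight */
                if (nice < NICE_MIN)
                        nice = NICE_MIN;
                else if (nice > NICE_MAX)
                        nice = NICE_MAX;
                return (nice_to_weight_tab[nice - NICE_MIN]);
        }

With a table like this, a process at weight 200 would receive roughly
twice the CPU share of a process at the default weight 100 while both
are runnable.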
On Tue, Jul 16, 2002 at 11:52:16PM -0700, Luigi Rizzo wrote:
> Hi,
> we have implemented a weight-based process scheduler
> for FreeBSD-stable, and are trying to port it to -current (in the
> description below replace "process" with "thread/kse" as appropriate
> for the -current case).
>
> In order to make this work, it is convenient to have all
> scheduler-specific functions and data structures in a
> single file (kern_switch*.c), and generic support in
> another one (kern_synch.c). I believe this was also the original
> BSD design in partitioning the code between the two files.
> However, in our current code, there are some scheduler-specific
> functions (see below) which are misplaced.
>
> So I would like to make the following, purely cosmetic, change,
> identified as step #1 below, both for consistency with what I
> believe to be the original design, and to simplify further work
> in this area. These would go to both -current and -stable; I
> repeat, they are only cosmetic changes.
>
> Comments? I already spoke to julian, who has no objections.
>
> For those interested, a patch for -current is attached, and the one
> for -stable is very similar (for the record, in -stable we have a
> fully functional weight-based process scheduler, where you still
> control the weights using "nice", and you can switch between the
> old and the new scheduler at any time using a sysctl variable).
>
> --- Re. the multi-scheduler architecture ---
>
> The general idea is to make the process/thread/kse scheduler
> a replaceable piece of the kernel, requiring no modifications
> to "struct proc", and with the ability to switch from
> one scheduler to another at runtime (this both for testing
> purposes and for whatever need may arise).
>
> The way to achieve this would be the following:
>
> 1. identify all scheduler-specific functions, put all of them
>    in one file (e.g. kern_switch.c for the default scheduler),
>    and define function pointers for all of them;
>
> 2. use one field in "struct proc" as a link to whatever
>    scheduler-specific data structures are necessary.
>    In -stable, we have the p_pad3 field that can be used
>    for this purpose, much like the if_index field in struct ifnet;
>
> 3. implement a function which, under control of a sysctl call,
>    activates a new scheduler (by initialising all the function
>    pointers to appropriate values) and moves all processes from
>    the old scheduler to the new one.
>
> Step #1 is basically a cosmetic change, requiring mostly a move of
> some functions from kern_synch.c to kern_switch.c. These are
>
>         roundrobin();
>         curpriority_cmp();
>         resetpriority();
>         parts of schedcpu() and schedclock();
>
>         cheers
>         luigi
> ----------------------------------
>
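To make the quoted plan a bit more concrete, here is a minimal sketch
(all names are hypothetical and not taken from the patch below) of what
the per-scheduler function-pointer table of step 1 and the sysctl-driven
switch of step 3 could look like:

        /*
         * Hypothetical sketch of a pluggable scheduler interface.
         * Each scheduler fills in one of these structures; the generic
         * code in kern_synch.c only calls through the pointers.
         */
        struct sched_ops {
                void            (*sop_setrunqueue)(struct proc *p);
                void            (*sop_remrunqueue)(struct proc *p);
                struct proc *   (*sop_chooseproc)(void);
                void            (*sop_resetpriority)(struct proc *p);
        };

        extern struct sched_ops fbp_sched_ops;  /* classic Feedback Priority */
        extern struct sched_ops ps_sched_ops;   /* Proportional Share */

        static struct sched_ops *cur_sched = &ps_sched_ops;

        /* sysctl handler: 0 selects Feedback Priority, 1 Proportional Share */
        static int
        sysctl_kern_scheduler(SYSCTL_HANDLER_ARGS)
        {
                int error, val;

                val = (cur_sched == &ps_sched_ops);
                error = sysctl_handle_int(oidp, &val, 0, req);
                if (error || req->newptr == NULL)
                        return (error);
                /*
                 * Step 3: install the new function pointers and migrate
                 * all runnable processes from the old run queues to the
                 * new ones (migration omitted in this sketch).
                 */
                cur_sched = val ? &ps_sched_ops : &fbp_sched_ops;
                return (0);
        }

        SYSCTL_PROC(_kern, OID_AUTO, scheduler, CTLTYPE_INT | CTLFLAG_RW,
            NULL, 0, sysctl_kern_scheduler, "I", "scheduler in use");

The single pointer in struct proc mentioned in step 2 (p_pad3 in -stable)
would then point to whatever per-process state the active scheduler needs.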
> Index: kern_switch.c
> ===================================================================
> RCS file: /home/ncvs/src/sys/kern/kern_switch.c,v
> retrieving revision 1.33
> diff -u -r1.33 kern_switch.c
> --- kern_switch.c 14 Jul 2002 03:43:33 -0000 1.33
> +++ kern_switch.c 16 Jul 2002 22:15:06 -0000
> @@ -97,6 +97,7 @@
>  #include
>  #include
>  #include
> +#include
>  #include
>
>  CTASSERT((RQB_BPW * RQB_LEN) == RQ_NQS);
> @@ -107,6 +108,9 @@
>  static struct runq runq;
>  SYSINIT(runq, SI_SUB_RUN_QUEUE, SI_ORDER_FIRST, runq_init, &runq)
>
> +static struct callout roundrobin_callout;
> +
> +static void roundrobin(void *arg);
>  static void runq_readjust(struct runq *rq, struct kse *ke);
>  /************************************************************************
>   * Functions that manipulate runnability from a thread perspective. *
> @@ -442,9 +446,13 @@
>  {
>      int i;
>
> +    callout_init(&roundrobin_callout, 0);
> +
>      bzero(rq, sizeof *rq);
>      for (i = 0; i < RQ_NQS; i++)
>          TAILQ_INIT(&rq->rq_queues[i]);
> +
> +    roundrobin(NULL);
>  }
>
>  /*
> @@ -719,3 +727,207 @@
>  }
>  #endif
>
> +/*
> + * -- formerly in kern_synch.c
> + */
> +
> +int curpriority_cmp(struct thread *td);
> +
> +int
> +curpriority_cmp(struct thread *td)
> +{
> +    return (td->td_priority - curthread->td_priority);
> +}
> +
> +/*
> + * Force switch among equal priority processes every 100ms.
> + * We don't actually need to force a context switch of the current process.
> + * The act of firing the event triggers a context switch to softclock() and
> + * then switching back out again which is equivalent to a preemption, thus
> + * no further work is needed on the local CPU.
> + */
> +/* ARGSUSED */
> +static void
> +roundrobin(arg)
> +    void *arg;
> +{
> +
> +#ifdef SMP
> +    mtx_lock_spin(&sched_lock);
> +    forward_roundrobin();
> +    mtx_unlock_spin(&sched_lock);
> +#endif
> +
> +    callout_reset(&roundrobin_callout, sched_quantum, roundrobin, NULL);
> +}
> +
> +/* calculations for digital decay to forget 90% of usage in 5*loadav sec */
> +#define loadfactor(loadav) (2 * (loadav))
> +#define decay_cpu(loadfac, cpu) (((loadfac) * (cpu)) / ((loadfac) + FSCALE))
> +extern fixpt_t ccpu;
> +#define CCPU_SHIFT 11 /* XXX dup from kern_synch.c! */
> +
> +/*
> + * Recompute process priorities, every hz ticks.
> + * MP-safe, called without the Giant mutex.
> + */
> +void schedcpu1(struct ksegrp *kg);
> +/* ARGSUSED */
> +void
> +schedcpu1(struct ksegrp *kg)
> +{
> +    register fixpt_t loadfac = loadfactor(averunnable.ldavg[0]);
> +    struct thread *td;
> +    struct kse *ke;
> +    int realstathz;
> +    int awake;
> +
> +    realstathz = stathz ? stathz : hz;
> +
> +    awake = 0;
> +    FOREACH_KSE_IN_GROUP(kg, ke) {
> +        /*
> +         * Increment time in/out of memory and sleep
> +         * time (if sleeping). We ignore overflow;
> +         * with 16-bit int's (remember them?)
> +         * overflow takes 45 days.
> +         */
> +        /* XXXKSE **WRONG***/
> +        /*
> +         * the kse slptimes are not touched in wakeup
> +         * because the thread may not HAVE a KSE
> +         */
> +        if ((ke->ke_state == KES_ONRUNQ) ||
> +            ((ke->ke_state == KES_THREAD) &&
> +            (ke->ke_thread->td_state == TDS_RUNNING))) {
> +            ke->ke_slptime++;
> +        } else {
> +            ke->ke_slptime = 0;
> +            awake = 1;
> +        }
> +
> +        /*
> +         * pctcpu is only for ps?
> +         * Do it per kse.. and add them up at the end?
> +         * XXXKSE
> +         */
> +        ke->ke_pctcpu = (ke->ke_pctcpu * ccpu) >> FSHIFT;
> +        /*
> +         * If the kse has been idle the entire second,
> +         * stop recalculating its priority until
> +         * it wakes up.
> +         */
> +        if (ke->ke_slptime > 1) {
> +            continue;
> +        }
> +
> +#if (FSHIFT >= CCPU_SHIFT)
> +        ke->ke_pctcpu += (realstathz == 100) ?
> +            ((fixpt_t) ke->ke_cpticks) <<
> +            (FSHIFT - CCPU_SHIFT) :
> +            100 * (((fixpt_t) ke->ke_cpticks) <<
> +            (FSHIFT - CCPU_SHIFT)) / realstathz;
> +#else
> +        ke->ke_pctcpu += ((FSCALE - ccpu) *
> +            (ke->ke_cpticks * FSCALE / realstathz)) >>
> +            FSHIFT;
> +#endif
> +        ke->ke_cpticks = 0;
> +    } /* end of kse loop */
> +    if (awake == 0) {
> +        kg->kg_slptime++;
> +    } else {
> +        kg->kg_slptime = 0;
> +    }
> +    kg->kg_estcpu = decay_cpu(loadfac, kg->kg_estcpu);
> +    resetpriority(kg);
> +    FOREACH_THREAD_IN_GROUP(kg, td) {
> +        int changedqueue;
> +        if (td->td_priority >= PUSER) {
> +            /*
> +             * Only change the priority
> +             * of threads that are still at their
> +             * user priority.
> +             * XXXKSE This is problematic
> +             * as we may need to re-order
> +             * the threads on the KSEG list.
> +             */
> +            changedqueue =
> +                ((td->td_priority / RQ_PPQ) !=
> +                (kg->kg_user_pri / RQ_PPQ));
> +
> +            td->td_priority = kg->kg_user_pri;
> +            if (changedqueue &&
> +                td->td_state == TDS_RUNQ) {
> +                /* this could be optimised */
> +                remrunqueue(td);
> +                td->td_priority =
> +                    kg->kg_user_pri;
> +                setrunqueue(td);
> +            } else {
> +                td->td_priority = kg->kg_user_pri;
> +            }
> +        }
> +    }
> +}
> +
> +/*
> + * Compute the priority of a process when running in user mode.
> + * Arrange to reschedule if the resulting priority is better
> + * than that of the current process.
> + */
> +void
> +resetpriority(kg)
> +    register struct ksegrp *kg;
> +{
> +    register unsigned int newpriority;
> +    struct thread *td;
> +
> +    mtx_lock_spin(&sched_lock);
> +    if (kg->kg_pri_class == PRI_TIMESHARE) {
> +        newpriority = PUSER + kg->kg_estcpu / INVERSE_ESTCPU_WEIGHT +
> +            NICE_WEIGHT * (kg->kg_nice - PRIO_MIN);
> +        newpriority = min(max(newpriority, PRI_MIN_TIMESHARE),
> +            PRI_MAX_TIMESHARE);
> +        kg->kg_user_pri = newpriority;
> +    }
> +    FOREACH_THREAD_IN_GROUP(kg, td) {
> +        maybe_resched(td); /* XXXKSE silly */
> +    }
> +    mtx_unlock_spin(&sched_lock);
> +}
> +
> +#if 0
> +/*
> + * We adjust the priority of the current process. The priority of
> + * a process gets worse as it accumulates CPU time. The cpu usage
> + * estimator (p_estcpu) is increased here. resetpriority() will
> + * compute a different priority each time p_estcpu increases by
> + * INVERSE_ESTCPU_WEIGHT
> + * (until MAXPRI is reached). The cpu usage estimator ramps up
> + * quite quickly when the process is running (linearly), and decays
> + * away exponentially, at a rate which is proportionally slower when
> + * the system is busy. The basic principle is that the system will
> + * 90% forget that the process used a lot of CPU time in 5 * loadav
> + * seconds. This causes the system to favor processes which haven't
> + * run much recently, and to round-robin among other processes.
> + */
> +void
> +schedclock(td)
> +    struct thread *td;
> +{
> +    struct kse *ke;
> +    struct ksegrp *kg;
> +
> +    KASSERT((td != NULL), ("schedlock: null thread pointer"));
> +    ke = td->td_kse;
> +    kg = td->td_ksegrp;
> +    ke->ke_cpticks++;
> +    kg->kg_estcpu = ESTCPULIM(kg->kg_estcpu + 1);
> +    if ((kg->kg_estcpu % INVERSE_ESTCPU_WEIGHT) == 0) {
> +        resetpriority(kg);
> +        if (td->td_priority >= PUSER)
> +            td->td_priority = kg->kg_user_pri;
> +    }
> +}
> +#endif
> Index: kern_synch.c
> ===================================================================
> RCS file: /home/ncvs/src/sys/kern/kern_synch.c,v
> retrieving revision 1.185
> diff -u -r1.185 kern_synch.c
> --- kern_synch.c 14 Jul 2002 03:43:33 -0000 1.185
> +++ kern_synch.c 16 Jul 2002 22:16:46 -0000
> @@ -67,6 +67,8 @@
>
>  #include
>
> +void schedcpu1(struct ksegrp *kg); /* XXX in kern_switch.c */
> +
>  static void sched_setup(void *dummy);
>  SYSINIT(sched_setup, SI_SUB_KICK_SCHEDULER, SI_ORDER_FIRST, sched_setup, NULL)
>
> @@ -76,7 +78,6 @@
>
>  static struct callout loadav_callout;
>  static struct callout schedcpu_callout;
> -static struct callout roundrobin_callout;
>
>  struct loadavg averunnable =
>      { {0, 0, 0}, FSCALE }; /* load average, of runnable procs */
> @@ -92,7 +93,6 @@
>
>  static void endtsleep(void *);
>  static void loadav(void *arg);
> -static void roundrobin(void *arg);
>  static void schedcpu(void *arg);
>
>  static int
> @@ -135,28 +135,6 @@
>  }
>
>  /*
> - * Force switch among equal priority processes every 100ms.
> - * We don't actually need to force a context switch of the current process.
> - * The act of firing the event triggers a context switch to softclock() and
> - * then switching back out again which is equivalent to a preemption, thus
> - * no further work is needed on the local CPU.
> - */
> -/* ARGSUSED */
> -static void
> -roundrobin(arg)
> -    void *arg;
> -{
> -
> -#ifdef SMP
> -    mtx_lock_spin(&sched_lock);
> -    forward_roundrobin();
> -    mtx_unlock_spin(&sched_lock);
> -#endif
> -
> -    callout_reset(&roundrobin_callout, sched_quantum, roundrobin, NULL);
> -}
> -
> -/*
>   * Constants for digital decay and forget:
>   * 90% of (p_estcpu) usage in 5 * loadav time
>   * 95% of (p_pctcpu) usage in 60 seconds (load insensitive)
> @@ -225,7 +203,7 @@
>  #define decay_cpu(loadfac, cpu) (((loadfac) * (cpu)) / ((loadfac) + FSCALE))
>
>  /* decay 95% of `p_pctcpu' in 60 seconds; see CCPU_SHIFT before changing */
> -static fixpt_t ccpu = 0.95122942450071400909 * FSCALE; /* exp(-1/20) */
> +fixpt_t ccpu = 0.95122942450071400909 * FSCALE; /* exp(-1/20) */
>  SYSCTL_INT(_kern, OID_AUTO, ccpu, CTLFLAG_RD, &ccpu, 0, "");
>
>  /* kernel uses `FSCALE', userland (SHOULD) use kern.fscale */
> @@ -255,105 +233,15 @@
>  schedcpu(arg)
>      void *arg;
>  {
> -    register fixpt_t loadfac = loadfactor(averunnable.ldavg[0]);
> -    struct thread *td;
>      struct proc *p;
> -    struct kse *ke;
>      struct ksegrp *kg;
> -    int realstathz;
> -    int awake;
>
> -    realstathz = stathz ? stathz : hz;
>      sx_slock(&allproc_lock);
>      FOREACH_PROC_IN_SYSTEM(p) {
>          mtx_lock_spin(&sched_lock);
>          p->p_swtime++;
>          FOREACH_KSEGRP_IN_PROC(p, kg) {
> -            awake = 0;
> -            FOREACH_KSE_IN_GROUP(kg, ke) {
> -                /*
> -                 * Increment time in/out of memory and sleep
> -                 * time (if sleeping). We ignore overflow;
> -                 * with 16-bit int's (remember them?)
> -                 * overflow takes 45 days.
> -                 */
> -                /* XXXKSE **WRONG***/
> -                /*
> -                 * the kse slptimes are not touched in wakeup
> -                 * because the thread may not HAVE a KSE
> -                 */
> -                if ((ke->ke_state == KES_ONRUNQ) ||
> -                    ((ke->ke_state == KES_THREAD) &&
> -                    (ke->ke_thread->td_state == TDS_RUNNING))) {
> -                    ke->ke_slptime++;
> -                } else {
> -                    ke->ke_slptime = 0;
> -                    awake = 1;
> -                }
> -
> -                /*
> -                 * pctcpu is only for ps?
> -                 * Do it per kse.. and add them up at the end?
> -                 * XXXKSE
> -                 */
> -                ke->ke_pctcpu = (ke->ke_pctcpu * ccpu) >> FSHIFT;
> -                /*
> -                 * If the kse has been idle the entire second,
> -                 * stop recalculating its priority until
> -                 * it wakes up.
> -                 */
> -                if (ke->ke_slptime > 1) {
> -                    continue;
> -                }
> -
> -#if (FSHIFT >= CCPU_SHIFT)
> -                ke->ke_pctcpu += (realstathz == 100) ?
> -                    ((fixpt_t) ke->ke_cpticks) <<
> -                    (FSHIFT - CCPU_SHIFT) :
> -                    100 * (((fixpt_t) ke->ke_cpticks) <<
> -                    (FSHIFT - CCPU_SHIFT)) / realstathz;
> -#else
> -                ke->ke_pctcpu += ((FSCALE - ccpu) *
> -                    (ke->ke_cpticks * FSCALE / realstathz)) >>
> -                    FSHIFT;
> -#endif
> -                ke->ke_cpticks = 0;
> -            } /* end of kse loop */
> -            if (awake == 0) {
> -                kg->kg_slptime++;
> -            } else {
> -                kg->kg_slptime = 0;
> -            }
> -            kg->kg_estcpu = decay_cpu(loadfac, kg->kg_estcpu);
> -            resetpriority(kg);
> -            FOREACH_THREAD_IN_GROUP(kg, td) {
> -                int changedqueue;
> -                if (td->td_priority >= PUSER) {
> -                    /*
> -                     * Only change the priority
> -                     * of threads that are still at their
> -                     * user priority.
> -                     * XXXKSE This is problematic
> -                     * as we may need to re-order
> -                     * the threads on the KSEG list.
> -                     */
> -                    changedqueue =
> -                        ((td->td_priority / RQ_PPQ) !=
> -                        (kg->kg_user_pri / RQ_PPQ));
> -
> -                    td->td_priority = kg->kg_user_pri;
> -                    if (changedqueue &&
> -                        td->td_state == TDS_RUNQ) {
> -                        /* this could be optimised */
> -                        remrunqueue(td);
> -                        td->td_priority =
> -                            kg->kg_user_pri;
> -                        setrunqueue(td);
> -                    } else {
> -                        td->td_priority = kg->kg_user_pri;
> -                    }
> -                }
> -            }
> +            schedcpu1(kg);
>          } /* end of ksegrp loop */
>          mtx_unlock_spin(&sched_lock);
>      } /* end of process loop */
> @@ -944,32 +832,6 @@
>  }
>
>  /*
> - * Compute the priority of a process when running in user mode.
> - * Arrange to reschedule if the resulting priority is better
> - * than that of the current process.
> - */
> -void
> -resetpriority(kg)
> -    register struct ksegrp *kg;
> -{
> -    register unsigned int newpriority;
> -    struct thread *td;
> -
> -    mtx_lock_spin(&sched_lock);
> -    if (kg->kg_pri_class == PRI_TIMESHARE) {
> -        newpriority = PUSER + kg->kg_estcpu / INVERSE_ESTCPU_WEIGHT +
> -            NICE_WEIGHT * (kg->kg_nice - PRIO_MIN);
> -        newpriority = min(max(newpriority, PRI_MIN_TIMESHARE),
> -            PRI_MAX_TIMESHARE);
> -        kg->kg_user_pri = newpriority;
> -    }
> -    FOREACH_THREAD_IN_GROUP(kg, td) {
> -        maybe_resched(td); /* XXXKSE silly */
> -    }
> -    mtx_unlock_spin(&sched_lock);
> -}
> -
> -/*
>   * Compute a tenex style load average of a quantity on
>   * 1, 5 and 15 minute intervals.
>   * XXXKSE Needs complete rewrite when correct info is available.
> @@ -1022,11 +884,9 @@
>  {
>
>      callout_init(&schedcpu_callout, 1);
> -    callout_init(&roundrobin_callout, 0);
>      callout_init(&loadav_callout, 0);
>
>      /* Kick off timeout driven events by calling first time. */
> -    roundrobin(NULL);
>      schedcpu(NULL);
>      loadav(NULL);
>  }
>
> ----- End forwarded message -----
>
> To Unsubscribe: send mail to majordomo@FreeBSD.org
> with "unsubscribe freebsd-arch" in the body of the message

To Unsubscribe: send mail to majordomo@FreeBSD.org
with "unsubscribe freebsd-arch" in the body of the message