From owner-freebsd-arch  Thu Jul 18 16:32:41 2002
Date: Thu, 18 Jul 2002 16:31:57 -0700
From: Luigi Rizzo
To: arch@FreeBSD.ORG
Subject: NEW SCHEDULER available (was Re: proposed changes to kern_switch.c and kern_synch.c)
Message-ID: <20020718163157.A27017@iguana.icir.org>
References: <20020716235216.B6785@iguana.icir.org>
In-Reply-To: <20020716235216.B6785@iguana.icir.org>; from rizzo@icir.org on Tue, Jul 16, 2002 at 11:52:16PM -0700

[crossposted to -stable as relevant there]

Hi,

as promised, a first version of the Proportional Share scheduler that
we developed is available at

        http://info.iet.unipi.it/~luigi/ps_sched.20020719a.diff

The patch is against a recent -STABLE (I think any version from 4.4
should work; the only three files modified are kern_synch.c,
kern_switch.c and proc.h, plus a one-line change to kern_exit.c).

I have tested it a little bit on a diskless system, and it seems to
survive running a full X session with the usual set of xterm, netscape
etc. while I do a "renice" of the processes and even switch back and
forth between schedulers. But do not trust this yet for a production
system!

The sysctl variable kern.scheduler controls the scheduler in use:

        kern.scheduler=1 (default) selects Proportional Share
        kern.scheduler=0 selects Feedback Priority ("classic BSD")

You can switch between the two schedulers at any time by changing the
value of this sysctl variable.

To change the "weight" associated with a process, use the "nice" and
"renice" commands. As usual, positive values (+1..+20) mean the process
gets a smaller share of the CPU, and negative values (-1..-20) a larger
share. The actual weights (which control the relative fraction of CPU
cycles given to each process) are assigned through a lookup table which
maps the nice value to the range 1 ... 100 ... 1000 (min, normal, max
weight).

The "old" scheduler (Feedback Priority) should be as robust and stable
as always, whereas there are still a few things to clean up in the
Proportional Share scheduler, namely:

 * I have not looked in detail at the SMP case, so do not run it on an
   SMP kernel;

 * given that there are no priorities, processes woken up by a tsleep()
   are not guaranteed to run before all processes with p_priority >=
   PUSER; however, they are given a shorter deadline, so they are
   likely to run first;

 * RTPRI and IDLEPRI processes do not yet have any special treatment
   (they all end up together with normal processes; this is trivial to
   fix, I just haven't had time to look at that).

Have fun, and if you use it please let me know of any bugs, and send
suggestions for adapting it to -current.

        cheers
        luigi
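As a rough illustration only (the table values and names below are made
up for this example, not taken from the patch), the nice-to-weight
mapping described above boils down to a small lookup table plus a
clamping helper:

        /*
         * Hypothetical sketch: map nice values (-20..+20) to scheduler
         * weights in the range 1..100..1000 (min, normal, max).  The
         * actual table in ps_sched.*.diff may use different values.
         */
        #define NICE_MIN        (-20)
        #define NICE_MAX        20

        static const int nice_to_weight_tab[NICE_MAX - NICE_MIN + 1] = {
                /* nice -20 .. -1: larger CPU shares */
                1000, 900, 800, 700, 600, 500, 450, 400, 350, 300,
                 280, 260, 240, 220, 200, 180, 160, 140, 120, 110,
                /* nice 0: the normal weight */
                 100,
                /* nice +1 .. +20: smaller CPU shares */
                  90,  80,  70,  60,  50,  45,  40,  35,  30,  25,
                  20,  17,  14,  12,  10,   8,   6,   4,   2,   1
        };

        static int
        nice_to_weight(int nice)
        {
                /* clamp to the valid nice range, then look up the weight */
                if (nice < NICE_MIN)
                        nice = NICE_MIN;
                else if (nice > NICE_MAX)
                        nice = NICE_MAX;
                return (nice_to_weight_tab[nice - NICE_MIN]);
        }

With a table like this, a process at weight 200 would receive roughly
twice the CPU share of a process at the default weight 100 while both
are runnable.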
On Tue, Jul 16, 2002 at 11:52:16PM -0700, Luigi Rizzo wrote:
> Hi,
> we have implemented a weight-based process scheduler
> for FreeBSD-stable, and are trying to port it to -current (in the
> description below replace "process" with "thread/kse" as appropriate
> for the -current case).
>
> In order to make this work, it is convenient to have all
> scheduler-specific functions and data structures in a
> single file (kern_switch*.c), and generic support in
> another one (kern_synch.c). I believe this was also the original
> BSD design in partitioning the code between the two files.
> However, in our current code, there are some scheduler-specific
> functions (see below) which are misplaced.
>
> So I would like to make the following, purely cosmetic, change,
> identified as step #1 below, both for consistency with what I
> believe to be the original design, and to simplify further work
> in this area. These would go to both -current and -stable; I
> repeat, they are only cosmetic changes.
>
> Comments? I already spoke to julian, who has no objections.
>
> For those interested, a patch for -current is attached, and the one
> for -stable is very similar (for the record, in -stable we have a
> fully functional weight-based process scheduler, where you still
> control the weights using "nice", and you can switch between the
> old and the new scheduler at any time using a sysctl variable).
>
> --- Re. the multi-scheduler architecture ---
>
> The general idea is to make the process/thread/kse scheduler
> a replaceable piece of the kernel, requiring no modifications
> to "struct proc", and with the ability to switch from
> one scheduler to another at runtime (this both for testing
> purposes and for whatever need may arise).
>
> The way to achieve this would be the following:
>
> 1. identify all scheduler-specific functions, put all of them
>    in one file (e.g. kern_switch.c for the default scheduler),
>    and define function pointers for all of them;
>
> 2. use one field in "struct proc" as a link to whatever
>    scheduler-specific data structures are necessary.
>    In -stable, we have the p_pad3 field that can be used
>    for this purpose, much like the if_index field in struct ifnet;
>
> 3. implement a function which, under control of a sysctl call,
>    activates a new scheduler (by initialising all the function
>    pointers to appropriate values) and moves all processes from
>    the old scheduler to the new one.
>
> Step #1 is basically a cosmetic change, requiring mostly a move of
> some functions from kern_synch.c to kern_switch.c. These are
>
>         roundrobin();
>         curpriority_cmp();
>         resetpriority();
>         parts of schedcpu() and schedclock();
>
>         cheers
>         luigi
> ----------------------------------
>
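To make the quoted plan a bit more concrete, here is a minimal sketch
(all names are hypothetical and not taken from the patch below) of what
the per-scheduler function-pointer table of step 1 and the sysctl-driven
switch of step 3 could look like:

        /*
         * Hypothetical sketch of a pluggable scheduler interface.
         * Each scheduler fills in one of these structures; the generic
         * code in kern_synch.c only calls through the pointers.
         */
        struct sched_ops {
                void            (*sop_setrunqueue)(struct proc *p);
                void            (*sop_remrunqueue)(struct proc *p);
                struct proc *   (*sop_chooseproc)(void);
                void            (*sop_resetpriority)(struct proc *p);
        };

        extern struct sched_ops fbp_sched_ops;  /* classic Feedback Priority */
        extern struct sched_ops ps_sched_ops;   /* Proportional Share */

        static struct sched_ops *cur_sched = &ps_sched_ops;

        /* sysctl handler: 0 selects Feedback Priority, 1 Proportional Share */
        static int
        sysctl_kern_scheduler(SYSCTL_HANDLER_ARGS)
        {
                int error, val;

                val = (cur_sched == &ps_sched_ops);
                error = sysctl_handle_int(oidp, &val, 0, req);
                if (error || req->newptr == NULL)
                        return (error);
                /*
                 * Step 3: install the new function pointers and migrate
                 * all runnable processes from the old run queues to the
                 * new ones (migration omitted in this sketch).
                 */
                cur_sched = val ? &ps_sched_ops : &fbp_sched_ops;
                return (0);
        }

        SYSCTL_PROC(_kern, OID_AUTO, scheduler, CTLTYPE_INT | CTLFLAG_RW,
            NULL, 0, sysctl_kern_scheduler, "I", "scheduler in use");

The single pointer in struct proc mentioned in step 2 (p_pad3 in -stable)
would then point to whatever per-process state the active scheduler needs.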
> Index: kern_switch.c
> ===================================================================
> RCS file: /home/ncvs/src/sys/kern/kern_switch.c,v
> retrieving revision 1.33
> diff -u -r1.33 kern_switch.c
> --- kern_switch.c 14 Jul 2002 03:43:33 -0000 1.33
> +++ kern_switch.c 16 Jul 2002 22:15:06 -0000
> @@ -97,6 +97,7 @@
>  #include
>  #include
>  #include
> +#include
>  #include
>
>  CTASSERT((RQB_BPW * RQB_LEN) == RQ_NQS);
> @@ -107,6 +108,9 @@
>  static struct runq runq;
>  SYSINIT(runq, SI_SUB_RUN_QUEUE, SI_ORDER_FIRST, runq_init, &runq)
>
> +static struct callout roundrobin_callout;
> +
> +static void roundrobin(void *arg);
>  static void runq_readjust(struct runq *rq, struct kse *ke);
>  /************************************************************************
>   * Functions that manipulate runnability from a thread perspective. *
> @@ -442,9 +446,13 @@
>  {
>      int i;
>
> +    callout_init(&roundrobin_callout, 0);
> +
>      bzero(rq, sizeof *rq);
>      for (i = 0; i < RQ_NQS; i++)
>          TAILQ_INIT(&rq->rq_queues[i]);
> +
> +    roundrobin(NULL);
>  }
>
>  /*
> @@ -719,3 +727,207 @@
>  }
>  #endif
>
> +/*
> + * -- formerly in kern_synch.c
> + */
> +
> +int curpriority_cmp(struct thread *td);
> +
> +int
> +curpriority_cmp(struct thread *td)
> +{
> +    return (td->td_priority - curthread->td_priority);
> +}
> +
> +/*
> + * Force switch among equal priority processes every 100ms.
> + * We don't actually need to force a context switch of the current process.
> + * The act of firing the event triggers a context switch to softclock() and
> + * then switching back out again which is equivalent to a preemption, thus
> + * no further work is needed on the local CPU.
> + */
> +/* ARGSUSED */
> +static void
> +roundrobin(arg)
> +    void *arg;
> +{
> +
> +#ifdef SMP
> +    mtx_lock_spin(&sched_lock);
> +    forward_roundrobin();
> +    mtx_unlock_spin(&sched_lock);
> +#endif
> +
> +    callout_reset(&roundrobin_callout, sched_quantum, roundrobin, NULL);
> +}
> +
> +/* calculations for digital decay to forget 90% of usage in 5*loadav sec */
> +#define loadfactor(loadav) (2 * (loadav))
> +#define decay_cpu(loadfac, cpu) (((loadfac) * (cpu)) / ((loadfac) + FSCALE))
> +extern fixpt_t ccpu;
> +#define CCPU_SHIFT 11 /* XXX dup from kern_synch.c! */
> +
> +/*
> + * Recompute process priorities, every hz ticks.
> + * MP-safe, called without the Giant mutex.
> + */
> +void schedcpu1(struct ksegrp *kg);
> +/* ARGSUSED */
> +void
> +schedcpu1(struct ksegrp *kg)
> +{
> +    register fixpt_t loadfac = loadfactor(averunnable.ldavg[0]);
> +    struct thread *td;
> +    struct kse *ke;
> +    int realstathz;
> +    int awake;
> +
> +    realstathz = stathz ? stathz : hz;
> +
> +    awake = 0;
> +    FOREACH_KSE_IN_GROUP(kg, ke) {
> +        /*
> +         * Increment time in/out of memory and sleep
> +         * time (if sleeping). We ignore overflow;
> +         * with 16-bit int's (remember them?)
> +         * overflow takes 45 days.
> +         */
> +        /* XXXKSE **WRONG***/
> +        /*
> +         * the kse slptimes are not touched in wakeup
> +         * because the thread may not HAVE a KSE
> +         */
> +        if ((ke->ke_state == KES_ONRUNQ) ||
> +            ((ke->ke_state == KES_THREAD) &&
> +            (ke->ke_thread->td_state == TDS_RUNNING))) {
> +            ke->ke_slptime++;
> +        } else {
> +            ke->ke_slptime = 0;
> +            awake = 1;
> +        }
> +
> +        /*
> +         * pctcpu is only for ps?
> +         * Do it per kse.. and add them up at the end?
> +         * XXXKSE
> +         */
> +        ke->ke_pctcpu = (ke->ke_pctcpu * ccpu) >> FSHIFT;
> +        /*
> +         * If the kse has been idle the entire second,
> +         * stop recalculating its priority until
> +         * it wakes up.
> +         */
> +        if (ke->ke_slptime > 1) {
> +            continue;
> +        }
> +
> +#if (FSHIFT >= CCPU_SHIFT)
> +        ke->ke_pctcpu += (realstathz == 100) ?
> +            ((fixpt_t) ke->ke_cpticks) <<
> +            (FSHIFT - CCPU_SHIFT) :
> +            100 * (((fixpt_t) ke->ke_cpticks) <<
> +            (FSHIFT - CCPU_SHIFT)) / realstathz;
> +#else
> +        ke->ke_pctcpu += ((FSCALE - ccpu) *
> +            (ke->ke_cpticks * FSCALE / realstathz)) >>
> +            FSHIFT;
> +#endif
> +        ke->ke_cpticks = 0;
> +    } /* end of kse loop */
> +    if (awake == 0) {
> +        kg->kg_slptime++;
> +    } else {
> +        kg->kg_slptime = 0;
> +    }
> +    kg->kg_estcpu = decay_cpu(loadfac, kg->kg_estcpu);
> +    resetpriority(kg);
> +    FOREACH_THREAD_IN_GROUP(kg, td) {
> +        int changedqueue;
> +        if (td->td_priority >= PUSER) {
> +            /*
> +             * Only change the priority
> +             * of threads that are still at their
> +             * user priority.
> +             * XXXKSE This is problematic
> +             * as we may need to re-order
> +             * the threads on the KSEG list.
> +             */
> +            changedqueue =
> +                ((td->td_priority / RQ_PPQ) !=
> +                (kg->kg_user_pri / RQ_PPQ));
> +
> +            td->td_priority = kg->kg_user_pri;
> +            if (changedqueue &&
> +                td->td_state == TDS_RUNQ) {
> +                /* this could be optimised */
> +                remrunqueue(td);
> +                td->td_priority =
> +                    kg->kg_user_pri;
> +                setrunqueue(td);
> +            } else {
> +                td->td_priority = kg->kg_user_pri;
> +            }
> +        }
> +    }
> +}
> +
> +/*
> + * Compute the priority of a process when running in user mode.
> + * Arrange to reschedule if the resulting priority is better
> + * than that of the current process.
> + */
> +void
> +resetpriority(kg)
> +    register struct ksegrp *kg;
> +{
> +    register unsigned int newpriority;
> +    struct thread *td;
> +
> +    mtx_lock_spin(&sched_lock);
> +    if (kg->kg_pri_class == PRI_TIMESHARE) {
> +        newpriority = PUSER + kg->kg_estcpu / INVERSE_ESTCPU_WEIGHT +
> +            NICE_WEIGHT * (kg->kg_nice - PRIO_MIN);
> +        newpriority = min(max(newpriority, PRI_MIN_TIMESHARE),
> +            PRI_MAX_TIMESHARE);
> +        kg->kg_user_pri = newpriority;
> +    }
> +    FOREACH_THREAD_IN_GROUP(kg, td) {
> +        maybe_resched(td); /* XXXKSE silly */
> +    }
> +    mtx_unlock_spin(&sched_lock);
> +}
> +
> +#if 0
> +/*
> + * We adjust the priority of the current process. The priority of
> + * a process gets worse as it accumulates CPU time. The cpu usage
> + * estimator (p_estcpu) is increased here. resetpriority() will
> + * compute a different priority each time p_estcpu increases by
> + * INVERSE_ESTCPU_WEIGHT
> + * (until MAXPRI is reached). The cpu usage estimator ramps up
> + * quite quickly when the process is running (linearly), and decays
> + * away exponentially, at a rate which is proportionally slower when
> + * the system is busy. The basic principle is that the system will
> + * 90% forget that the process used a lot of CPU time in 5 * loadav
> + * seconds. This causes the system to favor processes which haven't
> + * run much recently, and to round-robin among other processes.
> + */
> +void
> +schedclock(td)
> +    struct thread *td;
> +{
> +    struct kse *ke;
> +    struct ksegrp *kg;
> +
> +    KASSERT((td != NULL), ("schedlock: null thread pointer"));
> +    ke = td->td_kse;
> +    kg = td->td_ksegrp;
> +    ke->ke_cpticks++;
> +    kg->kg_estcpu = ESTCPULIM(kg->kg_estcpu + 1);
> +    if ((kg->kg_estcpu % INVERSE_ESTCPU_WEIGHT) == 0) {
> +        resetpriority(kg);
> +        if (td->td_priority >= PUSER)
> +            td->td_priority = kg->kg_user_pri;
> +    }
> +}
> +#endif
> Index: kern_synch.c
> ===================================================================
> RCS file: /home/ncvs/src/sys/kern/kern_synch.c,v
> retrieving revision 1.185
> diff -u -r1.185 kern_synch.c
> --- kern_synch.c 14 Jul 2002 03:43:33 -0000 1.185
> +++ kern_synch.c 16 Jul 2002 22:16:46 -0000
> @@ -67,6 +67,8 @@
>
>  #include
>
> +void schedcpu1(struct ksegrp *kg); /* XXX in kern_switch.c */
> +
>  static void sched_setup(void *dummy);
>  SYSINIT(sched_setup, SI_SUB_KICK_SCHEDULER, SI_ORDER_FIRST, sched_setup, NULL)
>
> @@ -76,7 +78,6 @@
>
>  static struct callout loadav_callout;
>  static struct callout schedcpu_callout;
> -static struct callout roundrobin_callout;
>
>  struct loadavg averunnable =
>      { {0, 0, 0}, FSCALE }; /* load average, of runnable procs */
> @@ -92,7 +93,6 @@
>
>  static void endtsleep(void *);
>  static void loadav(void *arg);
> -static void roundrobin(void *arg);
>  static void schedcpu(void *arg);
>
>  static int
> @@ -135,28 +135,6 @@
>  }
>
>  /*
> - * Force switch among equal priority processes every 100ms.
> - * We don't actually need to force a context switch of the current process.
> - * The act of firing the event triggers a context switch to softclock() and
> - * then switching back out again which is equivalent to a preemption, thus
> - * no further work is needed on the local CPU.
> - */
> -/* ARGSUSED */
> -static void
> -roundrobin(arg)
> -    void *arg;
> -{
> -
> -#ifdef SMP
> -    mtx_lock_spin(&sched_lock);
> -    forward_roundrobin();
> -    mtx_unlock_spin(&sched_lock);
> -#endif
> -
> -    callout_reset(&roundrobin_callout, sched_quantum, roundrobin, NULL);
> -}
> -
> -/*
>   * Constants for digital decay and forget:
>   * 90% of (p_estcpu) usage in 5 * loadav time
>   * 95% of (p_pctcpu) usage in 60 seconds (load insensitive)
> @@ -225,7 +203,7 @@
>  #define decay_cpu(loadfac, cpu) (((loadfac) * (cpu)) / ((loadfac) + FSCALE))
>
>  /* decay 95% of `p_pctcpu' in 60 seconds; see CCPU_SHIFT before changing */
> -static fixpt_t ccpu = 0.95122942450071400909 * FSCALE; /* exp(-1/20) */
> +fixpt_t ccpu = 0.95122942450071400909 * FSCALE; /* exp(-1/20) */
>  SYSCTL_INT(_kern, OID_AUTO, ccpu, CTLFLAG_RD, &ccpu, 0, "");
>
>  /* kernel uses `FSCALE', userland (SHOULD) use kern.fscale */
> @@ -255,105 +233,15 @@
>  schedcpu(arg)
>      void *arg;
>  {
> -    register fixpt_t loadfac = loadfactor(averunnable.ldavg[0]);
> -    struct thread *td;
>      struct proc *p;
> -    struct kse *ke;
>      struct ksegrp *kg;
> -    int realstathz;
> -    int awake;
>
> -    realstathz = stathz ? stathz : hz;
>      sx_slock(&allproc_lock);
>      FOREACH_PROC_IN_SYSTEM(p) {
>          mtx_lock_spin(&sched_lock);
>          p->p_swtime++;
>          FOREACH_KSEGRP_IN_PROC(p, kg) {
> -            awake = 0;
> -            FOREACH_KSE_IN_GROUP(kg, ke) {
> -                /*
> -                 * Increment time in/out of memory and sleep
> -                 * time (if sleeping). We ignore overflow;
> -                 * with 16-bit int's (remember them?)
> -                 * overflow takes 45 days.
> -                 */
> -                /* XXXKSE **WRONG***/
> -                /*
> -                 * the kse slptimes are not touched in wakeup
> -                 * because the thread may not HAVE a KSE
> -                 */
> -                if ((ke->ke_state == KES_ONRUNQ) ||
> -                    ((ke->ke_state == KES_THREAD) &&
> -                    (ke->ke_thread->td_state == TDS_RUNNING))) {
> -                    ke->ke_slptime++;
> -                } else {
> -                    ke->ke_slptime = 0;
> -                    awake = 1;
> -                }
> -
> -                /*
> -                 * pctcpu is only for ps?
> -                 * Do it per kse.. and add them up at the end?
> -                 * XXXKSE
> -                 */
> -                ke->ke_pctcpu = (ke->ke_pctcpu * ccpu) >> FSHIFT;
> -                /*
> -                 * If the kse has been idle the entire second,
> -                 * stop recalculating its priority until
> -                 * it wakes up.
> -                 */
> -                if (ke->ke_slptime > 1) {
> -                    continue;
> -                }
> -
> -#if (FSHIFT >= CCPU_SHIFT)
> -                ke->ke_pctcpu += (realstathz == 100) ?
> -                    ((fixpt_t) ke->ke_cpticks) <<
> -                    (FSHIFT - CCPU_SHIFT) :
> -                    100 * (((fixpt_t) ke->ke_cpticks) <<
> -                    (FSHIFT - CCPU_SHIFT)) / realstathz;
> -#else
> -                ke->ke_pctcpu += ((FSCALE - ccpu) *
> -                    (ke->ke_cpticks * FSCALE / realstathz)) >>
> -                    FSHIFT;
> -#endif
> -                ke->ke_cpticks = 0;
> -            } /* end of kse loop */
> -            if (awake == 0) {
> -                kg->kg_slptime++;
> -            } else {
> -                kg->kg_slptime = 0;
> -            }
> -            kg->kg_estcpu = decay_cpu(loadfac, kg->kg_estcpu);
> -            resetpriority(kg);
> -            FOREACH_THREAD_IN_GROUP(kg, td) {
> -                int changedqueue;
> -                if (td->td_priority >= PUSER) {
> -                    /*
> -                     * Only change the priority
> -                     * of threads that are still at their
> -                     * user priority.
> -                     * XXXKSE This is problematic
> -                     * as we may need to re-order
> -                     * the threads on the KSEG list.
> -                     */
> -                    changedqueue =
> -                        ((td->td_priority / RQ_PPQ) !=
> -                        (kg->kg_user_pri / RQ_PPQ));
> -
> -                    td->td_priority = kg->kg_user_pri;
> -                    if (changedqueue &&
> -                        td->td_state == TDS_RUNQ) {
> -                        /* this could be optimised */
> -                        remrunqueue(td);
> -                        td->td_priority =
> -                            kg->kg_user_pri;
> -                        setrunqueue(td);
> -                    } else {
> -                        td->td_priority = kg->kg_user_pri;
> -                    }
> -                }
> -            }
> +            schedcpu1(kg);
>          } /* end of ksegrp loop */
>          mtx_unlock_spin(&sched_lock);
>      } /* end of process loop */
> @@ -944,32 +832,6 @@
>  }
>
>  /*
> - * Compute the priority of a process when running in user mode.
> - * Arrange to reschedule if the resulting priority is better
> - * than that of the current process.
> - */
> -void
> -resetpriority(kg)
> -    register struct ksegrp *kg;
> -{
> -    register unsigned int newpriority;
> -    struct thread *td;
> -
> -    mtx_lock_spin(&sched_lock);
> -    if (kg->kg_pri_class == PRI_TIMESHARE) {
> -        newpriority = PUSER + kg->kg_estcpu / INVERSE_ESTCPU_WEIGHT +
> -            NICE_WEIGHT * (kg->kg_nice - PRIO_MIN);
> -        newpriority = min(max(newpriority, PRI_MIN_TIMESHARE),
> -            PRI_MAX_TIMESHARE);
> -        kg->kg_user_pri = newpriority;
> -    }
> -    FOREACH_THREAD_IN_GROUP(kg, td) {
> -        maybe_resched(td); /* XXXKSE silly */
> -    }
> -    mtx_unlock_spin(&sched_lock);
> -}
> -
> -/*
>   * Compute a tenex style load average of a quantity on
>   * 1, 5 and 15 minute intervals.
>   * XXXKSE Needs complete rewrite when correct info is available.
> @@ -1022,11 +884,9 @@
>  {
>
>      callout_init(&schedcpu_callout, 1);
> -    callout_init(&roundrobin_callout, 0);
>      callout_init(&loadav_callout, 0);
>
>      /* Kick off timeout driven events by calling first time. */
> -    roundrobin(NULL);
>      schedcpu(NULL);
>      loadav(NULL);
>  }
>
> ----- End forwarded message -----
>
> To Unsubscribe: send mail to majordomo@FreeBSD.org
> with "unsubscribe freebsd-arch" in the body of the message

To Unsubscribe: send mail to majordomo@FreeBSD.org
with "unsubscribe freebsd-arch" in the body of the message