Date:      Sun, 26 Jan 2003 11:13:39 -0800 (PST)
From:      Matthew Dillon <dillon@apollo.backplane.com>
To:        Jeff Roberson <jroberson@chesapeake.net>
Cc:        Steve Kargl <sgk@troutmask.apl.washington.edu>, Robert Watson <rwatson@FreeBSD.ORG>, Gary Jennejohn <garyj@jennejohn.org>, <arch@FreeBSD.ORG>
Subject:   Re: New scheduler - Interactivity fixes
Message-ID:  <200301261913.h0QJDd8c051876@apollo.backplane.com>
References:   <20030126035429.A64928-100000@mail.chesapeake.net>

:I actually have a local patch that does just this.  It didn't improve the
:situation for my buildworld -j4 on a dual box.  I'd like to leave the
:oncpu alone in sched_fork() as you suggest and instead move it to a new
:call, sched_exec().  My logic here is that since you're completely
:replacing the vm space you lose any locality advantage so might as well
:pick the least loaded cpu.

    I'll play with this a little today to see if I can improve the
    build times.  The real time is still pretty horrendous:
     
/usr/bin/time make -j 8 buildworld;	# DELL2550 2xCPU

ULE -
     3435.42 real      2500.73 user       581.20 sys
     3343.95 real      2501.86 user       581.67 sys
 
4BSD -
     2414.84 real      2648.92 user       758.28 sys


    These numbers, like the numbers you originally posted, seem to
    imply that if we play our cards right we can improve user and system
    times by doing a better job of avoiding cache issues when we do start
    stealing processes from other cpus' queues.

:I think we need a push and a pull.  The push could run periodically and
:sort the load of all cpus then see how far off the least and most loaded
:are.  I need to come up with some metric that is more interesting than the
:number of entries in the runq for the load balancing though.  It doesn't
:take into consideration the priority spread.  Also, the run queue depth
:can change so quickly if processes are doing lots of IO.

    I don't think periodic balancing has ever worked.  The balancing really
    has to be near term or instantaneous (in sched_choose() itself) to be
    effective.  Run-queue depth isn't as important as being able to give
    a process the ability to run until it next blocks (i.e. reduce
    involuntary context switches).  Not an easy problem, to be sure,
    since excessive latency also causes problems.
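
    To make that concrete, the near-term approach amounts to making the
    steal decision in sched_choose() itself, at the moment a cpu would
    otherwise go idle, rather than on a periodic timer.  A rough sketch
    (the per-cpu 'self'/'victim' queues, the load counters and the
    find_most_loaded() helper are assumptions for illustration, not code
    from either scheduler):

    sched_choose(self)
    {
	if ((task = RemoveHead(self->queue)) != NULL)
		return(task);

	/*
	 * Local queue is empty: pull the head of the most loaded remote
	 * queue instead of idling.  The stolen task will be cache-cold
	 * here, but that beats leaving the cpu idle.
	 */
	victim = find_most_loaded();
	if (victim != NULL && victim != self &&
	    (task = RemoveHead(victim->queue)) != NULL) {
		victim->load--;
		self->load++;
		return(task);
	}
	return(NULL);
    }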

:It would be nice to have something like the total slice time of all
:runnable processes and processes sleeping for a very short period of time
:on a given cpu.  Since the slice size is related to the priority, you
:would get a much more even load that way.

    Total slice time is getting into the fractional-fair-share scheduler
    methodology.  I'll outline it below (and later today I should have a
    sched_ffs.c to show you, it's turning out to be easier than I thought
    it would be!).

    A fractional fair share scheduler basically works like this.  In this
    example, higher priorities are better (in FreeBSD it's reversed, higher
    priorities are worse):

    sched_add(task)
    {
	GlobalAgg += task->priority;
	AddHead(queue, task);
    }

    sched_del(task)
    {
	GlobalAgg -= task->priority;
	RemoveNode(task);
    }

    sched_clock()
    {
	/* 'task' is the task currently running on this cpu */
	task->slice -= GlobalAgg;
	if (task->slice < 0)
		involuntary_switch();
    }

    sched_choose()
    {
	/*
	 * Skip (and refill) tasks whose slice has run out; run the
	 * first task found with slice remaining.
	 */
	while ((task = RemoveHead(queue)) != NULL) {
	    if (task->slice >= 0)
		break;
	    task->slice += task->priority * NOMINAL_TICKS;  /* usually 4 */
	    AddTail(queue, task);
	}
	return(task);
    }

    There are some fundamental features to this sort of scheduler:

    * Scheduled tasks are added to the head of the queue.  You don't get into
      races because a task's slice is only improved when the chooser decides
      to skip it. 

      This generally gives you optimal scheduling for interrupts and pipes
      (two processes ping-ponging each other with data).  The two processes
      remain cohesive until they both run out of their time slice, which
      is extremely cache efficient.

    * The time slice is docked by the aggregate priority, which means that
      lower priority or currently-running tasks lose their time slice more
      quickly when a high priority task is scheduled.

    * The time slice is fractional-fair.  The queue is not sorted by
      priority, yet overall cpu utilization will be almost exactly the
      fraction of the task's priority versus the aggregate priority,
      regardless of the granularity of the clock (a short simulation
      sketched after this list illustrates this).
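
    To make the fractional-fair claim concrete, here is a small stand-alone
    simulation of the algorithm above (a sketch only; the two-task array,
    the tick count and the constants are invented for illustration, they
    are not scheduler code):

    #include <stdio.h>

    #define NOMINAL_TICKS	4
    #define NTASKS		2
    #define TICKS		100000L

    struct task {
	int	priority;
	int	slice;
	long	ran;		/* ticks actually received */
    };

    int
    main(void)
    {
	/* Two always-runnable tasks, priorities 3 and 1. */
	struct task tasks[NTASKS] = { { 3, 0, 0 }, { 1, 0, 0 } };
	int agg = 3 + 1;	/* GlobalAgg: both tasks stay queued */
	int cur = 0;		/* index of the task currently running */
	long t;

	for (t = 0; t < TICKS; t++) {
	    tasks[cur].ran++;
	    /* sched_clock(): dock the running task by the aggregate. */
	    tasks[cur].slice -= agg;
	    if (tasks[cur].slice < 0) {
		/* sched_choose(): refill and skip exhausted tasks. */
		do {
		    cur = (cur + 1) % NTASKS;
		    if (tasks[cur].slice < 0)
			tasks[cur].slice += tasks[cur].priority * NOMINAL_TICKS;
		} while (tasks[cur].slice < 0);
	    }
	}
	printf("pri 3: %ld ticks (%.1f%%)\n", tasks[0].ran,
	    100.0 * tasks[0].ran / TICKS);
	printf("pri 1: %ld ticks (%.1f%%)\n", tasks[1].ran,
	    100.0 * tasks[1].ran / TICKS);
	return(0);
    }

    With priorities 3 and 1 the first task ends up with almost exactly 3/4
    of the ticks and the second with 1/4 -- the priority/aggregate
    fraction -- independent of the clock granularity.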

    In the version I'm working up I'm adding additional code to deal with
    interactive processes (reducing the priority of batch processes almost
    exactly the same way it's done in your code, except using a runtime and
    a slptime instead of just a slptime).
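
    An adjustment along those lines might look something like this (a sketch
    only; the field names, the INTERACT_MAX constant and the scaling are
    invented for illustration, they are not the actual code from either
    scheduler):

    struct task {
	int	base_priority;	/* static priority */
	int	priority;	/* effective priority fed into the FFS math */
	int	runtime;	/* recent ticks spent running */
	int	slptime;	/* recent ticks spent sleeping */
    };

    #define INTERACT_MAX	100

    /* 0 == purely sleeping (interactive), INTERACT_MAX == purely cpu-bound */
    static int
    interact_score(struct task *task)
    {
	if (task->runtime + task->slptime == 0)
		return(0);
	return(INTERACT_MAX * task->runtime /
	    (task->runtime + task->slptime));
    }

    static void
    adjust_priority(struct task *task)
    {
	int score = interact_score(task);

	/*
	 * Batch (cpu-bound) tasks have their effective priority, and thus
	 * their fair-share fraction, reduced; tasks that mostly sleep keep
	 * their full base priority.
	 */
	task->priority = task->base_priority -
	    task->base_priority * score / INTERACT_MAX;
	if (task->priority < 1)
		task->priority = 1;
    }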

    There is one big downside to a fractional fair scheduler, at least as
    I have described it, and that is a certain degree of inefficiency if you
    have hundreds of runnable low priority processes and just a couple of
    high priority processes.  The algorithm does compensate somewhat
    for this situation by giving the high priority processes larger slices
    (due to the priority multiplication in sched_choose() versus the
    global aggregate), but it's still an issue.  There is also an issue
    with interrupt latency, though the vast majority of interrupts do in fact
    run almost instantly due to the front-loading in sched_add().

						    -Matt


