Date: Sun, 26 Jan 2003 11:13:39 -0800 (PST)
From: Matthew Dillon <dillon@apollo.backplane.com>
To: Jeff Roberson <jroberson@chesapeake.net>
Cc: Steve Kargl <sgk@troutmask.apl.washington.edu>, Robert Watson <rwatson@FreeBSD.ORG>, Gary Jennejohn <garyj@jennejohn.org>, <arch@FreeBSD.ORG>
Subject: Re: New scheduler - Interactivity fixes
Message-ID: <200301261913.h0QJDd8c051876@apollo.backplane.com>
References: <20030126035429.A64928-100000@mail.chesapeake.net>
:I actually have a local patch that does just this. It didn't improve the
:situation for my buildworld -j4 on a dual box. I'd like to leave the
:oncpu alone in sched_fork() as you suggest and instead move it to a new
:call, sched_exec(). My logic here is that since you're completely
:replacing the vm space you lose any locality advantage, so you might as
:well pick the least loaded cpu.

    I'll play with this a little today to see if I can improve the build
    times. The real time is still pretty horrendous:

	/usr/bin/time make -j 8 buildworld;	# DELL2550 2xCPU

	ULE  - 3435.42 real  2500.73 user  581.20 sys
	       3343.95 real  2501.86 user  581.67 sys
	4BSD - 2414.84 real  2648.92 user  758.28 sys

    Though these numbers, like the numbers you originally posted, seem to
    imply that if we play our cards right we can improve user and system
    times by doing a better job of avoiding cache issues when we do start
    stealing processes from other cpus' queues.

:I think we need a push and a pull. The push could run periodically and
:sort the load of all cpus then see how far off the least and most loaded
:are. I need to come up with some metric that is more interesting than the
:number of entries in the runq for the load balancing though. It doesn't
:take into consideration the priority spread. Also, the run queue depth
:can change so quickly if processes are doing lots of IO.

    I don't think periodic balancing has ever worked. The balancing really
    has to be near-term or instantaneous (in sched_choose() itself) to be
    effective. Run-queue depth isn't as important as being able to give a
    process the ability to run until it next blocks (i.e. reduce
    involuntary context switches). Not an easy problem, to be sure, since
    excessive latency also causes problems.

:It would be nice to have something like the total slice time of all
:runnable processes and processes sleeping for a very short period of time
:on a given cpu. Since the slice size is related to the priority, you
:would get a much more even load that way.

    Total slice time is getting into fractional-fair-share scheduler
    methodology. I'll outline it below (and later today I should have a
    sched_ffs.c to show you; it's turning out to be easier than I thought
    it would be!).

    A fractional fair share scheduler basically works like this. In this
    example, higher priorities are better (in FreeBSD it's reversed:
    higher priorities are worse):

	sched_add(task)
	{
		GlobalAgg += task->priority;	/* aggregate runnable priority */
		AddHead(queue, task);		/* newly runnable tasks go to the head */
	}

	sched_del(task)
	{
		GlobalAgg -= task->priority;
		RemoveNode(task);
	}

	sched_clock()
	{
		task->slice -= GlobalAgg;	/* dock the running task each tick */
		if (task->slice < 0)
			involuntary_switch();
	}

	sched_choose()
	{
		while ((task = RemoveHead(queue)) != NULL) {
			if (task->slice >= 0)
				break;
			/* refresh the slice in proportion to priority, requeue */
			task->slice += task->priority * NOMINAL_TICKS;	/* usually 4 */
			AddTail(queue, task);
		}
		return (task);
	}

    There are some fundamental features to this sort of scheduler:

    * Scheduled tasks are added to the head of the queue. You don't get
      into races because a task's slice is only improved when the chooser
      decides to skip it. This generally gives you optimal scheduling for
      interrupts and pipes (two processes ping-ponging each other with
      data). The two processes remain cohesive until they both run out of
      their time slice, which is extremely cache efficient.

    * The time slice is docked by the aggregate priority, which means that
      lower priority or currently-running tasks lose their time slice more
      quickly when a high priority task is scheduled.

    * The time slice is fractional-fair. The priority is not sorted, and
      overall cpu utilization will be almost exactly the fraction of the
      task's priority versus the aggregate priority, regardless of the
      granularity of the clock (see the sketch below).
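    The fractional-fairness claim in the last bullet is easy to check
    empirically. Below is a minimal user-space simulation of the algorithm
    as outlined above; it is only a sketch under stated assumptions, and
    all of the names in it (struct ffs_task, ffs_choose(), and so on) are
    invented for the example; it is not the sched_ffs.c mentioned above.
    Three always-runnable tasks with priorities 1, 2 and 4 should end up
    with roughly 1/7, 2/7 and 4/7 of the ticks.

	/*
	 * Minimal user-space simulation of the fractional fair share
	 * algorithm outlined above.  All names are hypothetical; this is
	 * an illustrative sketch, not the real sched_ffs.c.
	 */
	#include <stdio.h>

	#define NTASKS		3
	#define NOMINAL_TICKS	4
	#define TOTAL_TICKS	700000

	struct ffs_task {
		int	priority;
		long	slice;
		long	ticks;		/* ticks actually received */
	};

	static struct ffs_task tasks[NTASKS] = {
		{ 1, 0, 0 }, { 2, 0, 0 }, { 4, 0, 0 }
	};
	static long global_agg;		/* sum of runnable priorities */

	/* tiny circular queue of task indices standing in for the runq */
	static int qbuf[NTASKS], qhead, qcount;

	static void
	add_head(int t)
	{
		qhead = (qhead + NTASKS - 1) % NTASKS;
		qbuf[qhead] = t;
		qcount++;
	}

	static void
	add_tail(int t)
	{
		qbuf[(qhead + qcount) % NTASKS] = t;
		qcount++;
	}

	static int
	remove_head(void)
	{
		int t = qbuf[qhead];

		qhead = (qhead + 1) % NTASKS;
		qcount--;
		return (t);
	}

	/*
	 * sched_choose(): refresh and requeue tasks that are out of
	 * slice.  The queue is never empty in this simulation, so no
	 * NULL check is needed.
	 */
	static int
	ffs_choose(void)
	{
		int t;

		for (;;) {
			t = remove_head();
			if (tasks[t].slice >= 0)
				return (t);
			tasks[t].slice += (long)tasks[t].priority * NOMINAL_TICKS;
			add_tail(t);
		}
	}

	int
	main(void)
	{
		int cur, t;
		long tick;

		for (t = 0; t < NTASKS; t++) {
			global_agg += tasks[t].priority;
			add_tail(t);
		}
		cur = ffs_choose();
		for (tick = 0; tick < TOTAL_TICKS; tick++) {
			tasks[cur].ticks++;
			tasks[cur].slice -= global_agg;	/* sched_clock() */
			if (tasks[cur].slice < 0) {	/* involuntary switch */
				add_head(cur);		/* sched_add(): to the head */
				cur = ffs_choose();
			}
		}
		for (t = 0; t < NTASKS; t++)
			printf("priority %d: got %.4f of cpu (ideal %.4f)\n",
			    tasks[t].priority,
			    (double)tasks[t].ticks / TOTAL_TICKS,
			    (double)tasks[t].priority / global_agg);
		return (0);
	}

    Over 700000 simulated ticks the printed shares should land within a
    fraction of a percent of 1/7, 2/7 and 4/7, regardless of how coarse
    the tick is, which is exactly the property the bullet above describes.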
    In the version I'm working up I'm adding additional code to deal with
    interactive processes (reducing the priority of batch processes in
    almost exactly the same way it's done in your code, except using a
    runtime and a slptime instead of just a slptime; a rough sketch of
    what such a test might look like follows at the end of this message).

    There is one big downside to a fractional fair scheduler, at least as
    I have described it, and that is a certain degree of inefficiency if
    you have hundreds of runnable low priority processes and just a couple
    of high priority processes. The algorithm does compensate somewhat for
    this situation by giving the high priority processes larger slices
    (due to the priority multiplication in sched_choose() versus the
    global-aggregate dock in sched_clock()), but it's still an issue.
    There is also an issue with interrupt latency, though the vast
    majority of interrupts do in fact run almost instantly due to the
    front-loading in sched_add().

						-Matt
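    As a rough illustration of the runtime/slptime idea above, here is a
    hypothetical sketch of an interactivity test. The names (struct sked,
    interact_score(), INTERACT_THRESH) and the 0..100 scale are invented
    for the example; this is an assumption about the shape of the
    heuristic, not the actual ULE code or the code Matt describes.

	/*
	 * Hypothetical sketch of a runtime-vs-slptime interactivity
	 * test.  The idea: a process that sleeps much more than it runs
	 * is "interactive" and keeps a good priority; one that runs much
	 * more than it sleeps is a batch process and gets its priority
	 * reduced.
	 */
	#include <stdio.h>

	#define INTERACT_MAX	100	/* 0 (interactive) .. 100 (batch) */
	#define INTERACT_THRESH	30	/* arbitrary cutoff for this sketch */

	struct sked {
		unsigned long runtime;	/* ticks spent running */
		unsigned long slptime;	/* ticks spent voluntarily sleeping */
	};

	/*
	 * Scale the run/sleep ratio into 0..INTERACT_MAX.  A pure
	 * sleeper scores near 0, a pure cpu hog near INTERACT_MAX.
	 */
	static int
	interact_score(const struct sked *s)
	{
		unsigned long total = s->runtime + s->slptime;

		if (total == 0)
			return (0);	/* no history yet: assume interactive */
		return ((int)(s->runtime * INTERACT_MAX / total));
	}

	int
	main(void)
	{
		struct sked shell = { 5, 95 };		/* mostly sleeping */
		struct sked cruncher = { 98, 2 };	/* mostly running */

		printf("shell score:    %d (%s)\n", interact_score(&shell),
		    interact_score(&shell) < INTERACT_THRESH ?
		    "interactive" : "batch");
		printf("cruncher score: %d (%s)\n", interact_score(&cruncher),
		    interact_score(&cruncher) < INTERACT_THRESH ?
		    "interactive" : "batch");
		return (0);
	}

    Using both numbers lets a process earn back interactivity by sleeping,
    while a pure slptime test can be gamed by a batch process that sleeps
    briefly at just the right moments.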