Date: Tue, 2 Oct 2007 23:49:34 +1000 (EST)
From: Bruce Evans <brde@optusnet.com.au>
To: Jeff Roberson
Cc: cvs-all@freebsd.org, src-committers@freebsd.org, cvs-src@freebsd.org,
    Jeff Roberson, Garance A Drosehn, Ben Kaduk, Bruce Evans
Subject: Re: cvs commit: src/sys/kern sched_ule.c

On Mon, 1 Oct 2007, Jeff Roberson wrote:

> On Tue, 2 Oct 2007, Bruce Evans wrote:
>> Further testing of my ~4BSD scheduler in ~5.2 indicates that when a
>> process wants less than about 1/loadavg of the CPU on average, it
>> usually just gets it, with no scheduling delays, since it usually has
>> higher priority than all other user processes.  Otherwise, the
>> worst-case scheduling delays increase from ~10 msec to ~2 seconds.
>> It is easy to reduce the scheduling quantum from its default of
>> 100 msec by a factor of 100, but this doesn't seem to work right.
>> So the behaviour is very dependent on the load and on the amount of
>> CPU wanted by the interactive process.

[Read the middle of this bloated mail, about debugging ULE, first.]

This is only for my ~5.2 etc. with the queuing hack backed out.  I think
real 5.2 and 4.x act similarly, except at least 4.x has a bad policy for
priority inheritance on fork/exit which can cause the priority to grow
exponentially in the number of descendants (except it is clamped to a
maximum, so the growth is just nonlinear and breaks various things when
the limit is reached).  I tested a 4.10 kernel a bit today but didn't
have enough 4.x utilities in my userland to see what it is doing.
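As a toy illustration of that fork/exit policy (this is not the 4.x
code, and the numbers are made up): assume the child starts with its
parent's CPU estimate at fork, and the parent absorbs the child's
estimate when the child exits, clamped to a maximum.  A parent that
repeatedly forks and reaps children then sees its estimate double per
generation until the clamp flattens the growth:

%%%
/*
 * Toy model only: the child inherits the parent's CPU estimate at
 * fork; the parent absorbs the child's estimate at exit, clamped to
 * an arbitrary maximum.
 */
#include <stdio.h>

#define	ESTCPU_MAX	100		/* arbitrary clamp for the model */

static unsigned
estcpu_clamp(unsigned e)
{
	return (e > ESTCPU_MAX ? ESTCPU_MAX : e);
}

int
main(void)
{
	unsigned parent = 1;		/* parent's initial estimate */
	unsigned child;
	int gen;

	for (gen = 1; gen <= 10; gen++) {
		child = parent;		/* inherited at fork */
		/* ... child runs and exits ... */
		parent = estcpu_clamp(parent + child);	/* absorbed at exit */
		printf("generation %d: parent estimate %u\n", gen, parent);
	}
	return (0);
}
%%%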
-current with 4BSD is much worse than this.  I observed a worst-case
scheduling delay of > 26 seconds.  Mouse movements are jerky.

-current with ULE, after debugging the configuration, is slightly worse
than this.  Mouse movements aren't jerky.  But ULE seems to often
mispredict when a process is interactive, and it sometimes gets into a
state where one process (not an interactive one) is given 100% CPU for
too long while many other processes are runnable.

>> ...
>>
>> I now have more experience with ULE.  A version built today gave
>> dramatically worse interactivity, so much so that I think it must
>> have been broken recently.  A simple shell loop hangs the rest of the
>> system in some cases, and a background build has similar bad effects,
>> probably limited mainly by useful loops not being endless.
>
> I'm not able to reproduce this and no one else has reported it.

This always happens with hz = 100.  Reducing preempt_thresh to below
about 50 mostly fixes the problem, and reducing the threshold to 0 fixes
the problem a bit more.  The shell loop processes still take too long to
start up (often several seconds for just 20), but the second process
starts within a second, instead of showing signs of taking forever to
start up.  Apparently, in the broken case, an IPI to stop the first
process is never delivered.  ^Z works to stop the whole process group,
and then two %'s usually result in proceeding to the next process.
Having to use two %'s is strange but may be just a shell bug.  -current
with 4BSD also takes too long to start all the processes, while ~5.2
restarts them all apparently-instantly.  In fact it starts them too
fast and runs into the old exec resource shortage bug after 16
processes, and 3 or 4 of the starts fail in exec.

With hz = 1000 and ULE, the default preempt_thresh of 64 works but
reducing it to 0 works better.  Startup is still too slow.  Apparently,
either there is a scaling bug for hz or the extra interrupts for the
larger hz help, and the default preempt_thresh is not best.

I saw this behaviour for 2 different kernels:
- SMP kernel (all this is running on an A64 UP in i386 mode) built on
  Aug 5.  Timer interrupts were via the APIC.  hz was set to 100 at
  boot time.  stathz was always 100 and in perfect sync with hz.
  (Plain -current with APIC timer interrupts gives a broken stathz of
  13 when hz is 100, and stathz in bogus sync with hz.)
- UP kernel built today.  Timer interrupts were via the i8254 and the
  RTC.  hz was set to 100 or 1000 at boot time.  stathz was always 128.
The different interrupt configuration and timing (except for increasing
hz for ULE) made little difference.  The SMP kernel got a bit further
in the shell loop startup when hz = 100 but otherwise behaved similarly.
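For reference, the hz, stathz and profhz that a particular kernel
reports can be checked from userland via the kern.clockrate sysctl
(sysctl(8) prints the same struct from the command line).  A minimal
sketch using sysctlbyname(3):

%%%
/* Print the kernel's reported clock rates from kern.clockrate. */
#include <sys/types.h>
#include <sys/sysctl.h>
#include <sys/time.h>

#include <err.h>
#include <stdio.h>

int
main(void)
{
	struct clockinfo ci;
	size_t len = sizeof(ci);

	if (sysctlbyname("kern.clockrate", &ci, &len, NULL, 0) == -1)
		err(1, "sysctlbyname(kern.clockrate)");
	printf("hz = %d, stathz = %d, profhz = %d\n",
	    ci.hz, ci.stathz, ci.profhz);
	return (0);
}
%%%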
> This may be the result of some incompatibility between bdebsd and ULE.

Nah, I don't use ULE in bdebsd (except all userland is bdebsd), and I
don't touch schedulers in -current (I mainly touch filesystems and
network drivers).  Current kernels are remarkably compatible with old
userlands.

> Is this a SMP machine?  Do you have PREEMPTION enabled?  ULE recently
> started honoring preemption.  Try setting:

See above.  Always PREEMPTION for UP, since without it problems like
the above are almost to be expected.  I think 5.2 has them.  ~5.2
preempts a lot as a side effect of switching context for clock
interrupt handlers and then (without the queueing hack) rescheduling on
switching back.

> kern.sched.preempt_thresh: 64

But this setting is part of the problem.

> if it is not already.  I know you deal with hardclock differently.
> Without PREEMPTION it may not work correctly.

No, the difference for hardclock is not in ULE kernels.

>> First I tried an old regression test for nice[1-2]:
>>
>> %%%
>> for i in 0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20
>> do
>> nice -$i sh -c "while :; do echo -n;done" &
>> done
>> top -o time
>> %%%
>
> I use this:
> for i in -20 -16 -12 -8 -4 0 4 8 12 16 20
> do
> nice -$i sh -c "while :; do echo -n;done" &
> done
> top -o time
>
> I like to verify that the distribution doesn't get out of whack.

Then the non-multiple-of-4 entries in my list are almost useless.  I
mostly use the [0-20] list because it is in the first file in a test
directory and doesn't have any negative values, so it doesn't need
privilege to run.

> It takes some time to settle before the higher nice threads get enough
> runtime to sort properly.  My results are as so:

The settling time/inertia is both a bug and a feature.  It's good to
have inertia for long-running processes, but makeworld can start
several hundred processes per second and finish many of them, so there
is nowhere near enough settling time for these processes and their
behaviour is hard to predict.

>  868 root  1  81 -20  3492K  1404K RUN  0:28 23.58% sh
>  869 root  1  83 -16  3492K  1404K RUN  0:20 15.09% sh
>  870 root  1  86 -12  3492K  1404K RUN  0:16 12.16% sh
>  871 root  1  90  -8  3492K  1404K RUN  0:12  8.89% sh
>  872 root  1  93  -4  3492K  1404K RUN  0:11  7.96% sh
>  873 root  1  97   0  3492K  1404K RUN  0:09  6.59% sh
>  874 root  1 101   4  3492K  1404K RUN  0:08  4.88% sh
>  875 root  1 105   8  3492K  1404K RUN  0:07  5.37% sh
>  876 root  1 109  12  3492K  1404K RUN  0:06  3.37% sh
>  877 root  1 113  16  3492K  1404K RUN  0:06  4.05% sh
>  878 root  1 116  20  3492K  1404K RUN  0:05  3.96% sh
>
> Really might not be enough difference with positive nice values.  I've
> never really had a good feeling about how nice should really behave
> but this mostly seems reasonable.  It would be possible to tweak the
> algorithm to further penalize nice.

I still use a table-driven algorithm with weights 2**(nice_value/4).
This gives a dynamic range of a factor of 1024.

>> This hung after starting only about one of the shell processes.
>> After cutting the list down to just one process with nice -20, it
>> still hung.  Shells on other syscons terminals running at rtprio 0
>> could not compete with the nice -20 process:
>> - they could not start top to look at what was happening
>> - an already-running top could not display anything new
>> - they could not start killall.
>> With the list cut down to about 6 processes, ps in ddb showed
>> evidence of all the processes starting, and I was able to kill them
>> all using kill in ddb.

Fixed using larger hz and/or smaller preempt_thresh; ddb wasn't
necessary since ^Z worked (if I hit it before ^C?) -- see above.

>> [hz = 100 case not so bad]

Other strange behaviour with preempt_thresh = 64, at least with
hz = 100: start two identical CPU hogs, each with a runtime of 2.5
seconds, on separate consoles.  Then one is given 100% of the CPU until
it completes, and it is always the second one started that gets 100%
CPU first.  Thus the first one started takes about 5.0 seconds to
complete and the second one started takes about 2.5 seconds to
complete.

>> Running makeworld with just -j4 in the background gives similar
>> symptoms.  When a new process is started, it sometimes gets too many
>> cycles to begin with, and apparently completely stops all processes
>> in the makeworld (but not the top displaying things) for several
>> seconds.  After a while (I guess when the interactivity score
>> decreases), this behaviour changes to giving the new process very few
>> cycles even if it is semi-interactive (a foreground process started
>> from a shell).

~5.2 behaves similarly, but I think a little better.  In ~5.2 (and
maybe in all schedulers), the initial priority is just a function of
the parent's priority (I use a simple function that might be slightly
different from 5.2's; I forget what it is).  If neither the parent nor
the child runs for long, then new processes tend to get almost all the
CPU until they run for too long.  When the children exit, the parent
inherits some priority according to another simple function.  ~5.2
works best here since it uses better functions than 5.2 does (much
better than the exponential functions in 4.x), and it keeps track of
history better than ULE can.

I tested this mainly using:

	time /tmp/q1 &
	time /tmp/q1 &
	acroread *pdf		# type ^q to exit acroread

where /tmp/q1 measures latency by calling clock_gettime() in a loop and
there are 12 pdf files of total size 4.75MB.  acroread is sufficiently
bloated and hoggish to have very bad behaviour here.
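/tmp/q1 is nothing special.  A minimal sketch of that kind of latency
measurer (not the actual program; the real one presumably does a fixed
amount of work to get ~2.5 seconds of self time, while this sketch just
spins for a fixed wall-clock interval) records the largest gap seen
between consecutive clock_gettime() samples:

%%%
#include <stdio.h>
#include <time.h>

/* Seconds between two timespecs. */
static double
tsdiff(const struct timespec *a, const struct timespec *b)
{
	return ((b->tv_sec - a->tv_sec) + (b->tv_nsec - a->tv_nsec) * 1e-9);
}

int
main(void)
{
	struct timespec start, prev, now;
	double gap, maxgap = 0.0;

	clock_gettime(CLOCK_MONOTONIC, &start);
	prev = start;
	do {
		clock_gettime(CLOCK_MONOTONIC, &now);
		gap = tsdiff(&prev, &now);
		if (gap > maxgap)
			maxgap = gap;
		prev = now;
	} while (tsdiff(&start, &now) < 2.5);	/* ~2.5 seconds wall clock */
	printf("max latency: %.6f seconds\n", maxgap);
	return (0);
}
%%%

The gap between consecutive samples is normally well under a
microsecond, so a large maximum is mostly time during which the process
was not running, i.e. a scheduling delay.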
The results when this is run on an xterm that has initially been idle
for some time (or is in some more magic state for ULE interactivity?)
at loadavg 20 are approximately:

all: acroread starts fast for the first few runs (would be ~1 second
with no load; this only increases by a second or two).  /tmp/q1 runs
for ~2.5 seconds self time and shows low max latency (would be ~200
usec with no load; this increases to ~10 msec; both high variance).

~5.2-4BSD: after a few runs, the parent priority becomes near the max,
so further runs take 5-10 seconds to start.  20 seconds at a load avg
of 20 would be fairer, but the parent priority doesn't get as near the
max as the background hogs' priorities.  After a few runs, max latency
is usually 100-500 msec and was once 2 seconds.  Latency in mouse
movements is not noticeable.

current-4BSD: further runs don't take much longer to start.  Apparently
the parent doesn't inherit enough priority.  (In 4.2 it inherited far
too much.)  After a few runs, max latency is usually 1-2 seconds and
was once 27 seconds.  The latency of 1-2 seconds is often noticeable
for mouse movements and even for echo in xterms.

current-ULE: further runs sometimes take _much_ longer, a minute or so,
and there is a high variance in the length.  After a few runs, max
latency is usually a few hundred msec larger than for ~5.2.  Latency in
mouse movements is not noticeable.

>> In at least this phase, ^C to kill processes doesn't work, but ^Z to
>> suspend them and then kill from the shell works normally, and
>> interactivity in not-very-bloated mail programs and editors is very
>> bad.

A ^C fails only in the phase where hz is small, preempt_thresh is
larger, and (?) the parent hasn't gained much priority and/or
(negative?) interactivity.

>> Other behaviour with 4BSD schedulers and various kernels:
>> - the max scheduling delay is almost independent of the CPU speed.

This may be because it is just a function of the priorities, which are
mainly a function of the algorithm.

>> - the max scheduling delay is slightly worse for -current with 4BSD
>>   than with my ~5.2.

Actually, it is much worse.

>> - -current has anomalous behaviour relative to ~5.2 for background
>>   makeworld -j16: many fewer runnable processes, a much smaller max
>>   load average, and many more zombies visible when top looks.

This may be related to the slow startup of the shell loops and caused
by the priority inheritance for fork/exit.

>> - [queue hack]
>>   ...
>>   essentially roundrobin scheduling under loads that generate lots
>>   of interrupts.  Interactivity is still poor because makeworld
>>   sometimes generates a few hundred processes per second and cycling
>>   through that many takes a long time even with a tiny quantum.

makeworld actually generates remarkably few interrupts when run on disk
file systems (an average of only about 30 non-clock interrupts per
second in my config).

>> - reducing kern.sched.quantum never had much effect.  Same for
>>   increasing HZ in -current with 4BSD.

Bruce