Date: Tue, 2 Oct 2007 23:49:34 +1000 (EST)
From: Bruce Evans <brde@optusnet.com.au>
To: Jeff Roberson
Cc: cvs-all@freebsd.org, src-committers@freebsd.org, cvs-src@freebsd.org,
    Jeff Roberson, Garance A Drosehn, Ben Kaduk, Bruce Evans
Subject: Re: cvs commit: src/sys/kern sched_ule.c

On Mon, 1 Oct 2007, Jeff Roberson wrote:

> On Tue, 2 Oct 2007, Bruce Evans wrote:
>> Further testing of my ~4BSD scheduler in ~5.2 indicates that when a
>> process wants less than about 1/loadavg of the CPU on average, it
>> usually just gets it, with no scheduling delays, since it usually has
>> higher priority than all other user processes.  Otherwise, the
>> worst-case scheduling delays increase from ~10 msec to ~2 seconds.
>> It is easy to reduce the scheduling quantum from its default of
>> 100 msec by a factor of 100, but this doesn't seem to work right.
>> So the behaviour is very dependent on the load and on the amount of
>> CPU wanted by the interactive process.

[Read the middle of this bloated mail, about debugging ULE, first.]

This is only for my ~5.2 etc. with the queuing hack backed out.  I think
real 5.2 and 4.x act similarly, except at least 4.x has a bad policy for
priority inheritance on fork/exit which can cause the priority to grow
exponentially in the number of descendants (except it is clamped to a
maximum, so the growth is just nonlinear and breaks various things when
the limit is reached).  I tested a 4.10 kernel a bit today but didn't
have enough 4.x utilities in my userland to see what it is doing.
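As a toy illustration of that fork/exit policy (this is not the 4.x
code, and the numbers are made up): assume the child starts with its
parent's CPU estimate at fork, and the parent absorbs the child's
estimate when the child exits, clamped to a maximum.  A parent that
repeatedly forks and reaps children then sees its estimate double per
generation until the clamp flattens the growth:

%%%
/*
 * Toy model only: the child inherits the parent's CPU estimate at
 * fork; the parent absorbs the child's estimate at exit, clamped to
 * an arbitrary maximum.
 */
#include <stdio.h>

#define	ESTCPU_MAX	100		/* arbitrary clamp for the model */

static unsigned
estcpu_clamp(unsigned e)
{
	return (e > ESTCPU_MAX ? ESTCPU_MAX : e);
}

int
main(void)
{
	unsigned parent = 1;		/* parent's initial estimate */
	unsigned child;
	int gen;

	for (gen = 1; gen <= 10; gen++) {
		child = parent;		/* inherited at fork */
		/* ... child runs and exits ... */
		parent = estcpu_clamp(parent + child);	/* absorbed at exit */
		printf("generation %d: parent estimate %u\n", gen, parent);
	}
	return (0);
}
%%%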
-current with 4BSD is much worse than this.  I observed a worst-case
scheduling delay of > 26 seconds.  Mouse movements are jerky.

-current with ULE, after debugging the configuration, is slightly worse
than this.  Mouse movements aren't jerky.  But ULE seems to often
mispredict when a process is interactive, and it sometimes gets into a
state where one process (not an interactive one) is given 100% CPU for
too long while many other processes are runnable.

>> ...
>>
>> I now have more experience with ULE.  A version built today gave
>> dramatically worse interactivity, so much so that I think it must
>> have been broken recently.  A simple shell loop hangs the rest of the
>> system in some cases, and a background build has similar bad effects,
>> probably limited mainly by useful loops not being endless.
>
> I'm not able to reproduce this and no one else has reported it.

This always happens with hz = 100.  Reducing preempt_thresh to below
about 50 mostly fixes the problem, and reducing the threshold to 0 fixes
the problem a bit more.  The shell loop processes still take too long to
start up (often several seconds for just 20), but the second process
starts within a second, instead of showing signs of taking forever to
start up.  Apparently, in the broken case, an IPI to stop the first
process is never delivered.  ^Z works to stop the whole process group,
and then two %'s usually result in proceeding to the next process.
Having to use two %'s is strange but may be just a shell bug.  -current
with 4BSD also takes too long to start all the processes, while ~5.2
restarts them all apparently-instantly.  In fact it starts them too
fast and runs into the old exec resource shortage bug after 16
processes, and 3 or 4 of the starts fail in exec.

With hz = 1000 and ULE, the default preempt_thresh of 64 works but
reducing it to 0 works better.  Startup is still too slow.  Apparently,
either there is a scaling bug for hz or the extra interrupts for the
larger hz help, and the default preempt_thresh is not best.

I saw this behaviour for 2 different kernels:
- SMP kernel (all this is running on an A64 UP in i386 mode) built on
  Aug 5.  Timer interrupts were via the APIC.  hz was set to 100 at
  boot time.  stathz was always 100 and in perfect sync with hz.
  (Plain -current with APIC timer interrupts gives a broken stathz of
  13 when hz is 100, and stathz in bogus sync with hz.)
- UP kernel built today.  Timer interrupts were via the i8254 and the
  RTC.  hz was set to 100 or 1000 at boot time.  stathz was always 128.
The different interrupt configuration and timing (except for increasing
hz for ULE) made little difference.  The SMP kernel got a bit further
in the shell loop startup when hz = 100 but otherwise behaved similarly.
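For reference, the hz, stathz and profhz that a particular kernel
reports can be checked from userland via the kern.clockrate sysctl
(sysctl(8) prints the same struct from the command line).  A minimal
sketch using sysctlbyname(3):

%%%
/* Print the kernel's reported clock rates from kern.clockrate. */
#include <sys/types.h>
#include <sys/sysctl.h>
#include <sys/time.h>

#include <err.h>
#include <stdio.h>

int
main(void)
{
	struct clockinfo ci;
	size_t len = sizeof(ci);

	if (sysctlbyname("kern.clockrate", &ci, &len, NULL, 0) == -1)
		err(1, "sysctlbyname(kern.clockrate)");
	printf("hz = %d, stathz = %d, profhz = %d\n",
	    ci.hz, ci.stathz, ci.profhz);
	return (0);
}
%%%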
> This may be the result of some incompatibility between bdebsd and ULE.

Nah, I don't use ULE in bdebsd (except all userland is bdebsd), and I
don't touch schedulers in -current (I mainly touch filesystems and
network drivers).  Current kernels are remarkably compatible with old
userlands.

> Is this a SMP machine?  Do you have PREEMPTION enabled?  ULE recently
> started honoring preemption.  Try setting:

See above.  Always PREEMPTION for UP, since without it problems like
the above are almost to be expected.  I think 5.2 has them.  ~5.2
preempts a lot as a side effect of switching context for clock
interrupt handlers and then (without the queueing hack) rescheduling on
switching back.

> kern.sched.preempt_thresh: 64

But this setting is part of the problem.

> if it is not already.  I know you deal with hardclock differently.
> Without PREEMPTION it may not work correctly.

No, the difference for hardclock is not in ULE kernels.

>> First I tried an old regression test for nice[1-2]:
>>
>> %%%
>> for i in 0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20
>> do
>> nice -$i sh -c "while :; do echo -n;done" &
>> done
>> top -o time
>> %%%
>
> I use this:
> for i in -20 -16 -12 -8 -4 0 4 8 12 16 20
> do
> nice -$i sh -c "while :; do echo -n;done" &
> done
> top -o time
>
> I like to verify that the distribution doesn't get out of whack.

Then the non-multiple-of-4 entries in my list are almost useless.  I
mostly use the [0-20] list because it is in the first file in a test
directory and doesn't have any negative values, so it doesn't need
privilege to run.

> It takes some time to settle before the higher nice threads get enough
> runtime to sort properly.  My results are as so:

The settling time/inertia is both a bug and a feature.  It's good to
have inertia for long-running processes, but makeworld can start
several hundred processes per second and finish many of them, so there
is nowhere near enough settling time for these processes and their
behaviour is hard to predict.

>  868 root  1  81 -20  3492K  1404K RUN  0:28 23.58% sh
>  869 root  1  83 -16  3492K  1404K RUN  0:20 15.09% sh
>  870 root  1  86 -12  3492K  1404K RUN  0:16 12.16% sh
>  871 root  1  90  -8  3492K  1404K RUN  0:12  8.89% sh
>  872 root  1  93  -4  3492K  1404K RUN  0:11  7.96% sh
>  873 root  1  97   0  3492K  1404K RUN  0:09  6.59% sh
>  874 root  1 101   4  3492K  1404K RUN  0:08  4.88% sh
>  875 root  1 105   8  3492K  1404K RUN  0:07  5.37% sh
>  876 root  1 109  12  3492K  1404K RUN  0:06  3.37% sh
>  877 root  1 113  16  3492K  1404K RUN  0:06  4.05% sh
>  878 root  1 116  20  3492K  1404K RUN  0:05  3.96% sh
>
> Really might not be enough difference with positive nice values.  I've
> never really had a good feeling about how nice should really behave
> but this mostly seems reasonable.  It would be possible to tweak the
> algorithm to further penalize nice.

I still use a table-driven algorithm with weights 2**(nice_value/4).
This gives a dynamic range of a factor of 1024.

>> This hung after starting only about one of the shell processes.
>> After cutting the list down to just one process with nice -20, it
>> still hung.  Shells on other syscons terminals running at rtprio 0
>> could not compete with the nice -20 process:
>> - they could not start top to look at what was happening
>> - an already-running top could not display anything new
>> - they could not start killall.
>> With the list cut down to about 6 processes, ps in ddb showed
>> evidence of all the processes starting, and I was able to kill them
>> all using kill in ddb.

Fixed using larger hz and/or smaller preempt_thresh; ddb wasn't
necessary since ^Z worked (if I hit it before ^C?) -- see above.

>> [hz = 100 case not so bad]

Other strange behaviour with preempt_thresh = 64, at least with
hz = 100: start two identical CPU hogs, each with a runtime of 2.5
seconds, on separate consoles.  Then one is given 100% of the CPU until
it completes, and it is always the second one started that gets 100%
CPU first.  Thus the first one started takes about 5.0 seconds to
complete and the second one started takes about 2.5 seconds to
complete.

>> Running makeworld with just -j4 in the background gives similar
>> symptoms.  When a new process is started, it sometimes gets too many
>> cycles to begin with, and apparently completely stops all processes
>> in the makeworld (but not the top displaying things) for several
>> seconds.  After a while (I guess when the interactivity score
>> decreases), this behaviour changes to giving the new process very few
>> cycles even if it is semi-interactive (a foreground process started
>> from a shell).

~5.2 behaves similarly, but I think a little better.  In ~5.2 (and
maybe in all schedulers), the initial priority is just a function of
the parent's priority (I use a simple function that might be slightly
different from 5.2's; I forget what it is).  If neither the parent nor
the child runs for long, then new processes tend to get almost all the
CPU until they run for too long.  When the children exit, the parent
inherits some priority according to another simple function.  ~5.2
works best here since it uses better functions than 5.2 does (much
better than the exponential functions in 4.x), and it keeps track of
history better than ULE can.

I tested this mainly using:

	time /tmp/q1 &
	time /tmp/q1 &
	acroread *pdf		# type ^q to exit acroread

where /tmp/q1 measures latency by calling clock_gettime() in a loop and
there are 12 pdf files of total size 4.75MB.  acroread is sufficiently
bloated and hoggish to have very bad behaviour here.
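/tmp/q1 is nothing special.  A minimal sketch of that kind of latency
measurer (not the actual program; the real one presumably does a fixed
amount of work to get ~2.5 seconds of self time, while this sketch just
spins for a fixed wall-clock interval) records the largest gap seen
between consecutive clock_gettime() samples:

%%%
#include <stdio.h>
#include <time.h>

/* Seconds between two timespecs. */
static double
tsdiff(const struct timespec *a, const struct timespec *b)
{
	return ((b->tv_sec - a->tv_sec) + (b->tv_nsec - a->tv_nsec) * 1e-9);
}

int
main(void)
{
	struct timespec start, prev, now;
	double gap, maxgap = 0.0;

	clock_gettime(CLOCK_MONOTONIC, &start);
	prev = start;
	do {
		clock_gettime(CLOCK_MONOTONIC, &now);
		gap = tsdiff(&prev, &now);
		if (gap > maxgap)
			maxgap = gap;
		prev = now;
	} while (tsdiff(&start, &now) < 2.5);	/* ~2.5 seconds wall clock */
	printf("max latency: %.6f seconds\n", maxgap);
	return (0);
}
%%%

The gap between consecutive samples is normally well under a
microsecond, so a large maximum is mostly time during which the process
was not running, i.e. a scheduling delay.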
The results when this is run on an xterm that has initially been idle
for some time (or is in some more magic state for ULE interactivity?)
at loadavg 20 are approximately:

all: acroread starts fast for the first few runs (would be ~1 second
with no load; this only increases by a second or two).  /tmp/q1 runs
for ~2.5 seconds self time and shows low max latency (would be ~200
usec with no load; this increases to ~10 msec; both high variance).

~5.2-4BSD: after a few runs, the parent priority becomes near the max,
so further runs take 5-10 seconds to start.  20 seconds at a load avg
of 20 would be fairer, but the parent priority doesn't get as near the
max as the background hogs' priorities.  After a few runs, max latency
is usually 100-500 msec and was once 2 seconds.  Latency in mouse
movements is not noticeable.

current-4BSD: further runs don't take much longer to start.  Apparently
the parent doesn't inherit enough priority.  (In 4.2 it inherited far
too much.)  After a few runs, max latency is usually 1-2 seconds and
was once 27 seconds.  The latency of 1-2 seconds is often noticeable
for mouse movements and even for echo in xterms.

current-ULE: further runs sometimes take _much_ longer, a minute or so,
and there is a high variance in the length.  After a few runs, max
latency is usually a few hundred msec larger than for ~5.2.  Latency in
mouse movements is not noticeable.

>> In at least this phase, ^C to kill processes doesn't work, but ^Z to
>> suspend them and then kill from the shell works normally, and
>> interactivity in not-very-bloated mail programs and editors is very
>> bad.

A ^C fails only in the phase where hz is small, preempt_thresh is
larger, and (?) the parent hasn't gained much priority and/or
(negative?) interactivity.

>> Other behaviour with 4BSD schedulers and various kernels:
>> - the max scheduling delay is almost independent of the CPU speed.

This may be because it is just a function of the priorities, which are
mainly a function of the algorithm.

>> - the max scheduling delay is slightly worse for -current with 4BSD
>>   than with my ~5.2.

Actually, it is much worse.

>> - -current has anomalous behaviour relative to ~5.2 for background
>>   makeworld -j16: many fewer runnable processes, a much smaller max
>>   load average, and many more zombies visible when top looks.

This may be related to the slow startup of the shell loops and caused
by the priority inheritance for fork/exit.

>> - [queue hack]
>>   ...
>>   essentially roundrobin scheduling under loads that generate lots
>>   of interrupts.  Interactivity is still poor because makeworld
>>   sometimes generates a few hundred processes per second and cycling
>>   through that many takes a long time even with a tiny quantum.

makeworld actually generates remarkably few interrupts when run on disk
file systems (an average of only about 30 non-clock interrupts per
second in my config).

>> - reducing kern.sched.quantum never had much effect.  Same for
>>   increasing HZ in -current with 4BSD.

Bruce