From owner-freebsd-bugs@freebsd.org Tue Oct 27 12:37:54 2015 Return-Path: Delivered-To: freebsd-bugs@mailman.ysv.freebsd.org Received: from mx1.freebsd.org (mx1.freebsd.org [IPv6:2001:1900:2254:206a::19:1]) by mailman.ysv.freebsd.org (Postfix) with ESMTP id 577478F4F for ; Tue, 27 Oct 2015 12:37:54 +0000 (UTC) (envelope-from brde@optusnet.com.au) Received: from mail107.syd.optusnet.com.au (mail107.syd.optusnet.com.au [211.29.132.53]) by mx1.freebsd.org (Postfix) with ESMTP id CE981110A for ; Tue, 27 Oct 2015 12:37:53 +0000 (UTC) (envelope-from brde@optusnet.com.au) Received: from c211-30-166-197.carlnfd1.nsw.optusnet.com.au (c211-30-166-197.carlnfd1.nsw.optusnet.com.au [211.30.166.197]) by mail107.syd.optusnet.com.au (Postfix) with ESMTPS id 4B3E6D47633 for ; Tue, 27 Oct 2015 23:37:44 +1100 (AEDT) Date: Tue, 27 Oct 2015 23:37:43 +1100 (EST) From: Bruce Evans X-X-Sender: bde@besplex.bde.org cc: freebsd-bugs@freebsd.org Subject: Re: [Bug 173541] load average 0.60 at 100% idle In-Reply-To: Message-ID: <20151027203059.M4268@besplex.bde.org> References: MIME-Version: 1.0 Content-Type: TEXT/PLAIN; charset=US-ASCII; format=flowed X-Optus-CM-Score: 0 X-Optus-CM-Analysis: v=2.1 cv=R4L+YolX c=1 sm=1 tr=0 a=KA6XNC2GZCFrdESI5ZmdjQ==:117 a=PO7r1zJSAAAA:8 a=9cW_t1CCXrUA:10 a=JzwRw_2MAAAA:8 a=kj9zAlcOel0A:10 a=6I5d2MoRAAAA:8 a=wUw-jFn8E2v00FhOs0IA:9 a=8ajzKR70VV3yYtNA:21 a=sJc9BhASvHIeNabN:21 a=CjuIK1q_8ugA:10 X-BeenThere: freebsd-bugs@freebsd.org X-Mailman-Version: 2.1.20 Precedence: list List-Id: Bug reports List-Unsubscribe: , List-Archive: List-Post: List-Help: List-Subscribe: , X-List-Received-Date: Tue, 27 Oct 2015 12:37:54 -0000 On Tue, 27 Oct 2015 bugzilla-noreply@freebsd.org wrote: > https://bugs.freebsd.org/bugzilla/show_bug.cgi?id=173541 > > --- Comment #9 from Alexander Motin --- > I don't know whether it is the only cause of such behavior, but I have one > logical explanation. The problem goes from the sampled nature of our load > average calculation. To reduce power usage new clock code tries to aggregate > events in bursts to reduce number of CPU wakeups. As result, if system is > idle, its CPU may wakeup as low as two times per second. And if that CPU has > some other events waking it up, it is quite likely they will be aggregated with > load average sampling. As result, there is high enough probability that > sampler will see some "load" to account, while it has no idea that that load > will disappear in microsecond for another second. Enabling periodic interrupts > breaks the event aliasing -- load average sampling will more likely fire in > moments when system is really idle. But do we want to consume several extra > watts of power just for one good looking number? To solve this we should > probably change our mechanism of load average calculation somehow. If anybody > have ideas -- I am all ears. Periodic interrupts give aliasing even more surely. I complained directly to mav about this, but receive no reply (usually he replies fast). Also, statistics in top that should help debug this are very broken. I think then still work in ps, but are harder to see there. I use hz = 100 and don't use SCHED_ULE. This moves the problems. I didn't know about the problem with the load average, but it is basically the same. I saw mainly the reverse problem. With a periodic timer on a UP system, a single very active process was charged about 0% and the single CPU-idle process was changed almost 100%. This setup was relatively easy to debug (still not very easy). This was because the active process mostly ran immediately after being woken up after a timeout. The accounting and load average stuff runs before the wakekup so it charges the idle process. Then provides the active process usually runs for slightly less then the timer period, it usually gets charged nothing and the cycle repeats. The details depend on the how usually the activity time is less than the period. I only tried periodic mode because normal mode gave finer timeout granularity than I wanted. I was running tests that were sensitive to the timeout granularity and were calibrated for it being 1/hz. But normal mode now gives much finer granularity than 1/hz. This broke comparisons of test results, but avoided the aliasing problems by giving lots of jitter for the timeouts. Periodic mode gave comparable test results, but gave more aliasing problems than I have noticed before. The event timer was the lapic timer. In periodic mode it is programmed to 400 Hz. 400 is the approximate beat frequency of hz = 100 and stathz = 128. stathz is modified a little to 133.33... and then approximated by 133. The timeout granularity could be 1/400 instead of 1/hz, but this would give the same programmable security hole as most other configurations, and the actual timeout granularity seems to be 1/hz at least for processes which ask for that. Unless the implementation does anti-aliasing, and it doesn't seem to do any, this gives an aliasing frequency of 33.33 Hz -- relative to the 400 Hz clock, the hz ticks occur at periods 0, 4, 8, 12, ... and the stathz ticks occur at periods 0, 3, 6, 9, 12, so they are in phase (only) on every 12th event timer tick or every 3rd hz tick or every 4th stathz ticks. So what I said above is not quite correct. It takes a malicious process to hide from statclock by staying in phase intentionally. The security hole is just denial of service and increasing charges for other processes. I have CPU to spare so I don't care about malicious process doing this. But I want statistics to work right for testing. My test program was somehow staying in phase unintentionally. Also, once it is in phase it can run without being charged for for 3 1/hz ticks, not just 1. I think scheduling and statistics can mostly be fixed for non-malicious processes by simple anti-aliasing of statlock. With a periodic timer, this means never running it in phase. The above configuration is especially simple. The 4 statclock ticks must be distributed into just 12 periods 0-11 without hitting the hz distribution of (0, 4, 8) and there is essentially only 1 reasonable way to do this -- start at 2 to be in between 0 and 4, then use 5 since it doesn't alias, then move to avoid 8 since it does alias -- it is only reasonable to go 1 higher or lower, say higher to 9 ---, then back to 11. This gives the distribution (2, 5, 9, 11). To reduce bias, go the other way for the next cycle: (2, 5, 8, 11). To further reduce bias and increase randomness, start the cycles at 1 or 3 instead of 2. This gives adjustments that are too large for accuracy, but not much more can be done at a low frequency. Similarly for non-periodic timers and old periodic timers. Old periodic timers never did this right, although stathz was introduced in response the the old paper statclk-usenix93.ps which described how to do it right at the time. According to the paper, statclock needs to aperiodic and very unpredictable, but FreeBSD only ever implemented a periodic statclock. Old versions of it mostly worked for non-malicious processes. This depends on stathz being larger than the timeout granularity and/or on having a high beat frequency. This was the case when hz was 100 and stathz was 128. A short-running process will usually run for less than 1/stathz (more so in 1993 than now), so it won't be charged at all. But a long-running process or set of processes that doesn't intentionally hide from statclock will tend to wake up at fairly random times by interrupts, or at non-random times using timeouts. The latter were rarely in phase with statclock and didn't stay in phase on at least i386 since the hardware was different. This doesn't for malicious processes, but with hz < stathz it was hard for a malicious process to arrange for itself to keep awak often enough to gain advantage. To avoid being seen often by statclock, the process must predict when a statclock tick will occur and go to sleep as late as possible before it. Then a large timeout granularity prevents the process from waking up until much later using simple timeouts (other wakeup events are much harder to control from userland -- you can't easily generate them from another process without the other process getting charged). The best that can be done easily is: - start with hz and stathz ticks in phase - wake up on hz tick slightly after stathz ticks - run for 1/stathz - epsilon, where 1/stathz should be significantly less than 1/hz, but was only a little less (1/128 < 1/100), and was broken by changing hz from 100 to 1000 (1/1000 is much less than 1/128). - schedule timeout in 1/hz seconds (or better, use a repeating periodic to wake up every 1/hz seconds, except turn this off when it is too close to a stathz tick) - with hz = 100 and stathz = 128, we get the duty cycle: run for 78 msec; sleep for 22 msec; run for 56 msec; sleep for 44 msec; ... We can get nearly 7/8 of the CPU without being charged. - with hz = 1000 and stathz = 128: run for 99 msec; sleep for 1 msec; run for 56 msec; sleep for 1 msec; ... We can get nearly 99/100 of the CPU without being charged. - current with finer-grained timeouts supports the malicious process even better. Malicious process can do a lot of evasion of even aperiodic statclocks if the timeout granuarity is much smaller than the average statclock period. Whatever that average is, run for about half of it at a time to give about a 50% chance of not being seen. This can be improved a bit running for longer immediately after being seen. Perhaps more importantly, an aperiodic statclock would have to be very aperiodic to prevent its prediction by malicious process. Its periods shouldn't be clustered near the average since then it would be too predictable. But that makes it give worse statistics for non-malicious processes. Perhaps it would be a good kernel strategy to: - use an almost periodic statclock for processes that appear to be running a lot - use a very aperiodic statclock for processes that appear not to be running a lot Tickless kernels almost give this as a side effect. Now I describe some of the bugs in top that break observing this. top shows 2 CPU values, CPU and WCPU. CPU is supposed to be the scheduler's internal idea of the CPU usage and WCPU is supposed to a representation of this that is more understandable by users. These only make full sense for SCHED_4BSD where they are different. SCHED_ULE doesn't really have either, and fakes both to the same value. These values are often confused even for SCHED_4BSD. When a process starts up, its %CPU is initially much smaller than 100% although the process is using 100% of 1 CPU. This percentage must be multiplied by a weight 1/(1 - exp(-k*t)) to get 100%. This gives %WCPU. The weight decays rapidly to 1, so %WCP == %CPU for most processes that live long enough to be displayed by top and also to be interesting. Some of the current bugs are: - %WCPU is now unrelated to the scheduler. It is (mis)calculated by top from the thread's runtime. This gives weirdness like %WCPU for idle threads being quite different from %idle shown in the summary. The former is more accurate. The latter is maintained outside of the scheduler in the kernel but is tick-based like the scheduler. - %CPU is only useful as a SCHED_4BSD internal, but the latter is no longer displayed - for SCHED_ULE, the displayed %CPU == the displayed %WCPU as before. These are now faked by top instead of by the scheduler. I think ps still displays the kernel values. - for SCHED_4BSD, the displayed %CPU == what the user is most interested in. It is some value calculated by top. This value is quite different from the scheduler's raw %CPU and the scheduler's %WCPU - for SCHED_4BSD, the displayed %WCPU is garbage. It is the external value scaled by the internal weight 1/(1 - exp(-k*t)). When t is small, this typically converts 100% into garbage like 2000%. - although the thread's runtime is very accurate and should be nonnegative, the calculation apparently produces a small negative value sometimes. Then overflow bugs in scaling this to a percentage produces garbage like 92200000000000000% (this magic number is approx (uint64_t)INT64_MIN / 100). This bug is more noticable on UP systems. Perhaps it only happens for SCHED_4BSD too. - in one mode (I forget if it is -H or +H), the garbage percentages are clamped to 100%. This hides the garbage, and also makes percentages thet are slightly miscalculated as a little above 100% look more accurate than they really are. In the opposite mode, the garbage percentages are displayed. Typically they are on a bunch of processes that were mostly idle. Now I describe some anomalies related to load averages: - on an 8-core system that is mostly idle, with the buggy top (-SH), usually many of the cpuN-idle percentages are in the high 90's, but these are not nearly high enough, and the others oscillate wildly (mostly below 100%). %idle in the summary is stable at 100%. The former is hard to explain. I think all the runtimes are very accurate. freefall with a small load average doesn't oscillate. I suspect the difference is that my hz = 100 makes top's bugs larger here as well as elsewhere. - after upgrading to an old version of top, all the cpuN-idle percentantages are stable at 99.02%. But this is wrong too. The summary idle remains stable at 100%, and I'm sure that 100% is more accurate than 99.02%. All these percentages are tick-based. 99 looks like an off-by-1 error with the error noticable because hz = 100. - the system had a high load average just before it became idle. It seemed to take too long to decay. Perhaps counting 100% idle as 99.02% does that. - another (mostly more general problem) is that only SCHED_4BSD keeps long- term history for individual threads. I'm not sure of the load average decay is the same for the 2 schedulers. Anyway, for individual threads, we expect %WCPU to be very stable for long-running threads with SCHED_4BSD, since it takes a long time to recover from the weight of history. The user might prefer to see the transient load, but this is not the best default since it makes the top threads change more. Bruce