Date: Tue, 27 Oct 2015 23:37:43 +1100 (EST)
From: Bruce Evans <brde@optusnet.com.au>
Cc: freebsd-bugs@freebsd.org
Subject: Re: [Bug 173541] load average 0.60 at 100% idle
Message-ID: <20151027203059.M4268@besplex.bde.org>
In-Reply-To: <bug-173541-8-HoRFWtSpf0@https.bugs.freebsd.org/bugzilla/>
References: <bug-173541-8@https.bugs.freebsd.org/bugzilla/> <bug-173541-8-HoRFWtSpf0@https.bugs.freebsd.org/bugzilla/>
On Tue, 27 Oct 2015 bugzilla-noreply@freebsd.org wrote:
> https://bugs.freebsd.org/bugzilla/show_bug.cgi?id=173541
>
> --- Comment #9 from Alexander Motin <mav@FreeBSD.org> ---
> I don't know whether it is the only cause of such behavior, but I have one
> logical explanation. The problem goes from the sampled nature of our load
> average calculation. To reduce power usage new clock code tries to aggregate
> events in bursts to reduce number of CPU wakeups. As result, if system is
> idle, its CPU may wakeup as low as two times per second. And if that CPU has
> some other events waking it up, it is quite likely they will be aggregated with
> load average sampling. As result, there is high enough probability that
> sampler will see some "load" to account, while it has no idea that that load
> will disappear in microsecond for another second. Enabling periodic interrupts
> breaks the event aliasing -- load average sampling will more likely fire in
> moments when system is really idle. But do we want to consume several extra
> watts of power just for one good looking number? To solve this we should
> probably change our mechanism of load average calculation somehow. If anybody
> have ideas -- I am all ears.

Periodic interrupts give aliasing even more surely. I complained directly to mav about this, but received no reply (usually he replies fast). Also, statistics in top that should help debug this are very broken. I think they still work in ps, but are harder to see there.

I use hz = 100 and don't use SCHED_ULE. This moves the problems around. I didn't know about the problem with the load average, but it is basically the same.

I saw mainly the reverse problem. With a periodic timer on a UP system, a single very active process was charged about 0% and the single CPU-idle process was charged almost 100%. This setup was relatively easy to debug (still not very easy). This was because the active process mostly ran immediately after being woken up after a timeout. The accounting and load average stuff runs before the wakeup, so it charges the idle process. Then, provided the active process usually runs for slightly less than the timer period, it usually gets charged nothing and the cycle repeats. The details depend on how often the activity time is less than the period.

I only tried periodic mode because normal mode gave finer timeout granularity than I wanted. I was running tests that were sensitive to the timeout granularity and were calibrated for it being 1/hz. But normal mode now gives much finer granularity than 1/hz. This broke comparisons of test results, but avoided the aliasing problems by giving lots of jitter for the timeouts. Periodic mode gave comparable test results, but gave more aliasing problems than I had noticed before.

The event timer was the lapic timer. In periodic mode it is programmed to 400 Hz. 400 is the approximate beat frequency of hz = 100 and stathz = 128. stathz is modified a little to 133.33... and then approximated by 133. The timeout granularity could be 1/400 instead of 1/hz, but this would give the same programmable security hole as most other configurations, and the actual timeout granularity seems to be 1/hz at least for processes which ask for that. Unless the implementation does anti-aliasing, and it doesn't seem to do any, this gives an aliasing frequency of 33.33 Hz -- relative to the 400 Hz clock, the hz ticks occur at periods 0, 4, 8, 12, ... and the stathz ticks occur at periods 0, 3, 6, 9, 12, so they are in phase (only) on every 12th event timer tick or every 3rd hz tick or every 4th stathz tick.
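To make that phase arithmetic concrete, here is a minimal standalone sketch (not kernel code; the constants just restate the configuration above) that counts how often the hz and stathz ticks land on the same 400 Hz lapic tick:

/*
 * Minimal sketch of the aliasing arithmetic above: with the lapic timer
 * programmed to 400 Hz, hz ticks land on every 4th event-timer tick and
 * stathz (approximated as 133) on every 3rd, so the two coincide on
 * every 12th tick, i.e. at 400/12 = 33.33 Hz.
 */
#include <stdio.h>

#define EVENT_HZ	400	/* periodic lapic timer frequency */
#define HZ_STEP		4	/* hz = 100   -> every 4th event-timer tick */
#define STAT_STEP	3	/* stathz 133 -> every 3rd event-timer tick */

int
main(void)
{
	int tick, hits = 0;

	for (tick = 0; tick < EVENT_HZ; tick++) {	/* one second */
		int hz_tick = (tick % HZ_STEP == 0);
		int stat_tick = (tick % STAT_STEP == 0);

		if (hz_tick && stat_tick)
			hits++;				/* aliased sample */
	}
	/* Prints 34 (ticks 0, 12, ..., 396): about 33.33 per second. */
	printf("hz and stathz ticks coincide %d times per second\n", hits);
	return (0);
}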
So what I said above is not quite correct. It takes a malicious process to hide from statclock by staying in phase intentionally. The security hole is just denial of service and increasing charges for other processes. I have CPU to spare so I don't care about a malicious process doing this. But I want statistics to work right for testing. My test program was somehow staying in phase unintentionally. Also, once it is in phase it can run without being charged for 3 1/hz ticks, not just 1.

I think scheduling and statistics can mostly be fixed for non-malicious processes by simple anti-aliasing of statclock. With a periodic timer, this means never running it in phase. The above configuration is especially simple. The 4 statclock ticks must be distributed into just 12 periods 0-11 without hitting the hz distribution of (0, 4, 8), and there is essentially only 1 reasonable way to do this -- start at 2 to be in between 0 and 4, then use 5 since it doesn't alias, then move to avoid 8 since it does alias -- it is only reasonable to go 1 higher or lower, say higher to 9 -- then back to 11. This gives the distribution (2, 5, 9, 11). To reduce bias, go the other way for the next cycle: (2, 5, 7, 11). To further reduce bias and increase randomness, start the cycles at 1 or 3 instead of 2. This gives adjustments that are too large for accuracy, but not much more can be done at a low frequency.

Similarly for non-periodic timers and old periodic timers. Old periodic timers never did this right, although stathz was introduced in response to the old paper statclk-usenix93.ps which described how to do it right at the time. According to the paper, statclock needs to be aperiodic and very unpredictable, but FreeBSD only ever implemented a periodic statclock. Old versions of it mostly worked for non-malicious processes. This depends on stathz being higher than the timeout frequency and/or on having a high beat frequency. This was the case when hz was 100 and stathz was 128. A short-running process will usually run for less than 1/stathz (more so in 1993 than now), so it won't be charged at all. But a long-running process or set of processes that doesn't intentionally hide from statclock will tend to wake up at fairly random times due to interrupts, or at non-random times using timeouts. The latter were rarely in phase with statclock and didn't stay in phase on at least i386 since the hardware was different.

This doesn't work for malicious processes, but with hz < stathz it was hard for a malicious process to arrange to stay awake often enough to gain advantage. To avoid being seen often by statclock, the process must predict when a statclock tick will occur and go to sleep as late as possible before it. Then a large timeout granularity prevents the process from waking up until much later using simple timeouts (other wakeup events are much harder to control from userland -- you can't easily generate them from another process without the other process getting charged). The best that can be done easily is (a rough userland sketch of this loop follows the list):
- start with hz and stathz ticks in phase
- wake up on a hz tick slightly after a stathz tick
- run for 1/stathz - epsilon, where 1/stathz should be significantly less than 1/hz, but was only a little less (1/128 < 1/100), and was broken by changing hz from 100 to 1000 (1/1000 is much less than 1/128)
- schedule a timeout in 1/hz seconds (or better, use a repeating periodic timeout to wake up every 1/hz seconds, except turn this off when it is too close to a stathz tick)
- with hz = 100 and stathz = 128, we get the duty cycle: run for 78 msec; sleep for 22 msec; run for 56 msec; sleep for 44 msec; ... We can get nearly 7/8 of the CPU without being charged.
- with hz = 1000 and stathz = 128: run for 99 msec; sleep for 1 msec; run for 56 msec; sleep for 1 msec; ... We can get nearly 99/100 of the CPU without being charged.
- the current finer-grained timeouts support the malicious process even better.
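Here is a hypothetical userland illustration of the evasion loop just described -- a sketch only, not taken from the kernel or from any real exploit. It assumes statclock is strictly periodic at a known frequency with phase 0 relative to CLOCK_MONOTONIC, and the guard interval is a guess:

/*
 * Sketch of a process hiding from a periodic statclock, as described
 * above.  STATHZ, the phase assumption and GUARD_NS are illustrative
 * assumptions.  In practice the 1/hz timeout granularity rounds the
 * nanosleep() up to the next hz tick, which is what the duty-cycle
 * arithmetic in the list above is about.
 */
#include <stdint.h>
#include <time.h>

#define STATHZ		128			/* assumed statclock frequency */
#define NSEC_PER_SEC	1000000000ULL
#define STAT_PERIOD	(NSEC_PER_SEC / STATHZ)	/* ~7.8 msec */
#define GUARD_NS	200000			/* stop 0.2 msec before a tick */

static uint64_t
now_ns(void)
{
	struct timespec ts;

	clock_gettime(CLOCK_MONOTONIC, &ts);
	return ((uint64_t)ts.tv_sec * NSEC_PER_SEC + ts.tv_nsec);
}

int
main(void)
{
	struct timespec ts;
	uint64_t next_stat;

	for (;;) {
		/* Predict the next statclock tick (assumes phase 0). */
		next_stat = (now_ns() / STAT_PERIOD + 1) * STAT_PERIOD;

		/* Burn CPU until just before the predicted tick... */
		while (now_ns() < next_stat - GUARD_NS)
			;

		/* ...then sleep across it.  With 1/hz timeout granularity
		 * the wakeup is rounded up to the next hz tick. */
		ts.tv_sec = 0;
		ts.tv_nsec = 2 * GUARD_NS;
		nanosleep(&ts, NULL);
	}
}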
A malicious process can do a lot of evasion of even aperiodic statclocks if the timeout granularity is much smaller than the average statclock period. Whatever that average is, run for about half of it at a time to give about a 50% chance of not being seen. This can be improved a bit by running for longer immediately after being seen. Perhaps more importantly, an aperiodic statclock would have to be very aperiodic to prevent its prediction by a malicious process. Its periods shouldn't be clustered near the average since then it would be too predictable. But that makes it give worse statistics for non-malicious processes. Perhaps it would be a good kernel strategy to:
- use an almost periodic statclock for processes that appear to be running a lot
- use a very aperiodic statclock for processes that appear not to be running a lot
Tickless kernels almost give this as a side effect.

Now I describe some of the bugs in top that break observing this.

top shows 2 CPU values, CPU and WCPU. CPU is supposed to be the scheduler's internal idea of the CPU usage and WCPU is supposed to be a representation of this that is more understandable by users. These only make full sense for SCHED_4BSD, where they are different. SCHED_ULE doesn't really have either, and fakes both to the same value. These values are often confused even for SCHED_4BSD. When a process starts up, its %CPU is initially much smaller than 100% although the process is using 100% of 1 CPU. This percentage must be multiplied by a weight 1/(1 - exp(-k*t)) to get 100%. This gives %WCPU. The weight decays rapidly to 1, so %WCPU == %CPU for most processes that live long enough to be displayed by top and also to be interesting.

Some of the current bugs are:
- %WCPU is now unrelated to the scheduler. It is (mis)calculated by top from the thread's runtime. This gives weirdness like %WCPU for idle threads being quite different from %idle shown in the summary. The former is more accurate. The latter is maintained outside of the scheduler in the kernel but is tick-based like the scheduler.
- %CPU is only useful as a SCHED_4BSD internal, but that internal value is no longer displayed.
- for SCHED_ULE, the displayed %CPU == the displayed %WCPU as before. These are now faked by top instead of by the scheduler. I think ps still displays the kernel values.
- for SCHED_4BSD, the displayed %CPU == what the user is most interested in. It is some value calculated by top. This value is quite different from the scheduler's raw %CPU and the scheduler's %WCPU.
- for SCHED_4BSD, the displayed %WCPU is garbage. It is the external value scaled by the internal weight 1/(1 - exp(-k*t)). When t is small, this typically converts 100% into garbage like 2000% (a small numeric sketch of this follows the list).
- although the thread's runtime is very accurate and should be nonnegative, the calculation apparently produces a small negative value sometimes. Then overflow bugs in scaling this to a percentage produce garbage like 92200000000000000% (this magic number is approx (uint64_t)INT64_MIN / 100). This bug is more noticeable on UP systems. Perhaps it only happens for SCHED_4BSD too.
- in one mode (I forget if it is -H or +H), the garbage percentages are clamped to 100%. This hides the garbage, and also makes percentages that are slightly miscalculated as a little above 100% look more accurate than they really are. In the opposite mode, the garbage percentages are displayed. Typically they are on a bunch of processes that were mostly idle.
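The weighting discussed above is easy to demonstrate numerically. The decay constant k and the sample values in this sketch are assumptions for illustration (the scheduler's real decay depends on the load average), but it shows both the intended use of the weight and the kind of garbage produced when the weight is applied to a value that is already a plain percentage:

/*
 * Sketch of the %WCPU weighting discussed above.  k and the sample
 * values are illustrative assumptions, not the scheduler's parameters.
 */
#include <math.h>
#include <stdio.h>

static double
weighted_cpu(double raw_pct, double t_sec, double k)
{
	double w;

	if (t_sec <= 0.0)
		return (raw_pct);
	w = 1.0 - exp(-k * t_sec);	/* weight is 1/w; decays to 1 as t grows */
	return (raw_pct / w);
}

int
main(void)
{
	/*
	 * Intended use: a 1-second-old process that used 100% of a CPU
	 * may have a raw decayed %CPU of only ~5%; the weight turns
	 * that back into roughly 100%.
	 */
	printf("raw 5%%, weighted   -> %.0f%%\n", weighted_cpu(5.0, 1.0, 0.05));

	/*
	 * The bug described in the list above: applying the same weight
	 * to a value that is already a plain percentage gives ~2000%.
	 */
	printf("raw 100%%, weighted -> %.0f%%\n", weighted_cpu(100.0, 1.0, 0.05));
	return (0);
}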
Now I describe some anomalies related to load averages:
- on an 8-core system that is mostly idle, with the buggy top (-SH), usually many of the cpuN-idle percentages are in the high 90's, but these are not nearly high enough, and the others oscillate wildly (mostly below 100%). %idle in the summary is stable at 100%. The former is hard to explain. I think all the runtimes are very accurate. freefall with a small load average doesn't oscillate. I suspect the difference is that my hz = 100 makes top's bugs larger here as well as elsewhere.
- after switching to an old version of top, all the cpuN-idle percentages are stable at 99.02%. But this is wrong too. The summary idle remains stable at 100%, and I'm sure that 100% is more accurate than 99.02%. All these percentages are tick-based. 99 looks like an off-by-1 error, with the error noticeable because hz = 100.
- the system had a high load average just before it became idle. It seemed to take too long to decay. Perhaps counting 100% idle as 99.02% does that.
- another (mostly more general) problem is that only SCHED_4BSD keeps long-term history for individual threads. I'm not sure if the load average decay is the same for the 2 schedulers. Anyway, for individual threads, we expect %WCPU to be very stable for long-running threads with SCHED_4BSD, since it takes a long time to recover from the weight of history. The user might prefer to see the transient load, but this is not the best default since it makes the top threads change more.

Bruce