Date:      Wed, 18 Jul 2012 23:47:25 +0300
From:      Alexander Motin <mav@FreeBSD.org>
To:        Ryan Stone <rysto32@gmail.com>
Cc:        FreeBSD Hackers <hackers@freebsd.org>
Subject:   Re: ULE scheduler miscalculates CPU utilization for threads that run in short bursts
Message-ID:  <500720DD.8090605@FreeBSD.org>
In-Reply-To: <CAFMmRNwN9dDp2dHwSZ7p=vkdhppyss=Hyn8wpSwu-SgjYyUd2w@mail.gmail.com>
References:  <CAFMmRNwN9dDp2dHwSZ7p=vkdhppyss=Hyn8wpSwu-SgjYyUd2w@mail.gmail.com>

On 18.07.2012 23:29, Ryan Stone wrote:
> At $WORK we use a modification of DEVICE_POLLING instead of running
> our NICs in interrupt mode.  With the ULE scheduler we are seeing that
> CPU utilization (e.g. in top -SH) is completely wrong: the polling
> threads always end up being reported at a utilization of 0%.
>
> I see problems both with the CPU utilization algorithm introduced in
> r232917 as well as the original one.  The problem with the original
> algorithm is pretty easy to explain: ULE was sampling for CPU usage in
> hardclock(), which also kicks off the polling threads, so samples are
> never taken when the polling thread was running.
>
> It appears that r232917 attempts to do time-based CPU accounting
> instead of sampling-based.  sched_pctcpu_update() is called at various
> places to update the CPU usage of each thread:
>
> static void
> sched_pctcpu_update(struct td_sched *ts, int run)
> {
>          int t = ticks;
>
>          if (t - ts->ts_ltick >= SCHED_TICK_TARG) {
>                  ts->ts_ticks = 0;
>                  ts->ts_ftick = t - SCHED_TICK_TARG;
>          } else if (t - ts->ts_ftick >= SCHED_TICK_MAX) {
>                  ts->ts_ticks = (ts->ts_ticks / (ts->ts_ltick - ts->ts_ftick)) *
>                      (ts->ts_ltick - (t - SCHED_TICK_TARG));
>                  ts->ts_ftick = t - SCHED_TICK_TARG;
>          }
>          if (run)
>                  ts->ts_ticks += (t - ts->ts_ltick) << SCHED_TICK_SHIFT;
>          ts->ts_ltick = t;
> }
>
> The problem with it is that it only seems to work at the granularity
> of 1 tick.  My polling threads get woken up at each hardclock()
> invocation and stop running before the next hardclock() invocation, so
> ticks is (almost) never incremented while the polling thread is
> running.  This means that when sched_pctcpu_update is called when the
> polling thread is going to sleep, run=1 but ts->ts_ltick == ticks, so
> ts_ticks is incremented by 0.  When the polling thread is woken up
> again, ticks has been incremented in the meantime and
> sched_pctcpu_update is called with run=0, so ts_ticks is not
> incremented but ltick is set to ticks.  The effect is that ts_ticks is
> never incremented so CPU usage is always reported as 0.
>
> I think that you'll see the same effect with the softclock threads, too.

It is obvious that it is impossible to measure pctcpu for 
hardclock-synchronized threads using hardclock as the only time source. 
The mentioned change made things neither more nor less broken than they 
already were.
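To spell out the arithmetic: both updates for such a thread happen at 
the same value of ticks, so nothing is ever accumulated. Here is a 
rough userland model of just the accumulation step (the names and the 
constant are simplified stand-ins, not the kernel code):

#include <stdio.h>

#define SCHED_TICK_SHIFT	10	/* stand-in value */

static int ticks;			/* simulated hardclock counter */

struct td_sched {
	int ts_ticks;			/* accumulated weighted run ticks */
	int ts_ltick;			/* tick of the last update */
};

static void
update(struct td_sched *ts, int run)
{
	int t = ticks;

	if (run)
		ts->ts_ticks += (t - ts->ts_ltick) << SCHED_TICK_SHIFT;
	ts->ts_ltick = t;
}

int
main(void)
{
	struct td_sched ts = { 0, 0 };
	int i;

	for (i = 0; i < 100; i++) {
		ticks++;		/* hardclock fires... */
		update(&ts, 0);		/* ...and wakes the polling thread */
		/* the thread does all of its work within this tick */
		update(&ts, 1);		/* back to sleep: t == ts_ltick here */
	}
	printf("ts_ticks = %d\n", ts.ts_ticks);	/* prints 0 */
	return (0);
}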

> I've experimented with reverting r232917 and instead moving the
> sampling code from sched_tick() to sched_clock(), and that seems to
> give me reasonably accurate results (for my workload, anyway).

That approach will fix pctcpu accounting for threads synchronized to 
hardclock, but I worry it could make things worse for threads that 
switch context around statclock.
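The failure mode there would be the usual sampling aliasing: whichever 
thread happens to be on CPU at the statclock edge gets the whole period 
credited to it. A toy model of that (hypothetical, not the scheduler 
code):

#include <stdio.h>

#define SAMPLES		1000
#define PERIOD		100	/* arbitrary time units per stat period */

int
main(void)
{
	int a = 0, b = 0, s;

	for (s = 0; s < SAMPLES; s++) {
		int edge = PERIOD - 1;	/* instant when the statclock fires */

		/*
		 * A and B each burn 10% of every period, but A runs
		 * during [0, 10) and B during [90, 100), i.e. B is
		 * always the one on CPU when the sample is taken.
		 */
		if (edge < 10)
			a++;
		if (edge >= 90)
			b++;
	}
	printf("A: %d%%  B: %d%% (both really used 10%%)\n",
	    a * 100 / SAMPLES, b * 100 / SAMPLES);
	return (0);
}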

> The
> other option would be to use a timer with a higher granularity than
> ticks in sched_pctcpu_update().

That would be great if we had reliable and cheap timers on x86. Right 
now the timers that are fast (TSC) are unreliable, while the ones that 
are reliable are too slow for this use.
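For the shape of it, a hypothetical variant of the accumulation step 
keyed to such a timer (cheap_timestamp(), ts_lstamp and ts_runtime are 
made-up names here, not existing kernel interfaces, and the fragment is 
a sketch rather than a drop-in patch):

/*
 * Hypothetical per-thread fields such a change would add.
 */
struct td_sched_hires {
	uint64_t ts_runtime;	/* accumulated run time, timer units */
	uint64_t ts_lstamp;	/* timestamp of the last update */
};

static void
sched_pctcpu_update_hires(struct td_sched_hires *ts, int run)
{
	/*
	 * cheap_timestamp() stands in for a fast per-CPU clock; on x86
	 * the obvious candidate is the TSC, with all of its problems
	 * (frequency changes, stops in deep C-states, cross-CPU skew).
	 */
	uint64_t t = cheap_timestamp();

	if (run)
		ts->ts_runtime += t - ts->ts_lstamp;
	ts->ts_lstamp = t;
	/* Windowing/decay of ts_runtime is omitted for brevity. */
}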

-- 
Alexander Motin


