Date:      Wed, 18 Jul 2012 16:29:38 -0400
From:      Ryan Stone <rysto32@gmail.com>
To:        FreeBSD Hackers <hackers@freebsd.org>
Cc:        Alexander Motin <mav@freebsd.org>
Subject:   ULE scheduler miscalculates CPU utilization for threads that run in short bursts
Message-ID:  <CAFMmRNwN9dDp2dHwSZ7p=vkdhppyss=Hyn8wpSwu-SgjYyUd2w@mail.gmail.com>

At $WORK we use a modification of DEVICE_POLLING instead of running
our NICs in interrupt mode.  With the ULE scheduler we are seeing that
CPU utilization (e.g. in top -SH) is completely wrong: the polling
threads always end up being reported at a utilization of 0%.

I see problems both with the CPU utilization algorithm introduced in
r232917 and with the original one.  The problem with the original
algorithm is easy to explain: ULE sampled CPU usage in hardclock(),
which is also what kicks off the polling threads, so a sample is never
taken while a polling thread is running.

It appears that r232917 attempts to do time-based CPU accounting
instead of sampling-based accounting.  sched_pctcpu_update() is called
at various places to update the CPU usage of each thread:

static void
sched_pctcpu_update(struct td_sched *ts, int run)
{
        int t = ticks;

        /* All history is older than the averaging window: start fresh. */
        if (t - ts->ts_ltick >= SCHED_TICK_TARG) {
                ts->ts_ticks = 0;
                ts->ts_ftick = t - SCHED_TICK_TARG;
        /* The window has grown too large: scale the history down. */
        } else if (t - ts->ts_ftick >= SCHED_TICK_MAX) {
                ts->ts_ticks = (ts->ts_ticks / (ts->ts_ltick - ts->ts_ftick)) *
                    (ts->ts_ltick - (t - SCHED_TICK_TARG));
                ts->ts_ftick = t - SCHED_TICK_TARG;
        }
        /* If the thread was running, charge it the ticks since last update. */
        if (run)
                ts->ts_ticks += (t - ts->ts_ltick) << SCHED_TICK_SHIFT;
        ts->ts_ltick = t;
}

The problem with it is that it only works at a granularity of one
tick.  My polling threads are woken up by each hardclock() invocation
and go back to sleep before the next one, so ticks is (almost) never
incremented while a polling thread is running.  This means that when
sched_pctcpu_update() is called as the polling thread goes to sleep,
run=1 but ts->ts_ltick == ticks, so ts_ticks is incremented by 0.
When the polling thread is woken up again, ticks has been incremented
in the meantime and sched_pctcpu_update() is called with run=0, so
ts_ticks is not incremented but ts_ltick is set to ticks.  The net
effect is that ts_ticks is never incremented, so CPU usage is always
reported as 0%.

I think that you'll see the same effect with the softclock threads, too.

I've experimented with reverting r232917 and instead moving the
sampling code from sched_tick() to sched_clock(), and that gives me
reasonably accurate results (for my workload, anyway).  The other
option would be to base sched_pctcpu_update() on a finer-grained time
source than ticks.


