From: Terry Lambert <tlambert2@mindspring.com>
Date: Sat, 01 Feb 2003 18:30:17 -0800
To: Matthew Dillon
Cc: Bosko Milekic, "Daniel C. Sobral", Trish Lynch,
    freebsd-current@FreeBSD.ORG
Subject: Re: Hyperthreading and machdep.cpu_idle_hlt

Matthew Dillon wrote:
> The HLT/clock interrupt issue is precisely what I describe in the
> idle_hlt comments in i386/i386/machdep.c (last July).  I wish we had
> a better mechanism than the stupid IPI stuff, like a simple per-cpu
> latch/acknowledge level interrupt (softint), but we don't.  Thanks,
> Intel. 8-).
>
> I don't think we want to over-engineer per-cpu scheduling.  The
> system really doesn't know what cpu a task is going to wind up
> running on until a cpu scheduler (sched_choose()) comes along and
> needs to locate the next task to run.  Too many things can happen in
> between the initiation of the wait, the wakeup, and the task
> actually getting cpu.  Introducing a complex per-cpu wait queue, or
> trying to do something complex at wakeup time instead of at
> sched_choose() time, is just going to be a waste of time.  I think
> it is best to wakeup a task by placing it on the same cpu run queue
> as it was previously on (which is what Jeff's code does for the most
> part),

The approach I've *always* advocated involves per-CPU run queues.
While it's possible to implement affinity and negaffinity with a
global queue, as Linux did, it's not really a good idea.  I know that
the current Linux code works, because Ingo Molnar put a huge amount
of effort into it.  However, the problem of affinity is not
NP-complete; it shouldn't take that kind of effort, and it's not
really possible to deal with it cleanly with a single queue.

Putting an awakened process back on the ready-to-run queue of the
CPU it last ran on is just the obvious way of dealing with CPU
affinity.
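To make that shape concrete, here is a minimal sketch of what I mean.
It's user-space C with pthreads standing in for kernel locking, and
every name in it is made up for illustration; it is not the actual
FreeBSD (or Linux) code:

    #include <pthread.h>
    #include <stddef.h>

    #define NCPU 4                      /* hypothetical CPU count */

    struct task {
        struct task *t_next;
        int t_lastcpu;                  /* CPU this task last ran on */
    };

    struct runq {
        pthread_mutex_t rq_lock;        /* per-CPU lock, not global */
        struct task *rq_head;
    };

    static struct runq runqs[NCPU];

    void
    runq_init(void)
    {
        int i;

        for (i = 0; i < NCPU; i++) {
            pthread_mutex_init(&runqs[i].rq_lock, NULL);
            runqs[i].rq_head = NULL;
        }
    }

    /*
     * Wakeup with the "obvious" affinity: the task goes back on the
     * queue of the CPU it last ran on, where its cache and TLB state
     * are most likely to still be warm.
     */
    void
    sched_wakeup(struct task *t)
    {
        struct runq *rq = &runqs[t->t_lastcpu];

        pthread_mutex_lock(&rq->rq_lock);
        t->t_next = rq->rq_head;
        rq->rq_head = t;
        pthread_mutex_unlock(&rq->rq_lock);
    }

That much buys you affinity, and only affinity.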
It fails to deal with negaffinity, which is necessary for parallel
execution, and it fails to deal with the issue of eliminating locks:
you still get the TLB shootdowns, cache line invalidations, and
inter-APIC arbitration traffic, even if you don't issue an explicit
IPI.

> and deal with task stealing in
> sched_choose().

A "task stealing" or "pull" approach to balancing is just plain
wrong.  It can't deal with transient load spikes, which should be
expected in any heterogeneous load situation.  This is the reason you
never see anyone running benchmarks on such a system unless the
system is otherwise idle and the test loads are homogeneous:
heterogeneous test loads would fail to show "improvement", due to
excessive bouncing.

The other problem with a "pull" model is that it requires the per-CPU
scheduler queue to be locked on every access, so that a different CPU
can lock it and traverse it to "pull" new tasks for itself.

Finally, the "pull" model makes it very hard to deal with SMT, in
terms of making negaffinity decisions (you have to know too much), or
of applying an intentional affinity bias (e.g. stay on the same
physical CPU, but go to "the other side" to avoid cache-busting).

By going to a push model, you limit locking to the migration code
path: a check for a non-NULL per-CPU migration list head can be done
without acquiring the lock that must be held to pull tasks off the
migration queue, or to push tasks off to another CPU (locking the
target's queue for the operation).  What that means is no locking in
the common case.  And no cache line invalidation, no TLB shootdown,
and no inter-APIC arbitration (a sketch of this follows below).

Actually, it's very easy to compare these models implicitly: run a
single CPU with SMT enabled, and compare it to two identical CPUs
with SMT disabled, so that the only real change is the shared parts,
and the lack of the cache, TLB, and APIC crap in the first case.
With the inherent stalls from shared resources in the SMT case, you'd
expect a "true" SMP to have significantly higher performance than
SMT.  But that's not what you see, even with Jeff's new scheduler.

> The scheduler, when it comes time to actually switch
> in the next runnable task, then deals with complexities associated
> with misbalancing (i.e. cpu A is idle and ready to accept a new
> task, and cpu B's run-queue has a task ready to be run).

Yes, currently, and this is wrong.  Or rather, the underutilized CPU
should not be doing the balancing, if it has to stall the most
heavily loaded CPU in order to do it... and therefore "pull" is
wrong.

What *should* happen is that, at the time of the next context switch,
the heavily loaded CPU should decide to "shed load".  Proper
hysteresis really wants that decision to be made periodically, with a
periodicity of once per (N+1) entries into the per-CPU scheduler,
where N is the number of CPUs participating in the SMP region within
which tasks may be migrated (this is part of the strategy for
avoiding excessive migration).

If this decision can be made on a single figure of merit, which we'll
call "load", and the figure is stored in an atomically readable
(e.g. integer) data area, then it can be read without locks.  Since
it's only written by the CPU whose figure of merit it is, no locks
are required to arbitrate the process of determining a load
imbalance, either.
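For concreteness, here is what that push path might look like,
continuing the made-up user-space sketch from above.  C11 atomics
stand in for whatever primitives the kernel would really use, and the
names are still hypothetical:

    #include <stdatomic.h>      /* continues the sketch above; NCPU
                                   and struct task are reused */

    struct cpu_sched {
        atomic_int load;                /* written only by its own CPU */
        pthread_mutex_t mig_lock;       /* guards the migration queue */
        struct task *_Atomic mig_head;  /* incoming migrated tasks */
    };

    static struct cpu_sched cpus[NCPU];

    void
    cpus_init(void)
    {
        int i;

        for (i = 0; i < NCPU; i++) {
            atomic_init(&cpus[i].load, 0);
            pthread_mutex_init(&cpus[i].mig_lock, NULL);
            atomic_init(&cpus[i].mig_head, NULL);
        }
    }

    /*
     * Push path: a loaded CPU sheds one task to the least loaded
     * peer.  Reading the peers' load figures takes no locks (each is
     * written only by its own CPU); the only lock taken is the
     * target's migration queue lock, held just for the enqueue.
     */
    int
    sched_push(int self, struct task *t)
    {
        int i, target = self;
        int min = atomic_load(&cpus[self].load);

        for (i = 0; i < NCPU; i++) {
            if (i != self && atomic_load(&cpus[i].load) < min) {
                min = atomic_load(&cpus[i].load);
                target = i;
            }
        }
        if (target == self)
            return 0;                   /* no one less loaded; keep it */

        pthread_mutex_lock(&cpus[target].mig_lock);
        t->t_next = atomic_load(&cpus[target].mig_head);
        atomic_store(&cpus[target].mig_head, t);
        pthread_mutex_unlock(&cpus[target].mig_lock);
        return 1;
    }

    /*
     * Receive path, run from the CPU's own scheduler.  The common
     * case is one unlocked load of the list head; a task pushed
     * between checks is simply picked up on the next pass.
     */
    void
    sched_check_migrations(int self)
    {
        struct cpu_sched *cs = &cpus[self];
        struct task *t;

        if (atomic_load(&cs->mig_head) == NULL)
            return;                     /* common case: no lock taken */

        pthread_mutex_lock(&cs->mig_lock);
        while ((t = atomic_load(&cs->mig_head)) != NULL) {
            atomic_store(&cs->mig_head, t->t_next);
            /* ... put t on this CPU's own run queue ... */
        }
        pthread_mutex_unlock(&cs->mig_lock);
    }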
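And the shed-load decision with the once-per-(N+1)-entries hysteresis
might be grafted onto the same made-up cpus[] array like this (again
only a sketch; the imbalance threshold and the choice of which task
to shed are placeholders):

    /*
     * Also continues the sketch above.  Each CPU keeps a private
     * entry counter; only every (NCPU + 1)th entry into its own
     * scheduler does it even consider shedding load, which provides
     * the hysteresis against excessive migration.
     */
    static int sched_entries[NCPU];     /* each slot touched by 1 CPU */

    void
    sched_enter(int self, struct task *candidate)
    {
        int i;

        if (++sched_entries[self] % (NCPU + 1) != 0)
            return;

        /*
         * Lock-free imbalance test: every load figure is written
         * only by the CPU it describes, so these reads need no
         * arbitration at all.
         */
        for (i = 0; i < NCPU; i++) {
            if (i != self && atomic_load(&cpus[self].load) >
                atomic_load(&cpus[i].load) + 1) {
                (void)sched_push(self, candidate);  /* shed one task */
                return;
            }
        }
    }

The "+ 1" slop is arbitrary; the point is only that the decision is
periodic, made by the loaded CPU itself, and driven entirely by
locklessly readable figures.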
> While it is true that we would like a cpu to predominantly use the
> per-cpu run-queue that it owns, we don't really lose anything in
> the way of performance by allowing cpu A to add a task to cpu B's
> run queue or for cpu A to steal a task from cpu B's run queue.
> Sure we have the overhead of a per-cpu mutex, but the reason we
> don't lose anything is that this sort of mechanism will *STILL*
> scale linearly with the number of cpus in the system

The problem is in the arbitration of the regions of memory in which
these locks live.  In fact, the scaling is *not* linear with the
number of CPUs, since the arbitration overhead goes up exponentially
with the number of CPUs when they share a contention domain.

> (whereas the global run queue in
> sched_4bsd.c constricts at a single sched_mtx and does not scale).

Preaching to the choir... we all know that a global queue, and the
global locks, suck.  But in truth, any locks suck, and replacing one
suckage with another is not really a useful exercise. 8-).

> The overhead of a per-cpu run-queue with a per-cpu mutex is *STILL*
> effectively O(1), and the more complex overheads involved with
> locating a new task to schedule from some other cpu's run queue
> when the current cpu's run-queue is empty are irrelevant, because
> you are only eating into cycles which would otherwise be idle
> anyway.

It's not actually O(1); the problem is that the lock arbitration
overhead grows arbitrarily large with the number of CPUs.  Also, the
migration cannot be between nodes in a cluster, or between non-local
CPUs in a NUMA system, because the scheduler must be able to access
the scheduling queue of the CPU from which it wishes to "pull" tasks,
and it must hold a lock on that queue for the duration.

Given a "push" model, this isn't a problem, since pushing a task is,
effectively, a messaging operation: it could go so far as to suspend
the task, move it to memory on another node entirely, and resume it
by placing it in the scheduler queue on that node.

As an old Amiga guy (;^)), you should be all for messaging, right?
8-).

-- Terry