Date:      Fri, 20 Sep 2002 14:08:05 -0700
From:      Terry Lambert <tlambert2@mindspring.com>
To:        Rik van Riel <riel@conectiva.com.br>
Cc:        Julian Elischer <julian@elischer.org>, Bill Huey <billh@gnuppy.monkey.org>, freebsd-arch@freebsd.org
Subject:   Re: New Linux threading model
Message-ID:  <3D8B8E35.EDAF4450@mindspring.com>
References:  <Pine.LNX.4.44L.0209201652330.1857-100000@imladris.surriel.com>

Rik van Riel wrote:
> On Fri, 20 Sep 2002, Julian Elischer wrote:
> > On Fri, 20 Sep 2002, Rik van Riel wrote:
> > > There don't seem to be any O(n) loops left in or near this scheduler,
> > > meaning that 1:1 threading with lots of threads becomes possible.
> >
> > The FreeBSD scheduler is moving towards a big rewrite but we want to
> > change "one thing at a time" :-) in that area..
> 
> This is doable in a smallish number of steps, which don't
> even need to be done in this order:
> 
> 1) per-cpu runqueues instead of a global one, which wants ...

Yes.

> 2) ... load balancer between these per-cpu queues

No.

This is the problem with the Linux model.  With an active load
balancing algorithm, you end up with global contention for
scheduler queue locks.

In reality, you can get away from having *any* scheduler queue
locks *at all* for the normal case, and then only contend at
the inter-CPU level, between the CPUs involved, by using a push
migration model.

This basically depends on being able to do a lockless "list is
empty" test, which you can do with a simple pointer compare.

The design is that the per-CPU scheduler checks the push list
for processes which it needs to account for, but which are not
in its per-CPU scheduler queue yet.  If the pointer is non-NULL,
it locks access to the pointer and drains the list into the
local scheduler queue.
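
Continuing the same sketch (same caveats), with local_enqueue() and
free_pushed() standing in for whatever the local scheduler actually
does with the drained entries:

extern void local_enqueue(struct percpu_sched *, struct proc *);
extern void free_pushed(struct pushed_proc *);

/*
 * Called by the owning CPU from its scheduler.  In the common case
 * push_head is NULL and the lock is never touched.
 */
static void
drain_pushed(struct percpu_sched *sc)
{
        struct pushed_proc *list, *pp;

        if (!push_pending(sc))                  /* lockless fast path */
                return;

        pthread_mutex_lock(&sc->push_lock);
        list = atomic_exchange(&sc->push_head, NULL);
        pthread_mutex_unlock(&sc->push_lock);

        while ((pp = list) != NULL) {           /* enqueue locally */
                list = pp->next;
                local_enqueue(sc, pp->p);
                free_pushed(pp);
        }
}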

For the push case, the process(es) to be migrated off the
current CPU are selected, and then a "target" CPU is identified,
based on a figure of merit which is only ever written by the CPU
to which it applies, and which is atomically readable by all
other CPUs, so that no locking is required.  The pointer lock
for that CPU is acquired (notice that this is not in the global
lock contention domain; it only involves the two CPUs in
question), and the preassembled list is added to the list of the
target CPU.  In the normal course of events, this changes a value
from NULL to a list of processes to move into the target
scheduler queue; worst case, the pushing CPU has to assign the
current contents of the protected pointer to the next pointer of
the last entry in the pushed list, and then assign the head of
the pushed list to the protected pointer.  In practice, the number
of processes being migrated will never be more than 1, unless you
write a replacement scheduler that is migration-happy.
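
In the sketch, the push side is the only place any lock is taken at
all, and it is the target's pointer lock, not a global one; the
"worst case" splice above is just the two assignments at the end:

/*
 * Push a preassembled list (head..tail) at the chosen target CPU.
 * Only the two CPUs involved ever touch tgt->push_lock.
 */
static void
push_to(struct percpu_sched *tgt, struct pushed_proc *head,
    struct pushed_proc *tail)
{
        pthread_mutex_lock(&tgt->push_lock);
        /*
         * Normal case: the old value is NULL.  Worst case: splice
         * whatever is already there onto our tail, then publish our head.
         */
        tail->next = atomic_load_explicit(&tgt->push_head,
            memory_order_relaxed);
        atomic_store_explicit(&tgt->push_head, head,
            memory_order_release);
        pthread_mutex_unlock(&tgt->push_lock);
}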

So the migrate-away is one behind the head of the scheduling queue,
and the migrate-to is one ahead.  On migration, an artificial
inflation of the figure of merit can be applied -- or the value
of the pointer can be examined and, if it is non-NULL, assumed to
add a constant weighting to the target CPU's figure of merit.
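
The second variant is, in the sketch, nothing more than a constant
bump when the pointer happens to be non-NULL; the constant and the
convention (merit as a load estimate, lower is a better target) are
mine, not from any real scheduler:

#define PUSH_PENALTY    1       /* arbitrary constant weighting */

/*
 * Figure of merit as seen by a CPU looking for a migration target:
 * two atomic reads of values the target publishes, no locks.
 */
static int
effective_merit(struct percpu_sched *sc)
{
        int m = atomic_load_explicit(&sc->merit, memory_order_relaxed);

        if (push_pending(sc))
                m += PUSH_PENALTY;      /* assume work we can't see yet */
        return (m);
}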

Thus, locking is only ever required when you are actively doing
process migration, and process migration is rare (as it should be,
particularly if one of your target architectures is NUMA, but in
the general case on small-CPU-count shared memory multiprocessors
as well).

This also permits preference weighting based on locality, for
affinity on hyper-threaded CPUs, and negaffinity in CPU sets under
the same circumstances.



> 3) two runqueue arrays (current and expired) instead of
>    just one, which enables ...

Not required.

> 4) ... event-driver priority recalculation, instead of
>    recalculating the priority of each task separately

This actually doesn't work.  The worst case failure is under
overload, which is exactly where you don't want it to be.

The BSD scheduler, as was pointed out, factors the time a process
has not run into the priority weighting.  A granularity of 3
seconds before the distinction between the two queues for
enqueueing delayed jobs is realized is really gross.  8-(.
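
For contrast, this is the shape of the 4.4BSD-style recalculation I
mean, paraphrased from memory rather than quoted from the tree, with
illustrative constants; the point is that the decay depends on how
long the process has not run, which a fixed two-array switch can't
capture:

#define PUSER   50      /* base user priority; illustrative value */

/*
 * Once a second, the CPU usage estimate of a runnable process
 * decays by a load-dependent factor and nice is folded back in.
 */
static int
decay_estcpu(int estcpu, int loadav, int nice)
{
        return ((2 * loadav * estcpu) / (2 * loadav + 1) + nice);
}

/*
 * For a process that has been asleep, the same decay is applied
 * once per second of sleep when it wakes up, so "time not run"
 * feeds directly into the priority it gets.
 */
static int
wakeup_priority(int estcpu, int loadav, int nice, int slptime)
{
        while (slptime-- > 0)
                estcpu = decay_estcpu(estcpu, loadav, nice);
        return (PUSER + estcpu / 4 + 2 * nice);
}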


> These changes are probably small enough that they can be done
> without the risk of destabilising anything.

That's certainly true, regardless of the implementation.

-- Terry




