Date:      Fri, 20 Sep 2002 17:40:29 -0700
From:      Terry Lambert <tlambert2@mindspring.com>
To:        "Bill Huey (Hui)" <billh@gnuppy.monkey.org>
Cc:        Rik van Riel <riel@conectiva.com.br>, Julian Elischer <julian@elischer.org>, freebsd-arch@freebsd.org
Subject:   Re: New Linux threading model
Message-ID:  <3D8BBFFD.E1CDEAD5@mindspring.com>
References:  <Pine.LNX.4.44L.0209201652330.1857-100000@imladris.surriel.com> <3D8B8E35.EDAF4450@mindspring.com> <20020920213629.GA1527@gnuppy.monkey.org>

"Bill Huey (Hui)" wrote:
> > No.
> >
> > This is the problem with the Linux model.  With an active load
> > balancing algorithm, you end up with global contention for
> > scheduler queue locks.
> 
> That's not true, they're all per CPU run queue locks.

The locks are in the global contention domain; basically,
they have to be shared among all processors, unless they
are in uncacheable pages.  Any contended resources are going
to be effectively global, at least if they are contended by
multiple CPUs.
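
Concretely -- and this is my own sketch, not the actual Linux code --
"per CPU" only describes ownership of the queue, not ownership of the
lock's cache line: during load balancing, some other CPU takes your
lock, so the line holding it still bounces around the coherency
fabric, and you still eat the double-lock.  Something like:

	#include <pthread.h>

	#define NCPUS	4

	struct task;				/* opaque here */

	struct runqueue {
		pthread_mutex_t	lock;		/* taken locally *and* remotely */
		struct task	*head;		/* ready-to-run list for this CPU */
		unsigned long	nr_running;
	} __attribute__((aligned(64)));		/* one cache line per queue */

	static struct runqueue rq[NCPUS];	/* pthread_mutex_init() on each
						   lock at startup, omitted */

	/* "Pull" balancing: CPU 'me' steals work from CPU 'victim'. */
	void pull_one(int me, int victim)
	{
		int first = me < victim ? me : victim;
		int second = me < victim ? victim : me;

		pthread_mutex_lock(&rq[first].lock);	/* fixed order, no deadlock */
		pthread_mutex_lock(&rq[second].lock);
		/* ... move one task from rq[victim] to rq[me] ... */
		pthread_mutex_unlock(&rq[second].lock);
		pthread_mutex_unlock(&rq[first].lock);
	}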


> > In reality, you can get away from having *any* scheduler queue
> > locks *at all*, for the normal case, and then only contend at
> > the inter-CPU level for the CPUs using a push migration model.
> 
> That's what Mingo's scheduler does. It only does a double lock
> when the load balance computation makes it do a CPU migrate, otherwise
> there's no contention and no per CPU locks are acquired.

This isn't true if a double-lock is involved.  The issue is whether
or not you need to acquire a lock on your local scheduler queue in
order to use it.  You're saying that you have to acquire the lock,
which means that you effectively have to do it in a globally
contended space.

In the design I'm talking about, there is only a single lock,
and that lock is on a migration queue, into which you *push*
entries.  This is different from the Linux scheduler, from the
current FreeBSD scheduler, and from the current FreeBSD scheduler
with Alfred's old (last year) affinity patches applied.

By only scheduling out of your own queue, and only pushing from
your own queue to a per CPU migration queue, you guarantee *one*
lock on the migration, and you guarantee not needing *any* locks
for your own queue access.

Even if you place the scheduler locks in uncacheable pages, and
eat a factor of 5 access penalty for the barrier, having to take
your own scheduler lock -- or some other CPU's -- still leaves you
eating a huge amount of overhead in the common (non-migration)
case.

> I don't normally ask this, but what the hell are you saying here ?

Here's pseudo-code:

	/* handle migration *to* us */
	IF !empty(my_migration_queue)
		LOCK(my_migration_queue)
		while( !empty(my_migration_queue))
			tmp = my_migration_queue
			my_migration_queue = tmp->next
			insert(my_scheduler_queue, tmp)
			/* recalculate our load figure of merit */
		UNLOCK(my_migration_queue)
	ENDIF

	/* get next scheduled process to run for us */
	next_run = head(my_scheduler_queue)
	remove(my_scheduler_queue, next_run)

	/*
	 * compare our load figure of merit vs. all other CPUs to decide
	 * whether to migrate a process from us to another CPU
	 */
	/* ...find lowest figure of merit of all neighbors... */
	/* ...compare our figure of merit minus a low watermark to that... */
	IF should_migrate
		migrate = head(my_scheduler_queue)
		remove(my_scheduler_queue, migrate)
		LOCK(target_migration_queue)
		migrate->next = target_migration_queue
		target_migration_queue = migrate
		UNLOCK(target_migration_queue)
		/* update our figure of merit */
	ENDIF

	/* ...run "next_run" ... */

...the local CPU scheduler queue is not a contended resource.  Ever.
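
For concreteness, here is roughly how that pseudo-code renders in C
(the names are mine; assume one of these structures per CPU, set up
at boot).  The local run queue is touched only by its owner, so it
takes no lock at all; the single mutex covers only the inbound
migration list, and the owner drains that list in one grab:

	#include <pthread.h>
	#include <stddef.h>

	struct task {
		struct task	*next;
	};

	struct cpu_sched {
		struct task	*runq;		/* private: owner CPU only, no lock */
		struct task	*migq;		/* inbound migrations, pushed by peers */
		pthread_mutex_t	migq_lock;	/* the only lock in the design */
		unsigned long	load;		/* our figure of merit */
	};

	/* Some other CPU pushes a task at us; the only contended path. */
	void push_migrate(struct cpu_sched *target, struct task *t)
	{
		pthread_mutex_lock(&target->migq_lock);
		t->next = target->migq;
		target->migq = t;
		pthread_mutex_unlock(&target->migq_lock);
	}

	/* The owning CPU drains its migration queue into its private run queue. */
	void drain_migrations(struct cpu_sched *me)
	{
		struct task *t, *list;

		/*
		 * Cheap unlocked emptiness test; a real kernel would want an
		 * atomic load or a memory barrier here.
		 */
		if (me->migq == NULL)
			return;

		pthread_mutex_lock(&me->migq_lock);
		list = me->migq;
		me->migq = NULL;		/* take the whole list at once */
		pthread_mutex_unlock(&me->migq_lock);

		while ((t = list) != NULL) {	/* no lock: runq is ours alone */
			list = t->next;
			t->next = me->runq;
			me->runq = t;
			me->load++;		/* update our figure of merit */
		}
	}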

Further, you are permitted any amount of complexity you want in your
target choice.  This includes preferring "hyperthreaded" CPUs on
the same die for affinity's sake (to avoid cache shootdown), and
preferring other physical CPUs for negaffinity's sake (to avoid stall
barriers that affect more than one thread simultaneously, as would be
the case on a hyperthreaded CPU running a process with two threads in
a box with 2 real CPUs with 2 hyperthreaded "CPUs" each).
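
As an illustration only -- the topology table and the helper are
hypothetical, not an existing API -- the target choice might look
like:

	#define NCPUS	4

	/* Hypothetical topology: physical package id for each logical CPU. */
	static const int package_of[NCPUS] = { 0, 0, 1, 1 };

	/* Per-CPU figures of merit, maintained by the scheduler elsewhere. */
	static unsigned long load_of[NCPUS];

	/*
	 * want_affinity != 0: prefer a hyperthread sibling on our own die,
	 * to keep the cache warm.  want_affinity == 0: prefer a different
	 * physical package, so two threads of one process don't stall each
	 * other on shared execution units.
	 */
	int pick_migration_target(int me, int want_affinity)
	{
		int best = -1;

		for (int cpu = 0; cpu < NCPUS; cpu++) {
			int same_pkg = (package_of[cpu] == package_of[me]);

			if (cpu == me)
				continue;
			if (want_affinity ? !same_pkg : same_pkg)
				continue;
			if (best < 0 || load_of[cpu] < load_of[best])
				best = cpu;
		}
		return best;	/* -1: nothing matched the preference */
	}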


> > > 3) two runqueue arrays (current and expired) instead of
> > >    just one, which enables ...
> >
> > Not required.
> 
> That's part of mingo's algorithm to avoid recalculation if I understand
> it correctly. Not exactly sure, I'm stretching my knowledge of his
> algorithm here.

Yes, it's his recalculation avoidance, in order to make the scheduler
"O(1)".  The real effect, though, is to make starvation possible
under heavy load, at which point things which would have been scheduled
preferentially are scheduled round-robin in the other queue.  That this
only triggers when the other queue head (what I would call "the full
quantum queue", as opposed to "the partial quantum queue") has been
waiting "too long" is actually part of the problem, as I see things.


> > > 4) ... event-driver priority recalculation, instead of
> > >    recalculating the priority of each task separately
> >
> > This actually doesn't work.  The worst case failure is under
> > overload, which is exactly where you don't want it to be.
> 
> What kind of overload ? he does a number of things to make sure that
> all processes behave properly by demoting priorities.

And by requeueing them into the full quantum queue, rather than
the "partial quantum queue", if the number of things which need to
run ends up exceeding an arbitrary pool retention time on the "full
quantum queue".

This basically results in mathematical artifacts, from treating a
step function as if it were, in the limit, equivalent to a function
which was continuous.  8-).

The main thing this will do is that, when the load gets high enough
that this happens every time, the system will degrade to the same
behaviour as if a "partial quantum queue" didn't exist.  Or rather, to
that behaviour, plus an additional lock, plus an empty queue
examination, plus an unlock.

The other thing it will do is cause the artifacts, which will result
in unexpected behaviour near the boundary cases.  The worst case for
this is ping-ponging between the queues for an interactive process,
e.g. a game or an MP3 player or an audio encoding process or a CD or
DVD burner, etc., while it's sitting right at the boundary.  This will
not hurt it on average, but it will toggle between getting full vs.
partial quantums (for example), which could make things, uh... "bursty".
;^).

This degraded behaviour will be exactly equal to a complete lack of
thread group affinity.

Actually, the "partial quantum" case can only ever be 66% effective
in any event, as a statistical probability of the best case that
still exhibits contention: it's intent is really to ensure preferential
scheduling of processes which are ready-to-run, and which have partial
quantum remaining.  This actually fails, when the processes have
partial quantum remaining... but aren't ready to run.  The assumption
implicit in this is that the average blocking operation will take no
more than 50% of the duration of a quantum in order to complete in the
background.  Anything longer than that, and the scheduling degrades,
again, to round-robin.

This is actually the primary reasoning behind my design of a test case
for a benchmark, vs. a microbenchmark: the algorithm that's there will
work *very well* for an uncontended system running a single multithreaded
process, but degrade significantly under all other situations (IMO).


I'm pretty sure that Peter and Matt will not let FreeBSD blindly
implement the same model, without at least running the numbers on it,
or the model I've suggested, or *any* model, for that matter.  FreeBSD
is pretty technically conservative, which, at times like this, is
usually a good thing.

-- Terry
