From: Terry Lambert
Message-Id: <199912140237.TAA01908@usr08.primenet.com>
Subject: Re: Recent face to face threads talks.
To: eischen@vigrid.com (Daniel M. Eischen)
Date: Tue, 14 Dec 1999 02:37:47 +0000 (GMT)
Cc: adsharma@sharmas.dhs.org, arch@freebsd.org
In-Reply-To: <3853F3C0.EE96FB16@vigrid.com> from "Daniel M. Eischen" at Dec 12, 99 02:13:04 pm
Sender: owner-freebsd-arch@FreeBSD.ORG

> > > Wouldn't this be akin to what other systems call a kernel thread or LWP?
> >
> > A kernel thread or a LWP has a 1-to-1 relationship with a kernel stack.
> > In our discussion, a KSE has a 1-to-1 relationship with a kernel stack.  The
> > purpose of the Q structure was to enforce "scheduling classes" - to
> > represent a set of threads having the same priority and to ensure
> > fairness in the scheduler.
> >
> > Apart from this distinction, KSE is the closest to a kernel thread/LWP
> > and could possibly be used to implement blocking interrupt threads.
>
> A KSE can't be placed in the run queue (or isn't in our implementation).
> A LWP or kernel thread can be.  A LWP can have it's own priority.
> I'd argue that our Q much closer resembles a LWP than does a KSE.

This is probably more detail than is appropriate, since we agreed to
defer the separation of the "Q" until later, but...

"Q" is a structure that acts as a scheduler reservation.  Its purpose
is to create an association between KSEs, a process, and a quantum at
a certain priority.  It also provides the basis for getting multiple
processors into user space at the same time.

In a multithreaded process on an SMP machine, where all threads are at
the same priority, you will normally have one "Q" per CPU, unless you
have more CPUs than threads, in which case you'll have a number of
"Q"s equal to the number of threads.

A traditional UNIX process is the same as one "Q" and one KSE.

The "Q" is the structure against which quantum is accounted, and is
the container for the priority of the process.
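As a concrete illustration of the above -- a sketch only, with every
structure and field name invented here rather than taken from any
actual source -- a "Q" might look something like this:

#include <sys/queue.h>

struct proc;				/* forward declarations only;  */
struct kse;				/* a sketch, not kernel source */

/*
 * Hypothetical scheduler reservation ("Q"): ties a set of KSEs to a
 * process and to a quantum at a given priority, and is the object
 * that actually sits on a per-CPU run queue.
 */
struct sched_q {
	TAILQ_ENTRY(sched_q)	 q_runq;	/* linkage on a per-CPU run queue */
	struct proc		*q_proc;	/* process owning the reservation */
	TAILQ_HEAD(, kse)	 q_kseq;	/* KSEs issued against it */
	int			 q_priority;	/* priority it competes at */
	int			 q_quantum;	/* remaining quantum, accounted here */
	int			 q_cpu;		/* CPU the reservation currently lives on */
};

The traditional UNIX process described above would then correspond to
exactly one such reservation with exactly one KSE on its list.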
A "Q" is a close approximation of the Solaris "LWP", but it is not the
same thing, and using the same terminology should probably be avoided,
since it will be confusing.

The difference is that an LWP is a kernel thread in the strictest
formal sense: it is an explicit backing object, and one or more user
space threads are mapped onto it.  Solaris as of 2.5 made this mapping
less explicit, by allocating another LWP to run the UTS (user space
threads scheduler) when the kernel ran out of LWPs but there were
still runnable user space threads.  This extra LWP signalled the UTS,
which would then call down to create more backing LWPs on demand for
the runnable user space threads.

As an unfortunate historical note, SunOS 4.1.x supported something
called "liblwp", which was effectively a wrapper for turning a
blocking I/O call into a non-blocking I/O call plus a threads context
switch.  Sun added aio_read/aio_write/aio_wait/aio_cancel system
calls, as well as a helper system call for dealing with the user
stack, which knew about SPARC register windows; the University of
Washington technical report "User space threads and SPARC Register
Windows" (1992) describes the implementation.

In terms of the scheduler, "Q"s go onto run queues, while KSEs go onto
sleep queues.

KSEs have become what I would call a blocking context for async
system calls -- a "blocking call context", or BCC for short.  They
are intended to be allocated out of a pool, or preferably out of a
per-CPU resource pool, so as to avoid requiring The Big Giant
Lock(tm), except when draining or refilling a pool at its high water
mark or low water mark, respectively (this is precisely what Dynix
did to achieve much higher granularity, at the cost of maintaining a
two-layer malloc() system, with the possibility of some memory being
"lost" to the non-empty pools; it is a brilliant design, IMO).

KSEs are container objects for sleeping on their addresses and for
the kernel stack associated with each outstanding blocked system
call.  The stack _may_ be given back prior to the KSE itself becoming
unreferenced.  A KSE must also have queue pointers for putting it on
the sleep queues, and for maintaining a linked list in the pools.

Editorial note: A "Q" is what I would have called the KSE.  A kernel
schedulable entity implies an interaction with the scheduler, but we
are probably going to get stuck with the current definitions.  If so,
I suggest the term SR, for "Scheduler Reservation", as a better fit.

In terms of the scheduler (also deferred work), the main idea is to
have a per-CPU scheduler that grabs a per-CPU scheduler lock each time
it wants to manipulate its data.  When migrating "Q"s between CPUs,
the migrating CPU will have to grab both its own scheduler lock
("migrate from") and the target CPU's scheduler lock ("migrate to").
In the common case, a CPU would only grab its own lock, which is not
in the same contention domain as the other CPUs' locks, so there would
be no stalling, as there is with The Big Giant Lock(tm).  Like the
per-processor KSE pools, this enables much greater concurrency: a
migration of a "Q" from one CPU to another on an 8-CPU system will
occupy one CPU and potentially stall a second, leaving the other six
CPUs unaffected.
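A minimal sketch of the locking scheme just described, continuing the
hypothetical struct sched_q above.  The lock primitives, the structure
and function names, and the fixed lock-ordering rule are all my own
illustration, not anything from the FreeBSD source:

#include <sys/queue.h>

/* Placeholder spin lock, only so the sketch is self-contained. */
typedef volatile int sched_lock_t;
static void sched_lock(sched_lock_t *l)   { while (__sync_lock_test_and_set(l, 1)) ; }
static void sched_unlock(sched_lock_t *l) { __sync_lock_release(l); }

struct cpu_sched {
	sched_lock_t		cs_lock;	/* this CPU's scheduler lock */
	TAILQ_HEAD(, sched_q)	cs_runq;	/* "Q"s runnable on this CPU */
};

extern struct cpu_sched	cpu_sched[];		/* one instance per CPU */

/*
 * Move a reservation from one CPU's run queue to another's.  Only the
 * two per-CPU locks involved are taken; every other CPU keeps working
 * against its own lock.  Taking the locks in a fixed order (lower CPU
 * index first) keeps two simultaneous migrations in opposite
 * directions from deadlocking against each other.
 */
void
q_migrate(struct sched_q *q, int from, int to)
{
	int lo = from < to ? from : to;
	int hi = from < to ? to : from;

	if (from == to)
		return;

	sched_lock(&cpu_sched[lo].cs_lock);
	sched_lock(&cpu_sched[hi].cs_lock);

	TAILQ_REMOVE(&cpu_sched[from].cs_runq, q, q_runq);
	q->q_cpu = to;
	TAILQ_INSERT_TAIL(&cpu_sched[to].cs_runq, q, q_runq);

	sched_unlock(&cpu_sched[hi].cs_lock);
	sched_unlock(&cpu_sched[lo].cs_lock);
}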
There are only three cases where a "Q" is created:

o	On fork(2) (obviously).

o	When asking to be scheduled on additional CPUs in an SMP
	system.  It is likely that this particular case will not
	require privileges, or that soft limits requiring privileges
	to violate will be implemented (e.g. an administrative limit
	of reservations on 4 CPUs in a 16-CPU system, controlled by
	the user's login class).

o	When asking for an additional "Q" in order to effect
	scheduling priorities for PTHREAD_SCOPE_SYSTEM threads.

It is likely that dropping priority will be unprivileged, and will
merely result in a UTS designation change, since creating an
additional "Q" would permit the process to compete as multiple
processes against other processes for quantum (in other words, it
would pretend to run the affected threads at a lower system priority,
but in reality would not).  Raising priority is already a privileged
operation; it is assumed that raising priority, by virtue of the fact
that it requires privilege, will in fact create another "Q".

To be able to effectively manage scheduler reservations at multiple
priorities, except in the face of priority lending to deal with
inversion, there is a new requirement for an explicit yield(2) call to
give unused higher-priority quantum back to the system.

Because CPU migration is an exceptional event for a process, and is a
matter of kernel scheduler policy, there is natural CPU affinity for
each "Q" in this approach.  This is many times better than the current
situation in both FreeBSD and Linux, as I think some experiments by
Mike Smith would support.

As a simple algorithm for achieving initial balancing, adding:

	u_int32_t	p_cpumask;	/* CPUs with scheduler reservations */
	u_int32_t	p_Q_count;	/* number of "Q"s in the process */

to the proc structure, and then setting a bit for each CPU that
currently has a "Q" outstanding on it (and moving the bit if the "Q"
gets migrated), will ensure that each new "Q" can be created on a
different CPU.  Counting the bits and keeping a running "Q" count is a
cheap way of knowing whether multiple "Q"s have ended up on the same
CPU as a result of migration; you would then need to traverse the list
to identify these and move them between CPUs, assuming a load
balancing vs. concurrency algorithm in the kernel scheduler.
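A rough sketch of that balancing step, using only the two fields
proposed above.  The stub structure and the helper name are stand-ins
for illustration; a real version would run under the scheduler's locks
and would pick the least loaded CPU in the fallback case rather than
CPU 0:

#include <sys/types.h>

struct proc {				/* stub: only the proposed fields */
	u_int32_t	p_cpumask;	/* CPUs with scheduler reservations */
	u_int32_t	p_Q_count;	/* number of "Q"s in the process */
};

/*
 * Choose a CPU for a new "Q" belonging to p.  Prefer a CPU that does
 * not already host one of this process's reservations, so each new
 * "Q" lands on a different CPU; if all ncpus CPUs are already taken,
 * the process simply doubles up.
 */
static int
q_pick_cpu(struct proc *p, int ncpus)
{
	int cpu;

	for (cpu = 0; cpu < ncpus; cpu++) {
		if ((p->p_cpumask & (1U << cpu)) == 0) {
			p->p_cpumask |= 1U << cpu;
			p->p_Q_count++;
			return (cpu);
		}
	}
	p->p_Q_count++;			/* more "Q"s than CPUs */
	return (0);
}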
> > It was also agreed that it doesn't make sense to do everything at once,
> > but do it in some sensible order, without precluding elements of the
> > design. Some of the tasks - break the proc structure into a proc and KSE,
> > per CPU scheduling queues, implementing SA.

My own take on this was that implementing the scheduler changes would
be a heck of a lot easier than the other work.  8-).  On the other
hand, the general consensus was that it would be easier to do the
other work than the scheduler changes, so the scheduler changes are
going to need to wait.

> > My personal opinion is that it might be easier to go in steps -
> > rfork/clone based implementation (aka linuxthreads), then a
> > Solaris like m x n implementation and then scheduler activations.
> > Implementation wise, they're in the increasing order of difficulty.
> > From what I heard from Terry, this is also the path that Solaris
> > took. This also lets application writers pick the right model for
> > their work.
>
> I dunno.  It seems easier to me to implement SA from the start,
> concentrating on eliminating much of the extraneous code in libc_r.
> I'd start with making it only work for one process or Q.  The
> userland part of this seems really easy to me.  Once that works,
> I'd add the ability to have multiple Q's.

This is what looked right to me, and I think that Jason Evans agreed.
The intent was to not break out "Q" until later, as part of the
scheduler work.  For one thing, we already have Linux threads
implemented (they can be loaded as a kernel module).  For another,
kernel threads backing user space threads is a horrible
implementation, and has all sorts of problems with thread group
(process) affinity when deciding who to run next on a partially
consumed quantum.  Adding thread group affinity to a kernel scheduler
is a sure way to shoot your leg off.

I think that this is at least natural, if not the order I would
choose.  Much of the hardest work is allowing the process to be on
multiple run queues (or on the same run queue multiple times).  This
work could even be debugged on a UP machine, where you stuff the same
process into a single scheduler several times (indeed, this is
necessary for PTHREAD_SCOPE_SYSTEM thread priorities on a uniprocessor
machine).

This seems to me to most resemble the shared library issues, where it
is better to do it right the first time than to take a bogus
implementation (e.g. the System V-style shared library implementation,
which ate a chunk of address space for each version of each library,
since the libraries had to be non-PIC code mapped to the same location
in all processes).  Waiting on Jeffrey Hsu's patches for PIC in GCC
(actually released in support of LKMs, which needed PIC) was The Right
Thing To Do(tm).

> > If we stick to the POSIX threads interface strictly, we shouldn't be
> > afraid of getting stuck with one of the implementations.
> >
> > Other concerns that were expressed - crossing protection boundaries too
> > often to pass messages about blocking and rescheduling. Matt Dillon
> > responded that these messages could be batched and made more efficient.
>
> I agree.  It seemed like the SA paper went to extremes in notifying
> the UTS of unblocked threads.  I think we can do this at more opportune
> times, like when a Q is resumed or on a user->kernel crossing.  I don't
> know that we want to be interrupting a Q while it is running a thread
> just to notify the UTS that another thread has unblocked in the kernel.

Julian had a nice discussion with Jason (and later with Matt and me)
on this topic.  The basic idea was to run the threads up to the
user/kernel boundary, but no farther.  At that point, the call has
effectively returned everything but the CPU on which it is running to
user space, and a nice optimization is that the kernel stack
associated with the KSE can be released.  This would allow hundreds of
thousands of outstanding KSEs, a real possibility on a heavily loaded
system.

Matt suggested a shared memory segment, rather than a message, in
which the status could be reflected to user space.  This strikes me as
perhaps a premature optimization, but it would give us a place to put
information about the "Q" of a process returning to user space, which
is something the UTS needs in order to separate quanta of different
priority for assignment to a new user space thread (or to make an
explicit yield(2) call, if the high priority threads are all blocked
-- though a priority lend to the lower priority thread holding the
resource would probably be a better idea, it will require additional
work in user space to implement).


					Terry Lambert
					terry@lambert.org
---
Any opinions in this posting are my own and not those of my present
or previous employers.