Date:      Wed, 4 Dec 1996 15:00:18 -0700 (MST)
From:      Terry Lambert <terry@lambert.org>
To:        cracauer@wavehh.hanse.de (Martin Cracauer)
Cc:        terry@lambert.org, julian@whistle.com, cracauer@wavehh.hanse.de, nawaz921@cs.uidaho.EDU, freebsd-hackers@freebsd.org
Subject:   Re: clone()/rfork()/threads (Re: Inferno for FreeBSD)
Message-ID:  <199612042200.PAA17332@phaeton.artisoft.com>
In-Reply-To: <9612041247.AA21652@wavehh.hanse.de> from "Martin Cracauer" at Dec 4, 96 01:47:57 pm

> > > He might be correct.
> > > sharing memory spaces makes for a much smaller context switch.
> > 
> > Assuming you switch from one context in a thread group to another.
> 
> > In which case, it is possible for a threaded process to starve all
> > other processes, depending on if its resource requests are satisfied
> > before all the remaining threads in the thread group have also made
> > blocking requests (otherwise, you are not prioritizing by being in
> the thread group, and there are virtually no context switch overhead
> > wins from doing the threading -- only the win that a single process
> > can compete for N quantums instead of 1 quantum when there are N
> > kernel threads in the thread group).
> 
> If I understand you right, you say that the scheduler (the one in the
> kernel; we don't have a userlevel scheduler in this model) must be
> changed to prefer switching to a thread/process that has the same VM
> bits. If not, the cases where the VM space stays the same across the
> switch are too infrequent. The result would be that the theoretical
> advantage of threads, the faster context switch, will be lost in
> practice.
> 
> Your concern is that this required change can lead to situations where
> one thread group takes over more resources than planned. Why is that?
> Why can't the kernel keep track of the resources spent on one
> process/thread-group and act on that basis?

Because each kernel thread is a "kernel schedulable entity".  Quantums
are given to "kernel schedulable entities", they are not given to
thread groups.

For an analogy, do you want all of your FTP clients in a threaded FTP
server to compete per user for quantum, or do you want all N users to
compete for 1/N quantum with all other processes on the system (for
example: against a non-threaded HTTP server)?

This is a "fairness" issue.

If I, as the scheduler, have a "next quantum" to give away, fairness
dictates that I select the next process in the highest priority ready
to run queue to give it to.

Now in order to give it to a process other than this one, I have to
abridge fairness.

Say I have a process ("thread group") that consists of 3 kernel threads,
A, B, and C.

In the abridged scheduler, if A blocks, I prefer to give the next
quantum to kernel schedulable entity B or C instead of the kernel
schedulable entity at the front of the run queue, which may not be
B or C.

The same is true for the others:

	B blocks; prefer A or C
	C blocks; prefer A or B
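
This preference can be sketched as a toy model (the types and the
linear scan below are purely illustrative, not a real run queue):

```c
#include <assert.h>
#include <stddef.h>

/* Toy model of the "abridged" scheduler preference described above;
 * the types and the linear scan are illustrative, not a real scheduler. */
struct kthread {
	int group;	/* thread group ("process") id */
	int ready;	/* ready to run? */
};

/* Pick the next thread to get the quantum: prefer a ready sibling of
 * the group whose thread just blocked, falling back to the first ready
 * thread (the "fair" choice at the front of the run queue). */
struct kthread *
pick_next(struct kthread *runq[], int n, int blocked_group)
{
	struct kthread *head = NULL;
	int i;

	for (i = 0; i < n; i++) {
		if (!runq[i]->ready)
			continue;
		if (runq[i]->group == blocked_group)
			return runq[i];	/* sibling wins: fairness abridged */
		if (head == NULL)
			head = runq[i];	/* remember the fair choice */
	}
	return head;
}
```

Note that the sibling is returned even when another thread sits ahead
of it in the queue; that is exactly the fairness abridgement.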

It's possible that B is a "worker task".  Rather than doing I/O, it's
compute intensive... which is to say, rather than making a system
call and blocking on resource availability (a "voluntary context switch"),
the only way to kick B out is for it to consume its entire quantum.

B is always ready to run.

If in that time, the resource block on A or C is satisfied, then when B
quits, A or C gets the next quantum.  But when A or C quits with C or A
still blocked, B will *always* get the next quantum, since it is always
ready to run.

Now suppose:

	begin_thread_A()
	{
		while(1)
			continue;
	}

	begin_thread_B()
	{
		while(1)
			continue;
	}

No other kernel schedulable entities will ever run once the thread group
becomes active.

Quantum counting doesn't work as a fix because there may be multiple
user space threads bound to a single kernel thread.  Also, I may
have a policy of "treat this thread group as 30 quantums".  Doing
this would mean that after 30 quantums of the thread group monopolizing
the CPU, I forcibly exit the thread group preference cycle.  This
would result in potentially *very* jerky system response.

The natural amount of time you can monopolize the CPU without the
system becoming jerky is *defined* to be the quantum.  Giving away
1/3 of a quantum to a thread in the same thread group is *not* an
option (quantums don't work that way).

Now consider CPU affinity in an SMP environment.  A thread will have
an affinity for a CPU relative to the number of cached pages of that
thread in the given CPU.

Now consider that the purpose of kernel threading is to allow kernel
threads to be scheduled on different CPU's.   In other words, if A
has an affinity for CPU1, then B will have a negative affinity for
CPU1 (a positive affinity for CPUn, n!=1) proportional to A's affinity;
any other approach, and you will not achieve maximal concurrency in
the given application.
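
As a toy model of that interaction (the linear scoring rule and all
the names are assumptions for illustration, not any real kernel's
affinity code):

```c
#include <assert.h>
#include <limits.h>

#define NCPU 2

/* Toy model of SMP affinity as described above: a thread's pull toward
 * a CPU grows with its own cached pages there, and is pushed down by a
 * sibling's cached pages, so sibling threads spread across CPUs.  The
 * linear score is an illustrative assumption. */
int
best_cpu(const int own_pages[NCPU], const int sibling_pages[NCPU])
{
	int cpu, score, best = 0, best_score = INT_MIN;

	for (cpu = 0; cpu < NCPU; cpu++) {
		/* own cache warmth pulls us in; sibling warmth pushes away */
		score = own_pages[cpu] - sibling_pages[cpu];
		if (score > best_score) {
			best_score = score;
			best = cpu;
		}
	}
	return best;
}
```

Even this two-CPU caricature shows the coupling: every placement
decision for one thread now depends on the cache state of its siblings.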

Suddenly, the scheduler becomes so complicated, that the benefit of
having multiple CPUs is diminished!


Preferential scheduling is a black hole.


> In any case, changing the scheduler to favor thread switches instead
> of process switches complicates things so that the implementation
> effort advantage of a kernel-only thread solution is at least
> partially lost.

Not true.  It depends on how the user and kernel stacks are implemented;
if they are implemented as large zones in the virtual address space,
with guard pages for stack growth, then the mappings need not be changed
on context switch.  The biggest problem will be on RISC architectures,
like SPARC, where register windows need to be flushed (see "SPARC
Register Windows and User Space Threading" from the University of
Washington CS dept... this is the basis of the SunOS 4.x liblwp threads
implementation).


> > A good thread scheduler requires async system calls (not just I/O)
> > and conversion of blocking calls to non-blocking calls plus a context
> 
> It does? [you probably know the following, but for others]. 
> 
> All thread implementations I know details of that are not pure
> userlevel don't change system calls to non-blocking equivalents.

[ ... implementation instances elided ... ]

> The people I listened to so far were all convinced that turning all
> system calls into nonblocking versions will lead to serious
> implementation difficulties, especially if you take further changes
> into account (those will have to be made on two versions of the
> library).

Well, I don't agree with these people.  I think that turning any
blocking call into a thread context switch within a process group
(at best) or a full process context switch (at worst: and more
likely on a system doing something other than running a one process
threading benchmark) is a "serious difficulty".  8-).


> Another concern is that most existing async I/O interfaces
> don't work reliably.

This is a problem.  But it is easily resolved by making a separate
trap gate, or adding a trap gate parameter, to allow all system
calls to be async.

The context is maintained in the trap gate, not in the calls, and
therefore there are no complications to deal with.

The main implementation hurdle is adding a flag to the sysent[]
structure to tag potentially blocking calls vs. non-blocking calls.
Non-blocking calls run to completion (always), and blocking calls
run to completion where they can, and block otherwise.
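
A minimal sketch of that flag and the dispatch decision (the names
SY_MAYBLOCK, sysent_sketch, and async_dispatch are invented for
illustration; this is not FreeBSD's real struct sysent):

```c
#include <assert.h>

/* Illustrative flag for tagging potentially blocking system calls;
 * not FreeBSD's actual sysent[] layout. */
#define SY_MAYBLOCK	0x0001	/* call can sleep on a resource */

struct sysent_sketch {
	int sy_flags;
};

enum { DONE, QUEUED };

/* Dispatch decision for one async call: non-blocking entries always
 * run to completion; may-block entries complete in-line when they can
 * (would_block == 0) and otherwise park for aiowait() pickup. */
int
async_dispatch(const struct sysent_sketch *se, int would_block)
{
	if ((se->sy_flags & SY_MAYBLOCK) && would_block)
		return QUEUED;	/* context cloned; result via aiowait() */
	return DONE;		/* ran to completion */
}
```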

The way this is handled is by sleeping on a "going to block" flag
in the thread context in the non-blocking trap (or call gate) on
potentially blocking calls.  If the call is going to block, it
must do so with a standard system interface, like tsleep.  A tsleep
checks for a block on the call, and returns the context to user
space by cloning the context and putting the call to sleep.  The
flag subsequently notifies the wakeup to queue the response for
handling through an "aiowait" (for lack of a better name) interface.

Because it is handled via aiowait(), the wakeup can be handled in
a queued fashion in the same process context that made the call.
Thus the kernel threads which do the actual processing do not
have to have the process context available.  They do not care.

Since this is the case, the only issues are reentrancy (which we must
solve anyway) and kernel preemption (which we must solve anyway, by
analogy, for CPU reentrancy of the kernel, since it is topologically
equivalent).  The kernel worker threads themselves can be kept in a
"work to do" pool.  Their contribution is the use of their kernel
stack.  Effectively:

	/*
	 * Call in trap gate to set up for return of potentially
	 * non-blocking call made with blocking interface
	 */
	aiosleep( sysent_entry, this_user_context)
	{
		some_kernel_thread = get_worker_thread();

		/* short duration sleep using lazy switching
		 * of non-kernel TLB mapping.  The switch is
		 * an explicit yield to our kernel worker thread.
		 * It will explicitly yield back to us.
		 */
		some_kernel_thread.proxy = this_user_context;
		sleep_and_return( this_user_context, some_kernel_thread);
	}

	/*
	 * call when operation must really block
	 */
	aioblock( some_kernel_thread)
	{
		/*
		 * Block this thread pending I/O completion and cause
		 * system call to return to user space aio scheduler.
		 */
		sleep_and_yield( some_kernel_thread, some_kernel_thread.proxy);
	}

	... etc. ...

Thus potentially blocking calls which do not block return immediately,
while potentially blocking calls which *do* block return a context
to the user space scheduler, which then does a user space context
switch to another thread.
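
As a toy simulation of the completion side of that flow (aio_complete()
stands in for the kernel wakeup queuing a result; none of these names
are a real interface):

```c
#include <assert.h>

#define MAXPENDING 8

/* Toy simulation of the completion path: a call that really blocked
 * eventually queues its result, and the user space scheduler drains
 * the queue through an aiowait()-style interface. */
struct aio_result {
	int id;		/* which outstanding call completed */
	int value;	/* its return value */
};

static struct aio_result pending[MAXPENDING];
static int npending;

/* "kernel" side: a blocked call finished; queue its result */
void
aio_complete(int id, int value)
{
	pending[npending].id = id;
	pending[npending].value = value;
	npending++;
}

/* "user" side: collect one queued completion; -1 if none pending */
int
aiowait(struct aio_result *out)
{
	if (npending == 0)
		return -1;
	*out = pending[--npending];
	return 0;
}
```

The point of the queue is that the wakeup is delivered in the process
context that made the call, whenever that context next asks for it.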


In other words, if the system scheduler gives a quantum to a process,
that quantum belongs to that process, and the process is not forced
to give it back before it has used it up... which is one of the
points of threading: increased use of quantum with decreased context
switch overhead.




> I still like the simplicity of a kernel-only thread solution. If
> that way turns out to be too inefficient, the DEC way seems to be a
> solution that doesn't need async system calls and has no efficiency
> disadvantage I can see (compared to a system with async syscalls
> only).
> 
> I hope to get further details on the DEC implementation.

I hope so too; however, there is little difference between AST's with
context records and blocking->non-blocking conversion.


> > switch in user space (quasi-cooperative scheduling, like SunOS 4.1.3
> > liblwp).  This would result in a kernel thread consuming its full
> > quantum, potentially on several threads, before going on.  One of
> 
> I still don't know why we can't make the kernel keep track of the
> timeslices spent on thread groups and schedule on that basis.

See above...

> > the consequences of this is that sleep intervals below the quantum
> > interval, which will work now, without a high degree of reliability,
> > will now be guaranteed to *not* work at all.  Timing on most X games
> > using a select() with a timeout to run background processing, for
> > instance, will fail on systems that use this, unless a kernel preempt
> > (a "real time" interrupt) is generated as a result of time expiration,
> > causing the select()-held process to run at the time the event occurs,
> > instead of simply scheduling the process to run.  This leads to buzz
> > loop starvation unless you limit the number of times in an interval
> that you allow a process to preempt (ie: drop virtual priority on
> > a process each time it preempts this way, and rest on quantum interval).
> 
> Another reason why I'd like to have only one scheduler (in the kernel).

No.  The fix for this problem is to make SysV systems meet SVID III(RT)
for setitimer/getitimer/gettimeofday/select.  Currently, most SVR4
systems do not meet the strict interpretation of the System V Interface
Definition: they state that these calls will function at system clock
frequency, but implement them only at the frequency at which the
system clock global variable is updated.

For BSD, it means kernel preemption (treat an itimer/select timer
expiration event as an RT event, and make sure that the games have
RT class access -- even if they aren't RT programs themselves).
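
The pattern at issue (an X game using select() with a sub-quantum
timeout as a frame timer) looks like this; a plain POSIX sketch, with
the 20 ms interval as an arbitrary example:

```c
#include <assert.h>
#include <stddef.h>
#include <sys/select.h>
#include <sys/time.h>

/* A sub-quantum select() timeout used as a frame timer.  With only
 * quantum-granular scheduling, the expiry merely marks the process
 * runnable; an RT-class preempt is what makes it actually run on
 * time.  The 20 ms value is an example well below a typical quantum. */
int
frame_tick(long ms)
{
	struct timeval tv;

	tv.tv_sec = ms / 1000;
	tv.tv_usec = (ms % 1000) * 1000;
	/* no descriptors watched: a pure timeout sleep */
	return select(0, NULL, NULL, NULL, &tv);
}
```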

You probably want to do the same thing for mouse event handling via
moused, for the same reasons.  No matter how loaded the system is, you
want to guarantee that when you move the mouse, the pointer will
move.  Otherwise, you violate the proprioception assumptions which
make mouse/cursor interfaces work against human motor/cognitive
expectations ("I move my hand, the thing in my hand moves, the mouse
pointer moves, therefore the mouse pointer is the thing in my hand").


					Regards,
					Terry Lambert
					terry@lambert.org
---
Any opinions in this posting are my own and not those of my present
or previous employers.


