Date:      Fri, 20 Sep 2002 11:03:07 -0700
From:      Terry Lambert <tlambert2@mindspring.com>
To:        Daniel Eischen <eischen@pcnet1.pcnet.com>
Cc:        Bill Huey <billh@gnuppy.monkey.org>, freebsd-arch@FreeBSD.ORG
Subject:   Re: New Linux threading model
Message-ID:  <3D8B62DB.C27B7E07@mindspring.com>
References:  <Pine.GSO.4.10.10209200002280.2162-100000@pcnet1.pcnet.com>

Daniel Eischen wrote:
> On Thu, 19 Sep 2002, Bill Huey wrote:
> >       http://marc.theaimsgroup.com/?l=linux-kernel&m=103248252713576&w=2
> >       http://people.redhat.com/drepper/nptl-design.pdf
> 
> I read some of this and some of it is exactly opposite of why
> scheduler activations was made in the first place.  They are
> pushing all scheduling decisions and locking in to the kernel.
> One of the points of scheduler activations is that the library
> can make all scheduling decisions without need for having
> the kernel involved.

Please take these comments in the context of "what this paper means
to FreeBSD", rather than as a criticism of the Linux implementation;
I'll point out what I think are fallacies, when I see them, but the
main emphasis has to be "what does this mean to FreeBSD-arch"...


I find it incredibly amusing that, in their 10 listed goals for
the system, the original design goal that resulted in the idea
of threads becoming widespread in the first place, the idea that
there would be less *runtime* overhead for using threads instead
of processes, seems to have been lost.  Threads now have other
goals, but hardware scaling and NUMA support didn't even make the
top 5.  I also fail to see how they have addressed the NUMA goal,
except tangentially, through the elimination of the "helper thread".

FreeBSD must first consider that its "top ten goals" for a threads
implementation are different than those of Linux, and see the Linux
design in light of that.

Here are my comments; they are in order against the content of the
paper, so it should be easy to read the comments linearly, with
the paper in hand.


The argument about collaborative scheduling being a reason to
abandon an N:M model for a 1:1 model is, I think, specious.  It
assumes certain things about the kernel implementation of the
kernel threads scheduling.  Specifically, it tries to duck the
thread group affinity issue, and simplify the CPU affinity issue.

Perhaps this is the reason that the original goal of reduced
runtime overhead vis-a-vis processes has been abandoned: trying
to achieve this with explicit code in the scheduler is an
NP-complete problem.  The thread group affinity can be addressed
through the use of scheduler activations; but the CPU affinity for
kernel threads -- and the CPU *negaffinity* for kernel threads
which are members of the same thread group -- can not really be
adequately addressed in the context of intentional scheduling
and CPU migration.  In other words, they can not be addressed in
the kernel scheduler, they *must* be addressed from a data-flow
perspective (hence my occasional advocacy of per CPU scheduling
queues, with queue migration being a relatively rare, and also a
push-based event, rather than a relatively frequent, and also
a pull-based event).
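
As a concrete (and purely illustrative) sketch of what I mean by
per CPU queues with push-based migration; none of this is FreeBSD
scheduler code, and all of the names are made up:

    #include <pthread.h>
    #include <stddef.h>

    #define NCPU 4                      /* assumed CPU count for the sketch */

    struct thread {
            struct thread  *next;
            int             tid;
    };

    struct cpu_runq {
            pthread_mutex_t lock;       /* per-CPU lock, rarely contended */
            struct thread  *head;       /* runnable threads for this CPU only */
            int             load;
    };

    /* Assume each runq[i].lock is set up with pthread_mutex_init(). */
    static struct cpu_runq runq[NCPU];

    /* Ordinary enqueue: a thread always returns to its own CPU's queue,
     * so affinity is the default, not something the scheduler computes. */
    static void
    runq_enqueue(int cpu, struct thread *td)
    {
            pthread_mutex_lock(&runq[cpu].lock);
            td->next = runq[cpu].head;
            runq[cpu].head = td;
            runq[cpu].load++;
            pthread_mutex_unlock(&runq[cpu].lock);
    }

    /* Migration is the rare, explicit push: the loaded CPU decides to
     * hand one thread to another queue, rather than idle CPUs pulling
     * work and destroying affinity as a side effect. */
    static void
    runq_push(int from, int to)
    {
            struct thread *td = NULL;

            pthread_mutex_lock(&runq[from].lock);
            if (runq[from].head != NULL) {
                    td = runq[from].head;
                    runq[from].head = td->next;
                    runq[from].load--;
            }
            pthread_mutex_unlock(&runq[from].lock);
            if (td != NULL)
                    runq_enqueue(to, td);
    }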

By going to 1:1, they dodge this issue; but in so doing, they
increase thread overhead to that of processes: threads become
a mechanism for sharing some resources, the moral equivalent of
the old rfork()/sfork() system calls for heap and descriptor
space sharing, with separate stacks.
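
For the uninitiated, the "moral equivalent" looks roughly like the
following on FreeBSD; this is only a sketch using rfork_thread(3)
(where available), not how any threads library is actually built:

    #include <sys/types.h>
    #include <sys/wait.h>
    #include <err.h>
    #include <stdio.h>
    #include <stdlib.h>
    #include <unistd.h>

    #define STACKSIZE (64 * 1024)

    static int
    worker(void *arg)
    {
            /* Shares the parent's heap and descriptor table, but runs
             * on its own stack. */
            printf("child pid %d\n", (int)getpid());
            _exit(0);
    }

    int
    main(void)
    {
            char *stack;
            pid_t pid;

            if ((stack = malloc(STACKSIZE)) == NULL)
                    err(1, "malloc");

            /* RFPROC: new process; RFMEM: share the address space.
             * The stack grows down, so pass its top. */
            pid = rfork_thread(RFPROC | RFMEM, stack + STACKSIZE,
                worker, NULL);
            if (pid == -1)
                    err(1, "rfork_thread");
            waitpid(pid, NULL, 0);
            return (0);
    }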

The procfs issue is really a red herring; specifically, it should
have been possible to change the procfs interface so that 100,000
PIDs had a log2(100,000)+1 (about 17 probes), O(log N), lookup
time - LL[1].
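
To put a number on that: the worst case probe count for a binary
search (or balanced tree walk) over 100,000 entries is 17, which
is trivial to check:

    #include <stdio.h>

    int
    main(void)
    {
            int n = 100000, probes = 0;

            /* floor(log2(n)) + 1 comparisons in the worst case */
            while (n > 0) {
                    n >>= 1;
                    probes++;
            }
            printf("max probes for 100,000 PIDs: %d\n", probes);
            return (0);
    }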


The kernel space overhead that they describe with regard to the
user space threads scheduler context switch is predicated on the
fact that the context switch must reload a segmentation register,
%FS or %GS, which, being a protected resource, can not be set from
user space, and so requires a kernel call in order to support the
implementation of thread local storage chosen.
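
By way of illustration only (the TCB layout here is made up, and
the sysarch(2) command used below exists only on sufficiently
recent FreeBSD versions), the kernel round trip in question is
essentially:

    #include <machine/sysarch.h>

    struct tcb {
            struct tcb *tcb_self;   /* %gs:0 conventionally points here */
            void       *tcb_tls;    /* thread local storage block */
    };

    /* One trap per user-level context switch, which is exactly the
     * overhead the paper complains about.  (i386 shown; amd64 would
     * use AMD64_SET_FSBASE.) */
    static int
    switch_tls(struct tcb *tcb)
    {
            return (sysarch(I386_SET_GSBASE, &tcb));
    }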

I am not personally familiar with the FreeBSD scontext(2) call
that has been discussed as an implementation detail in some of
the recent postings by Daniel Eischen, in which he implied
Jonathan Mini will be implementing the call.  Perhaps I should
have been watching this more closely, so that I could comment
more intelligently on what it's supposed to do for the FreeBSD
implementation that can not be done without it: the Linux people
may have a valid argument here which FreeBSD should heed.
However... such a call is necessary to deal with register windows
on SPARC processors in any case, and the point is largely moot,
since trading a user context switch plus a kernel call for a
kernel context switch hardly reduces overhead.

I think the "the Linux scheduler is O(1), so we don't have to
care" argument is invalid; O(1) only means that each individual
scheduling decision takes constant time, no matter how many
entries are on the run queue.  It does not mean the constant is
small, and it does not make the aggregate cost disappear: as the
number of schedulable entities goes up, the total time spent
scheduling them still goes up linearly, as opposed to *not* going
up in the first place.


The signal handling argument is rather specious.  The first thing
that should be pointed out in this regard is that the overhead
involved in signal processing being funneled through a single
thread presumes an implementation of a monitor thread, which is
not necessarily a requirement.  But even granting that argument
for a naive implementation, signals are, in a threaded application,
exceptional conditions: they are not intended to be used as IPC
mechanisms, and, in fact, the POSIX definition of the interaction
of signals and threads grants sufficient latitude that they are
not in fact useful for this purpose in portable software.  Instead,
POSIX defines a large number of new facilities to be used for IPC
in a threaded environment, such that signals are not necessary.
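
For completeness, the portable alternative alluded to above is to
keep signals out of the IPC path entirely: block them everywhere
and collect them synchronously in one place.  A minimal sketch
(error handling omitted):

    #include <pthread.h>
    #include <signal.h>
    #include <stdio.h>

    static void *
    signal_collector(void *arg)
    {
            sigset_t set;
            int sig;

            sigemptyset(&set);
            sigaddset(&set, SIGHUP);
            sigaddset(&set, SIGTERM);
            for (;;) {
                    /* Signals arrive here synchronously; hand them to
                     * the rest of the program with condvars, pipes,
                     * or whatever IPC the application already uses. */
                    if (sigwait(&set, &sig) == 0)
                            printf("got signal %d\n", sig);
            }
            return (NULL);
    }

    int
    main(void)
    {
            sigset_t set;
            pthread_t tid;

            /* Block before creating any threads, so every thread
             * inherits the mask and only sigwait() ever sees these. */
            sigemptyset(&set);
            sigaddset(&set, SIGHUP);
            sigaddset(&set, SIGTERM);
            pthread_sigmask(SIG_BLOCK, &set, NULL);

            pthread_create(&tid, NULL, signal_collector, NULL);
            /* ... real work happens in other threads ... */
            pthread_join(tid, NULL);
            return (0);
    }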

The second thing that should be pointed out is that delivery of
signals is to a particular *kernel* thread.  While this means
delivery to user space, it does not mean delivery to a specific
user space thread, which might then be a bottleneck; rather, it
means delivery in order to communicate the signal to the user space
scheduler -- from which it may then be appropriately dispatched.
In the N:M case, the relative overhead only approaches what the
paper claims when N >> M; but the common implementation is to have
M ~= the total number of simultaneous blocking contexts, plus one.
So while it's true that N != M, it *isn't* true that N ~= M, and
it isn't true that N >> M.


I think that FreeBSD can learn from the Linux analysis of the signal
handling, but that it should *not* learn the implementation detail
of the 1:1 relationship between kernel and user threads.  The paper
outlines almost all of the important issues in signal handling, but
that does not necessarily mean that the solution chosen is the most
appropriate solution for FreeBSD.


The "helper thread" is a concept which first appeared in the SVR4
derived Solaris ~2.4 era, and was echoed into the SVR4.2 threads
implementation.  The need for this thread is an artifact of a
limited kernel upcall capability, with the need for the user space
to execute code on behalf of kernel events.

I agree that a "helper thread" can be a significant bottleneck;
but a design need not utilize a helper thread.  The main purpose
of such a thread is to intermediate in the creation and completion
of other threads.  The reason this is necessary is the fact that
the kernel is permitted to block operations in times of low
resources, and one of these operations so blocked may be the
creation of new threads, etc..

The problem which this is solving is the lack of an asynchronous
means of posting a request to the kernel, yet continuing to run
in user space, despite the fact that the request posted may not
complete immediately.  In other words, it is a call conversion
mechanism intended to deal with the fact that certain calls are
blocking calls which have no asynchronous counterparts for use in
call conversion:

NB: For the uninitiated: call conversion is a means of implementing a
threads system, by trading a blocking call for a non-blocking call
plus a context switch; obviously, this does not work, if there is no
non-blocking equivalent for the blocking call in question.  FreeBSD
recently had to deal with this issue with the sendfile(2) system call,
which has poor semantics relative to the use of socket descriptors
that have been marked as non-blocking.
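
A toy example of what call conversion looks like inside a user
space threads library; the uthread_block_on_fd() hook is
hypothetical, standing in for the library's internal poll/kevent
loop, and the descriptor is assumed to have been set O_NONBLOCK:

    #include <errno.h>
    #include <poll.h>
    #include <unistd.h>

    /* Hypothetical: park the current user thread, run another, and
     * resume this one when the descriptor is ready. */
    void uthread_block_on_fd(int fd, short events);

    ssize_t
    uthread_read(int fd, void *buf, size_t nbytes)
    {
            ssize_t n;

            for (;;) {
                    n = read(fd, buf, nbytes);
                    if (n >= 0 || errno != EAGAIN)
                            return (n);
                    /* The "blocking" call becomes a non-blocking call
                     * plus a user-level context switch. */
                    uthread_block_on_fd(fd, POLLIN);
            }
    }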

Alternative ways of dealing with this problem are an asynchronous
call gate (which is the most general possible mechanism, but needs
a new system call entry point), and scheduler activations.  In other
words, getting rid of the "helper thread" does not require the 1:1
model adoption, as the paper implies.  However, the paper *also*
implies that it is necessary to go to this model to avoid the
serialization (though it does not come out and say this explicitly);
this implication is incorrect.

Editorial Note: I will note for the record that FreeBSD is now
considering an alternate system call entry point for future work,
to permit the ABI to change over time for the 64 bit conversion,
so that the prior ABI can be dropped at some point in the future.
The desire to *NOT do this* was what led to the choice of scheduler
activations instead of an async call gate mechanism in FreeBSD's KSE
implementation.  I would *STRONGLY* recommend that, should a new ABI
system call entry point come to pass, that it include an additional
initial parameter for an async call context, which, if NULL, is then
treated as a request for the call to be synchronous.
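
To make the convention concrete, the entry point I have in mind
would look something like the following; the names and the context
structure are entirely hypothetical:

    /* Filled in by the kernel as the request progresses. */
    struct async_call_ctx {
            volatile int    acc_state;      /* pending / complete / error */
            long            acc_result;     /* return value once complete */
            void          (*acc_notify)(struct async_call_ctx *); /* optional upcall */
    };

    /*
     * If ctx is NULL, the call behaves exactly like today's synchronous
     * system call; otherwise the kernel queues the request, returns to
     * user space immediately, and records completion in *ctx.
     */
    long __syscall_async(struct async_call_ctx *ctx, int number, ...);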


In section 5.4, they argue traversal overhead.  They argue that the
list of threads must be traversed by the "helper thread" when a
process exits, and that having the kernel do this "if process exist"
[SIC - should be "exits"] is somehow less trouble -- but a traversal
is a traversal.

Eventually, they conclude that the list must remain for fork(2)
handling in any case.

FreeBSD could learn from a number of optimizations in this area; I
personally do not like the generation count approach, and would
want to use the space for a reverse pointer list linkage instead
(as one example), but the point about where the overhead lies is
a valid one.


The "futex" implementation is a good idea.  I have long been an
advocate against the use of recursive locking, because, IMO, it
makes programmers lazy, and leads to extra locking which should
be avoided.  It also has significantly higher overhead, compared
to non-reentrant versions.  I dislike locking overhead, particularly
when it's gratuitous (IMO).  FreeBSD could learn a lot here.
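
The shape of the win, in a sketch: the uncontended paths are a
single atomic operation in user space, and the kernel is entered
only when there is a waiter to put to sleep or wake up.  The
futex_wait()/futex_wake() functions below are stand-ins for the
Linux futex(2) operations (or whatever kernel sleep/wakeup
primitive is used); the lock is deliberately non-reentrant:

    #include <stdatomic.h>

    void futex_wait(atomic_int *addr, int expected); /* sleep while *addr == expected */
    void futex_wake(atomic_int *addr, int nwake);    /* wake up to nwake sleepers */

    /* 0 = unlocked, 1 = locked, 2 = locked with waiters */
    static atomic_int lock_word;

    static void
    fulock(void)
    {
            int c = 0;

            /* Fast path: 0 -> 1, no system call at all. */
            if (atomic_compare_exchange_strong(&lock_word, &c, 1))
                    return;
            /* Slow path: mark the lock contended and sleep until our
             * exchange is the one that finds it unlocked. */
            if (c != 2)
                    c = atomic_exchange(&lock_word, 2);
            while (c != 0) {
                    futex_wait(&lock_word, 2);
                    c = atomic_exchange(&lock_word, 2);
            }
    }

    static void
    fuunlock(void)
    {
            /* Only trap into the kernel if someone may be sleeping. */
            if (atomic_exchange(&lock_word, 0) == 2)
                    futex_wake(&lock_word, 1);
    }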


I like the DTV optimization in the memory allocation arena.  I've
advocated a similar approach in the VFS, where the vnode and the
per VFS data structure would be allocated at the same time, with
the vnode ownership belonging to the VFS, rather than being a
global system resource that's contended among all VFS's.
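
The shape of that advocacy, as a sketch (this is not how FreeBSD's
getnewvnode() works today; the structures are invented for the
example):

    #include <stdlib.h>

    struct vnode {
            int     v_type;
            void   *v_data;             /* points at the fs-specific part */
    };

    struct myfs_node {
            struct vnode    mn_vnode;   /* generic part, embedded first */
            long            mn_inum;    /* fs-specific fields, same allocation */
    };

    static struct vnode *
    myfs_alloc_vnode(long inum)
    {
            struct myfs_node *mn;

            /* One allocation, one free, owned by the filesystem rather
             * than contended out of a global pool. */
            if ((mn = calloc(1, sizeof(*mn))) == NULL)
                    return (NULL);
            mn->mn_inum = inum;
            mn->mn_vnode.v_data = mn;
            return (&mn->mn_vnode);
    }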

The freelist on deallocation is an obvious optimization.  They
imply the existence of a high watermark facility for the freelist
being needed, but don't specify whether or not this is actually
implemented.  IMO, this is mostly an issue for test programs, and
for changing load characteristics for a single machine (e.g. an
HTTP server that gets hit very hard, starts 100,000 threads, and
then doesn't free the memory back to the system, and then is
expected to act as a database server, instead, and needs to
reassign resources -- particularly memory -- to the new use).  In
practice, this is probably not an issue, but it's probably a nice
benchmark optimization that isn't really otherwise harmful for
performance on real loads, so it's something to consider.
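
For reference, the high watermark version of such a freelist is
only a few lines; the structure and the limit below are invented
for the example, and locking is omitted:

    #include <stdlib.h>

    #define TD_FREE_MAX 64          /* high watermark: cache at most this many */

    struct thread_struct {
            struct thread_struct *next;
            /* stack, TCB pointer, etc. would live here */
    };

    static struct thread_struct *td_freelist;
    static int td_freecount;

    static void
    td_free(struct thread_struct *td)
    {
            if (td_freecount >= TD_FREE_MAX) {
                    free(td);               /* above the mark: return to the system */
                    return;
            }
            td->next = td_freelist;         /* below the mark: keep for reuse */
            td_freelist = td;
            td_freecount++;
    }

    static struct thread_struct *
    td_alloc(void)
    {
            struct thread_struct *td;

            if ((td = td_freelist) != NULL) {
                    td_freelist = td->next;
                    td_freecount--;
                    return (td);
            }
            return (calloc(1, sizeof(*td)));
    }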


As for the kernel improvements, FreeBSD could learn optimizations
from them.  I don't think that this is a critical area for 5.0,
but it's probably something that needs to be addressed for 5.X,
going forward.

The forking optimizations are particularly salient, I think.  The
signal handling is, I think, self-serving: it's a strawman that
justifies the design decisions they've already made.  As I pointed
out already, signals are not the preferred IPC mechanism, in any
sense, in the POSIX threads environment; I will go further: I
claim that an interactive program which must deal with STOP/CONT
will in fact almost never have more than 5-9 threads, based on
the fact that it's interacting with a human who (on average)
can only keep 5-9 things in short term memory at a time... contexts
above and beyond that in an interactive program are simply bad
design, and even if we increase that number to 20, it's not a
significant source of overhead.  Most of the other changes are
what I would class as housekeeping tasks.  One exception is the
use of "futex" wakeup in order to improve thread joins: FreeBSD
should look closely at this.
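
The join optimization, in miniature: the thread's exit word is
itself a futex, so a join is one wait and an exit is one store
plus one wake, with no helper thread in between.  As before,
futex_wait()/futex_wake() are stand-ins and the structure is
invented:

    #include <stdatomic.h>

    void futex_wait(atomic_int *addr, int expected);
    void futex_wake(atomic_int *addr, int nwake);

    struct uthread {
            atomic_int      exited;     /* 0 = running, 1 = done */
            void           *retval;
    };

    void
    uthread_exit(struct uthread *t, void *retval)
    {
            t->retval = retval;
            atomic_store(&t->exited, 1);
            futex_wake(&t->exited, 1);          /* wake a joiner, if any */
    }

    void *
    uthread_join(struct uthread *t)
    {
            while (atomic_load(&t->exited) == 0)
                    futex_wait(&t->exited, 0);  /* sleep until exit flips the word */
            return (t->retval);
    }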


All in all, quite the interesting paper.


-- Terry




