Date:      Wed, 26 Mar 2003 03:36:57 -0500 (EST)
From:      Jeff Roberson <jroberson@chesapeake.net>
To:        Julian Elischer <julian@elischer.org>
Cc:        kse@elischer.org
Subject:   Re: 1:1 Threading implementation.
Message-ID:  <20030326031245.O64602-100000@mail.chesapeake.net>
In-Reply-To: <Pine.BSF.4.21.0303252335280.22804-100000@InterJet.elischer.org>

On Wed, 26 Mar 2003, Julian Elischer wrote:
> On Tue, 25 Mar 2003, Jeff Roberson wrote:
>
> > Thanks to the foundation provided by Julian, David Xu, Mini, Dan Eischen,
> > and everyone else who has participated in KSE and libpthread development,
> > Mini and I have developed a 1:1 threading implementation.  This code works
> > in parallel with KSE and does not break it in any way.  It actually helps
> > bring M:N threading closer by testing out shared bits.
>
> The current design was done specifically so that the component parts
> could be recombined in different groupings to give different threading
> models. This was one of the models considered when the group
> discussed it. I'm glad that it is working.

Yep, that was a good design goal.

> >
> > I have successfully run mozilla 1.2.1 using this threading package.  It
> > still has some bugs and some incomplete corners but we're very close to
> > being able to commit this.  I'm going to post a link to the kernel portion
> > of this code at the end of this mail.  The library will come later.
>
> I wondered what was going on there. There's been a tremendous silence on
> the userland side of things.

Well, I wasn't doing userland stuff until three days ago.  I think Mini
has just been very busy with work.  I suspect that you're going to need
to start doing userland work or find someone to do it if you want to get
it done soon.

> >
> > What this means is that for every pthread in an application there is one
> > KSE and thread.  There is also only one ksegroup per proc in this model.
> > Since the kernel knows about all threads it handles all scheduling
> > decisions and all signal delivery.  I have followed the POSIX spec while
> > implementing the signal code.  I would really appreciate review from
> > anyone who is intimately familiar with signals and threads.  Included in
> > this is an implementation of sigwait(), sigtimedwait(), and sigwaitinfo().
>
> Wouldn't it have been easier to have one KSEGRP+KSE+thread per user
> thread? Having one ksegrp and many KSEs requires changing the kernel
> code where doing it the other way you could do it without making any
> changes.

I don't understand?  There are relatively minor changes to the kernel to
support this.  Since nice is a property of the process, it makes sense
that there is only one ksegrp per process.  I'm starting to think that the
ksegrp was overkill in general.

> Specifically since my plan is to make the 'KSE' structure go away
> (by which I mean it is only going to be visible within the particular
> thread_scheduler that uses it, and that externally
> the only structures visible would be
> proc, ksegrp (subproc?), thread and upcall).

For M:N I really think this should be proc, thread, and upcall.
For 1:1 I only need proc and thread.

> The KSE would be allocated only by a call into the scheduler and is part
> of the "scheduler specific private data".
>
> i.e. on creation of a new process, sched_newproc() is called
> and a KSE is added in there if the scheduler in question wants to use
> KSEs. If it doesn't, no KSE would be added, but it's still possible that

Yes, I think we need more sched hooks here as well.  Having only
sched_fork() makes things sort of gross.  We'll have to hook this all up
later.

> some scheduler specific storage might be added. In the case
> of a new upcall being declared (kse_create() (to be renamed))
> sched_make_threaded() is called which adds KSEs to the KSEGRP
> (I was going to change it to be called a subprocess).
> KSEs are an accounting aid for the scheduler. A different scheduler may
> decide to put threads themselves onto the run queues which would
> make KSEs un-needed. (for example)
>
> >
> > The user land mutexes are supported by kernel code.  Uncontested acquires
> > and releases are done entirely in application space using atomic
> > instructions.  Once there is contention the library falls back to system
> > calls to handle the locks.  There are no per lock kernel resources
> > allocated.  There is a user space safe atomic cmpset function that has
> > been defined for x86 only at the moment.  New architectures require only
> > this function and the *context apis to run this threading package.  There
> > is no arch specific code in user space.
>
> This was discussed recently as being the highlight of someone's
> threading model (I think Linux but I am not sure whose).

Yes, Linux was discussing this.  It's a pretty common trick.  Even NT does
it but apparently NT allocates kernel resources for user locks.  I was
pretty pleased that I got away without any per lock allocations.
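
A minimal sketch of the fast path described above, assuming C11 atomics in
place of the x86 cmpset function from the patch.  The names umtx_sketch,
umtx_init, umtx_try_acquire, umtx_acquire and umtx_release are hypothetical
and are not taken from the diff.  The contended path here only yields and
retries; the real library falls back to system calls at that point, and the
bookkeeping needed to wake waiters on release is left out.

```c
#include <assert.h>
#include <sched.h>
#include <stdatomic.h>
#include <stdbool.h>

/* Hypothetical lock word: UMTX_FREE when unowned, else the owner's id. */
#define UMTX_FREE 0L

typedef struct {
    atomic_long owner;
} umtx_sketch;

static void umtx_init(umtx_sketch *m)
{
    atomic_init(&m->owner, UMTX_FREE);
}

/* Fast path: a single compare-and-set in user space, no system call. */
static bool umtx_try_acquire(umtx_sketch *m, long self)
{
    long expected = UMTX_FREE;

    return atomic_compare_exchange_strong_explicit(&m->owner, &expected,
        self, memory_order_acquire, memory_order_relaxed);
}

/*
 * Stand-in for the contended path.  This sketch yields the CPU and
 * retries; it does not sleep in the kernel as the real library would.
 */
static void umtx_acquire(umtx_sketch *m, long self)
{
    while (!umtx_try_acquire(m, self))
        sched_yield();
}

/* Release: swap our id back to UMTX_FREE.  Only the owner may do this. */
static void umtx_release(umtx_sketch *m, long self)
{
    long expected = self;
    bool ok = atomic_compare_exchange_strong_explicit(&m->owner, &expected,
        UMTX_FREE, memory_order_release, memory_order_relaxed);

    assert(ok);
    (void)ok;
}
```

Nothing is allocated per lock here either: the only state is the lock word
itself, which lives in the application's memory.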

> >
> > The condition variables and other blocking situations are handled with
> > sig*wait*() and a new signal, SIGTHR.  There are many reasons that we went
> > with a signal here.  If anyone cares to know them, you may ask.
> >
> > There are only 4 system calls for threading. thr_create, thr_self,
> > thr_exit, and thr_kill.  The rest of the functionality is implemented in a
> > library that has been heavily hacked up from the original libc_r.
> >
> > The reason we're doing this in parallel with the M:N effort is so that we
> > can have reasonable threading sooner.  As I stated before, this project is
> > complementary to KSE and does not prohibit it from working.  I also think
> > that the performance will be better or comparable in the majority of real
> > applications.
>
> My only comment is that since mini is supposed to be doing the
> M:N library, isn't this a bit of a distraction?

I'll let him comment on this.

> >
> > The kernel bits are available at
> > http://www.chesapeake.net/~jroberson/thr.diff
>
> Please explain what this means:
> -       mask = td->td_proc->p_sigmask;
> +       mask = td->td_sigmask;
>
>
> how can you have a per thread mask?
> Signals are masked for the entire process..
> How do you keep them in sync with each other?

As per POSIX each thread has a signal mask.  The sigaction is per process,
but the mask and the pending set are per thread.  This has to be the case
even for M:N although some of it is hidden by the UTS.  libc_r even keeps
per thread pending and mask bits.

> -       if (p1->p_flag & P_THREADED) {
> +       if (p1->p_flag & P_THREADED || p1->p_numthreads > 1) {
>
> If you are running threads, please set the P_THREADED flag.
> If you want to differentiate between upcalling threads and 1:1
> threads, please use some auxiliary flag.

I'd rather not have a flag.  The > 1 check is used only in places where we
have to suspend multiple threads or go to single threading etc.  Processes
in the 1:1 threading model aren't so special as they are with KSE.  They
don't need to be treated specially except when we're trying to funnel them
down etc.

> You should be creating a new KSEGRP (subproc) per thread.
> I think you will find that if you do, things will fall out easier
> and you won't break the next KSE changes.

I don't understand what I may break?

> >
> > I'd like to get the signal code committed asap.  It's the majority of the
> > patch and I often have to resolve conflicts.  There have been no
> > regressions in KSE or non threaded applications with this signal code.
>
> I'm not against having a separate 1:1 thread capability, but
> all this work could have been well spent getting M:N threads
> better supported and even getting it to
> be able to run in 1:1 mode as a byproduct.

I don't think M:N is the way to go.  After looking things over and
considering where it is theoretically faster I do not think it is a
worthwhile pursuit.

First off, it is many months away from being even beta quality.  I think
the UTS is far more complicated than you may realize.  There are all sorts
of synchronization issues that it was able to avoid before since only one
thread could run at any time and there essentially was no preemption.  It
now also has to deal with efficient scheduling decisions in an M:N model
that it didn't have to worry about before.

Aside from that, there are numerous problems with the kernel not being
able to identify individual threads of execution.  Debugging, scheduling,
profiling, ktrace are all more difficult in an M:N environment.  I think it
is going to contribute to less efficient scheduling decisions overall.  I
have already wrestled with this in ULE.

I feel that this is an overwhelming amount of complexity.  Because of this
it will be buggy.  Sun claims that they still have open tickets on their
M:N while their new 1:1 implementation is totally bug free.  How long have
they been doing M:N?  I don't think that with our limited resources we're
going to be able to do better.

Furthermore, M:N's basic advantages are less overhead from staying out of
the kernel and fewer per-thread resources.  I think this is bogus for a
couple of reasons.

First, if your application has more threads than cpus it is written
incorrectly.  People who use thread pools instead of event-driven I/O
models will encounter the same overhead with M:N as with 1:1.
I'm not sure what applications are entirely compute and have more threads
than cpus.  These are the only ones which really theoretically benefit.  I
don't think our threading model should be designed to optimize poorly
thought out applications.

Furthermore, the amount of work done per slice has been growing with
processor speeds.  Slice time is adjusted for user experience and so it
remains constant.  This means that the constraints are different from when
this architecture started to come about many (10 or so?) years ago.
Trying to optimize context switches between threads just doesn't make
sense when you do so much work per slice.

Then if you look at the number of system calls and shenanigans a UTS must
do to make proper scheduling decisions it doesn't look like such an
advantage.  I feel that the overhead of all the layers comes close to the
savings from doing some of it without entering the kernel.

In short, even if it is marginally faster, it doesn't seem like it is
worth the effort and risk.  I don't want to discourage you from trying but
this is why I stopped working on KSE proper and pursued the 1:1 model.

Cheers,
Jeff


