Date: Sat, 6 Mar 1999 19:13:21 +0000 (GMT)
From: Terry Lambert <tlambert@primenet.com>
To: jplevyak@inktomi.com (John Plevyak)
Cc: tlambert@primenet.com, jplevyak@inktomi.com, hackers@FreeBSD.ORG
Subject: Re: lockf and kernel threads
Message-ID: <199903061913.MAA08788@usr06.primenet.com>
In-Reply-To: <19990305080618.B22589@tsdev.inktomi.com> from "John Plevyak" at Mar 5, 99 08:06:18 am
> > > Why is it too late after that?  In the patch I did the wait in exit1()
> > > right after the 'kill' of the peers.
> >
> > It's too late if you've gone to user space with the signal, because
> > it's an untrappable signal.  That's the trampoline I was referring
> > to.
>
> I understand about the trampoline, but I don't see why you can't send
> the signal out of user space, mostly because that is what 3.0+
> currently does! (near the beginning of exit1() to all peers of a p_leader.)

Assume a kill -9 of a PID that is a "thread" in a "thread group"
(process).  The exit call is not explicit.  The SIGKILL is mandatorily
not caught.  The kill comes from the kernel, with no user space
notification.

> > Right.  Such a structure is essentially identical in content to
> > an async call context for an async call gate, BTW.  8-).
>
> I am missing context on this, however if it was a mechanism for
> generalized async calls w/o signals/polling but 'wait for one
> of N events' I am interested.

Yes, that's exactly it.

Basically, you use an alternate entry point to the system call trap
code, and immediately return after queuing the trap to the kernel.

The context you need is a kernel stack for the call, a kernel program
counter, etc.  Basically, the call runs as far as it can (potentially
to completion), and instead of putting the process to sleep, puts the
call context to sleep, returning the remainder of the quantum to user
space.  (A sketch of such a call context appears below, after the
classification list.)

This provides SMP scalability (call contexts can be serviced by
multiple CPU's) without requiring strict kernel threads to implement.

You can program with the calls directly (like the current AIO code)
to get operation interleave, or you can implement a user space call
conversion scheduler, which converts standard calls into async calls
plus a thread context switch (see the wrapper sketch following the
list below).

The advantage of implementing call conversion this way is that the
call wrapping code is then generic, and no additional work needs to
be done for threads, other than wrapping the conversion.  The
implementation is thus drastically simpler, and has drastically
lower overhead, than the current call conversion scheduler, which
must do wrapping and locking in user space.

In addition, this implementation has significantly smaller context
switch overhead compared to traditional kernel threading models,
since making a blocking system call is not the same thing as an
explicit scheduler yield, as it is with kernel threads.  When the
scheduler assigns a quantum, it's your quantum.

You can deal with multiple CPU's being in user space simultaneously
in the same process trivially: you just allow more than one CPU to
return up with the code from the converted call, and enter your call
conversion scheduler -- the real reason for threads is to provide
procedural work-to-do for a quantum as a scheduled resource.

There are some obvious optimizations, as well.  You could easily
classify system calls with another field in the struct sysent[]
element for a given call, and act accordingly, by type (a code
sketch follows the list):

o	call may block

	Treat as the unoptimized case, and return when you hit a
	tsleep that blocks the call.

o	call never blocks

	Do not allocate a call context, and trivially return a
	static context for completion.

o	call always blocks

	Obtain parameters and immediately return so that the long
	latency event (e.g., tty I/O) does not detract from overall
	throughput.
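To make the classification concrete, here is a minimal sketch of
what the table extension and the alternate trap entry might look
like.  Everything here is invented for illustration: the real
struct sysent has no sy_blockclass field, and the asynccall_*()
helpers do not exist.  It is a sketch of the technique, not
working kernel code:

    #include <sys/errno.h>	/* EINPROGRESS */

    /* Hypothetical per-call classification. */
    enum sy_blockclass {
            SY_MAYBLOCK,	/* run until a tsleep() blocks it */
            SY_NEVERBLOCK,	/* e.g. getpid(): no context needed */
            SY_ALWAYSBLOCK	/* e.g. tty read: queue, return at once */
    };

    /* Hypothetical sysent[] entry; the real one differs. */
    struct sysent {
            int     sy_narg;			/* argument count */
            int     (*sy_call)(void *args, int *retval);
            enum sy_blockclass sy_blockclass;	/* hypothetical field */
    };
    extern struct sysent sysent[];

    /* Invented helpers: allocate, queue, and run a call context. */
    struct asynccall *asynccall_alloc(struct sysent *se, void *args);
    void asynccall_enqueue(struct asynccall *ac);
    int asynccall_run(struct asynccall *ac);

    /* Hypothetical alternate trap entry: dispatch on the class;
     * the *process* is never put to sleep here. */
    int
    async_syscall(int code, void *args, int *retval)
    {
            struct sysent *se = &sysent[code];

            switch (se->sy_blockclass) {
            case SY_NEVERBLOCK:
                    /* Complete synchronously; a static context suffices. */
                    return ((*se->sy_call)(args, retval));
            case SY_ALWAYSBLOCK:
                    /* Copy in the parameters, queue it, return at once. */
                    asynccall_enqueue(asynccall_alloc(se, args));
                    return (EINPROGRESS);
            default: /* SY_MAYBLOCK */
                    /* Run to completion or to the first blocking
                     * tsleep(); on a block, the context sleeps. */
                    return (asynccall_run(asynccall_alloc(se, args)));
            }
    }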
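And to make the earlier description of the call context itself
concrete, a sketch of roughly what would have to live in it; again,
every name is invented for illustration:

    #include <sys/types.h>
    #include <sys/queue.h>

    /*
     * The state that sleeps in place of the process: a private
     * kernel stack and a saved kernel program counter for the
     * call, plus bookkeeping to complete the call on wakeup.
     */
    struct asynccall {
            TAILQ_ENTRY(asynccall) ac_link;	/* run/sleep queue linkage */
            struct proc     *ac_proc;	/* originating process */
            int             ac_code;	/* system call number */
            caddr_t         ac_kstack;	/* private kernel stack */
            void            *ac_kpc;	/* saved kernel program counter */
            register_t      ac_args[8];	/* copied-in call arguments */
            register_t      ac_retval;	/* completion value */
            int             ac_error;	/* errno-style status */
            int             ac_flags;	/* pending/sleeping/complete */
    };

    /*
     * When the call hits a blocking tsleep(), the sleep code would
     * park the asynccall on a sleep queue and return the rest of
     * the quantum to user space; wakeup() requeues the context for
     * whatever CPU becomes available next.
     */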
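Finally, the user space half.  A call conversion wrapper around a
standard call might look something like this, assuming hypothetical
async_submit() and uthread_block_until() primitives (the first the
alternate call gate's submit operation, the second a user threads
scheduler entry that runs other threads until the result is posted).
The point is that one generic pattern covers every call, with no
per-call locking:

    #include <errno.h>
    #include <sys/types.h>
    #include <sys/syscall.h>	/* SYS_read */
    #include <unistd.h>

    /* Hypothetical completion record posted by the kernel. */
    struct async_result {
            ssize_t ar_retval;	/* call's return value */
            int     ar_error;	/* errno-style status, 0 on success */
            int     ar_done;	/* completion flag */
    };

    /* Hypothetical primitives; neither exists today. */
    extern int async_submit(int code, void *args, struct async_result *ar);
    extern void uthread_block_until(struct async_result *ar);

    /*
     * Generic conversion wrapper: a standard call becomes an async
     * call plus a user thread context switch.
     */
    ssize_t
    read(int fd, void *buf, size_t nbytes)
    {
            struct { int fd; void *buf; size_t nbytes; } args =
                { fd, buf, nbytes };
            struct async_result ar = { 0, 0, 0 };

            if (async_submit(SYS_read, &args, &ar) != 0)
                    return (-1);	/* could not queue the call */
            while (!ar.ar_done)
                    uthread_block_until(&ar); /* run another user thread */
            if (ar.ar_error != 0) {
                    errno = ar.ar_error;
                    return (-1);
            }
            return (ar.ar_retval);
    }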
Another optimization, which is obvious after consideration, is to
modify trap entry for SMP architectures, and immediately return all
but the calls which never block.  The calls which never block are
treated as if they were static data in user space, and merely
referenced via accessor functions for convenience and abstraction.

Once the context is in the kernel, it is given to the scheduler,
where it competes for a quantum on whatever CPU becomes available
next.

This is a bit counterintuitive, in that we are attempting to reduce
scheduler context switch overhead.  However, it doesn't matter that
the quantum is given back to the system (or, alternately, given to
the threads scheduler in the process that made the request, based on
some administrative fiat), because these events are intrinsically
short duration.  In other words, they are expected to run to sleep,
then run to completion following a wakeup, both in less than a
quantum.

Since the sleep discontinuity is impossible to overcome, and the
completion discontinuity is also impossible to overcome, this is the
minimal implementation of "non-interrupt scheduled kernel
work-to-do".

The possibilities are really pretty great.  The architecture bypasses
most of the issues for SMP scalability, without taking a hit for
having ignored, rather than dealt with, the issues.  Additionally,
the architecture in the second optimization can, in fact, be used to
proxy the request in a distributed computing environment.

Obviously, looking at this architecture, one of my hot buttons is
massively parallel distributed systems with completion agents that
operate with incomplete knowledge.  I think these types of systems
are going to be THE Most Important(tm) in the long to medium term
future -- e.g., they are exactly what is needed for operating a fleet
of hundreds of thousands of nanometer scale semiautonomous machines
doing organ repair.

> > You might be able to get someone to commit this, at least as a
> > short term solution to the problem.  It would get you over the
> > "blessed" hump so you can concentrate on more pressing issues.
>
> Hummm... any guesses as to who might be most sympathetic?

Matt Dillon, David Greenman, Julian Elischer, and Peter Wemm all
spring immediately to mind.


					Terry Lambert
					terry@lambert.org
---
Any opinions in this posting are my own and not those of my present
or previous employers.