From owner-freebsd-hackers Tue May  9 18:48:13 1995
Return-Path: hackers-owner
Received: (from majordom@localhost) by freefall.cdrom.com (8.6.10/8.6.6) id SAA03548 for hackers-outgoing; Tue, 9 May 1995 18:48:13 -0700
Received: from cs.weber.edu (cs.weber.edu [137.190.16.16]) by freefall.cdrom.com (8.6.10/8.6.6) with SMTP id SAA03542 for ; Tue, 9 May 1995 18:48:11 -0700
Received: by cs.weber.edu (4.1/SMI-4.1.1) id AA20511; Tue, 9 May 95 19:41:28 MDT
From: terry@cs.weber.edu (Terry Lambert)
Message-Id: <9505100141.AA20511@cs.weber.edu>
Subject: Re: Apache + FreeBSD 2.0 benchmark results (fwd)
To: bakul@netcom.com (Bakul Shah)
Date: Tue, 9 May 95 19:41:27 MDT
Cc: hackers@FreeBSD.org
In-Reply-To: <199505100001.RAA29299@netcom14.netcom.com> from "Bakul Shah" at May 9, 95 05:01:38 pm
X-Mailer: ELM [version 2.4dev PL52]
Sender: hackers-owner@FreeBSD.org
Precedence: bulk

> > Nope; just generic CS-speak.
>
> `spawning' is approx. equal to fork + exec, not just fork.
> A system that provided spawn can do certain optimizations
> (with a loss of some flexibility).  I have never heard of
> pre-spawning.  Are you trying to, er.., spawn new words?

spawn-ahead, not "pre-spawn".  8-).

> > > What, have a specially-compiled kernel that can fork off httpd's
> > > in no time at all?
>
> FYI, this has been used in at least one system.  The kernel
> kept a number of server processes.  Under heavy load more
> processes were forked off, under light load extra idle
> processes were killed off.  A variation of `select' to do
> this may be something worth investigating.  Typically a
> server process does a select to wait for a request to
> arrive.  If you want to provide more threads of control
> dynamically, and these threads agree to cooperate, you would
> use this hypothetical `server-select'.  Once some parameters
> are set, the kernel would dynamically add or remove threads
> depending on traffic (but only in server-select).

This is called a "work to do" model.

Actually, the NetWare for UNIX (NWU) release that is upcoming or has
recently been released uses this model.  The actual dispatch is done to
the "hot" engine by a streams multiplexer called NEMUX (NetWare Engine
MUX) to save process context switch overhead on top of everything else.

Oracle on Sequent machines also uses "work to do", as does Native
NetWare (although Native NetWare is *not* SMP scalable, because it
can't handle the kernel preemption implicit in an MP environment -- in
effect, code reentrancy and cross-processor synchronization).

The main distinction of this model is that the server processes share
identical context; each incoming NCP request and its response from the
server is considered a "transaction" for the purposes of the model.

The main advantage of LIFO scheduling of engines is that it avoids the
process context switch overhead as much as possible -- the design is
SMP scalable on the basis of handling multiple "hot" engines.  The
point in doing this is that the write of the response packet and the
read for the next packet is a single operation, and on an active
server, it's possible for it to be non-blocking.  The result is that
"hot" processes run for their full quantum.

The disadvantage of user level spawn-ahead is that dispatch scheduling
(via pipes, UNIX domain sockets, or another IPC mechanism) requires two
context switches: one to run the dispatcher, one to run the dispatchee.
In addition, since there is a blocking operation called "wait for next
work element" (or equivalent) per operation, the processes never use
their entire quantum.
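For concreteness, here is a minimal user-level sketch of the spawn-ahead
idea.  This is not code from NWU, Apache, or anything else discussed
here; the pool size and the worker_main()/handle_request() names are
invented.  The parent pre-forks a fixed pool of workers before any
requests arrive, so the fork cost is paid ahead of demand:

/*
 * Spawn-ahead sketch: pre-forked worker pool, every worker blocking
 * in accept() on the shared listening socket.  Illustrative only.
 */
#include <sys/types.h>
#include <sys/socket.h>
#include <netinet/in.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <unistd.h>

#define NWORKERS 8              /* size of the spawn-ahead pool (invented) */
#define PORT     8080

static void handle_request(int fd)
{
    char buf[1024];
    ssize_t n = read(fd, buf, sizeof buf);      /* one request... */

    if (n > 0)
        (void)write(fd, buf, (size_t)n);        /* ...one response */
}

static void worker_main(int lsock)
{
    int fd;

    for (;;) {
        fd = accept(lsock, NULL, NULL);         /* every worker sleeps here */
        if (fd < 0)
            continue;
        handle_request(fd);
        close(fd);
    }
}

int main(void)
{
    struct sockaddr_in sin;
    int i, lsock;

    lsock = socket(AF_INET, SOCK_STREAM, 0);
    memset(&sin, 0, sizeof sin);
    sin.sin_family = AF_INET;
    sin.sin_addr.s_addr = htonl(INADDR_ANY);
    sin.sin_port = htons(PORT);
    if (lsock < 0 || bind(lsock, (struct sockaddr *)&sin, sizeof sin) < 0 ||
        listen(lsock, 64) < 0) {
        perror("socket/bind/listen");
        exit(1);
    }

    for (i = 0; i < NWORKERS; i++)              /* spawn ahead of demand */
        if (fork() == 0) {
            worker_main(lsock);                 /* children never return */
            _exit(0);
        }

    for (;;)
        pause();        /* a smarter parent would grow/shrink the pool here */
    /* NOTREACHED */
}

Every worker in this sketch sleeps in the same accept(); a dispatcher
variant would instead have the workers block on a semaphore or a pipe
and have one process hand the connections out.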
This is subject to the "thundering herd" problem of semaphore
acquisition unless you build your own semaphore system, get one built
for you by the OS vendor (i.e., Sequent), or are very, very careful
(and complex) in your dispatch code.

The SVR4/Solaris threading model is similarly broken; each kernel
thread exists to perform a single blocking operation, and the savings
over a separate-process implementation (beyond the avoided process
context switch) are minimal, because nothing eats its full quantum.
For something like a file or HTTP server, the net effect is needing as
many kernel threads as user space threads, to guarantee against all
kernel space threads being blocked while user space threads are waiting
only on kernel thread availability to run.

The trade-off on a 1:1 mapping of user/kernel threads vs. separate
processes is that the thread model shares the heap and the descriptor
table between threads.  To do this in a separate-process model, you
have to explicitly use shared memory for global data, and use a
mechanism like sfork (Sequent) or SFD (UnixWare -- say "Thank You
Terry", all you database nuts) to share the descriptor table between
processes.  On the down side, the thread model has problems with having
to statically pre-allocate each thread's stack, and potential load
balancing issues when there are more kernel threads than there are
processors and there is more than one processor.

> > If you don't convert the I/O requests, then you aren't really a
> > multithreaded server at all, since a blocking request in any thread
> > of control can block other threads of control that would otherwise
> > be runnable.
>
> Correct.  Select() in this context is suitable for servicing
> lots of `short term' requests, not long lasting ones.  So
> long requests should be handed to another thread.  But doing so
> in Unix eats up more time.

There is also the unaddressed issue of interleaved I/O.  Using this
model, you can not do a "team/ddd" style predictive I/O interleave to
shorten overall latency (in effect, team/ddd amortize a single latency
over the entire transaction, much in the same way a sliding window in
TCP or Zmodem can reduce latency).  You *will* get better performance
with async operations.

> > The other alternative is a Non-Blocking I/O Dispatch model, where you
> > guarantee that you will not attempt potentially blocking operations
> > in the context of a dispatched thread of control.  ...
> > ... Depending on what you do with
> > this model, you can actually end up with some very complex finite
> > state automatons to get the desired behaviour.
>
> Right.  I think `asynchronous' IO probably provides the best
> performance with a moderate increase in complexity.  Alas,
> we can't do that under Unix:-(  Any thoughts on a decent
> implementation of that?

An async I/O implementation is relatively simple; you can implement it
in one of two ways.

The first is via an alternate call gate or other flag mechanism that
would allow *all* blocking system calls to be made asynchronously; this
would be of much more general utility than the VMS AST facility, which
is limited to a subset of calls.  (The VMS native threading
implementation, MTS, uses this; I had to expand it somewhat to support
the process model in the Pathworks for VMS/NetWare code when we were
writing it, and the lack of the facility for some calls was a big
pain.)

The second is to use a VMS-limited-call approach, like SunOS did for
support of LWP (and which SVR4 adopted for some reason), using the
aioread, aiowrite, aiowait, and aiocancel system calls.
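As a rough illustration of that second approach, here is what those
SunOS-style calls look like in use.  The signatures are from memory of
aioread(3)/aiowait(3) and <sys/asynch.h> (link with -laio on Solaris);
check the man pages before trusting any of it:

/*
 * Queue a read without blocking, keep computing, then reap the result
 * with aiowait().  A user-space thread scheduler would call aiowait()
 * only when every thread has an operation outstanding.
 */
#include <sys/asynch.h>         /* aioread(), aiowait(), aio_result_t */
#include <sys/time.h>
#include <fcntl.h>
#include <stdio.h>
#include <unistd.h>

int main(void)
{
    char buf[8192];
    aio_result_t res;
    aio_result_t *done;
    int fd = open("/etc/motd", O_RDONLY);

    /* queue the read and keep running -- nothing blocks here */
    if (fd < 0 || aioread(fd, buf, sizeof buf, 0L, SEEK_SET, &res) < 0) {
        perror("open/aioread");
        return 1;
    }

    /* ... run other user space threads here ... */

    done = aiowait(NULL);       /* NULL timeout: sleep until a completion */
    if (done == &res && res.aio_return >= 0)
        printf("read %ld bytes asynchronously\n", (long)res.aio_return);
    close(fd);
    return 0;
}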
This sucks if you want to use, for instance, message queues or other
non-fd-based blocking operations.  You have to dedicate a separate
process and use a thread-safe IPC mechanism (like async reads on a
pipe) to convert the blocking operation into an fd operation that the
aio calls can understand.  The aiowait call is used by the scheduler
when all threads have outstanding blocking operations, while aiocancel
is generally reserved for signal delivery (another problem in an
LWP-style threads implementation) and process rundown.

The problem with a threading model composed entirely of sync operations
changed into async operations plus a context switch is that, while it
avoids the process context switch overhead, it has significantly fewer
quanta to divide between its threads.  This can increase overall
application latency, because it is easy for the server process to use
100% of its quantum under even moderate loading, with the result that
"less essential" tasks crowd the server out.  The SVR4/Solaris solution
to this would be to write a new scheduling class, assigning effectively
better-than-timesharing quanta priority to the process (a gross hack,
like the one they use to get move-mouse/wiggle-cursor behaviour out of
their X server in spite of VM/linker interaction problems caused by a
lack of working set limits per vnode or per process).

Of course, since an async I/O based implementation has a single kernel
scheduling entity ("process"), it is not SMP scalable.

Probably the *best* approach would involve a cooperative thread
scheduler that used async I/O to eat all of the quanta per kernel
scheduling entity (kernel thread as opposed to "process") bound to a
user space thread set, with multiple kernel scheduling entities for
competitive reasons relative to other processes on the system, and a
set of these "bound" to each processor (compute resource) on the
system.  This presumes that the server to run is either the most
important thing on the system, or that other processes of equal
importance are implemented using the same model.

Actually, it's disgusting the number of things that would get impacted
by even async I/O in a UP (uniprocessor) environment, simply because
the best timing granularity for event wakeup (like, oh, say, select or
itimer) would go down to the equivalent of the lbolt forced context
switch clock before the event was serviced.  You could probably bias
the scheduler using two-stage queue insertion, at the risk of having
processes scheduled as the result of a "time important event" starving
out those scheduled as a result of a "normal event" (like a disk buffer
being filled).  Then, of course, you've started down the primrose path
to kernel preemption.  8-).


					Terry Lambert
					terry@cs.weber.edu
---
Any opinions in this posting are my own and not those of my present
or previous employers.
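As an illustration of the "dedicate a separate process" trick described
above for non-fd blocking operations, here is a sketch in which a
helper process blocks in msgrcv() on a SysV message queue and relays
each message down a pipe, so the main loop can wait on the pipe
descriptor like any other.  The names are invented and none of this
comes from the systems discussed in the post:

/*
 * Convert a non-fd blocking operation (msgrcv) into an fd event by
 * parking a helper process on the queue and writing down a pipe.
 * Cleanup of the helper and the queue is omitted.
 */
#include <sys/types.h>
#include <sys/ipc.h>
#include <sys/msg.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <unistd.h>

struct qmsg {
    long mtype;
    char mtext[256];
};

int main(void)
{
    int pfd[2], qid;
    char buf[256];
    ssize_t n;
    struct qmsg hello;

    qid = msgget(IPC_PRIVATE, IPC_CREAT | 0600);
    if (qid < 0 || pipe(pfd) < 0) {
        perror("msgget/pipe");
        exit(1);
    }

    if (fork() == 0) {                          /* the dedicated helper */
        struct qmsg m;
        for (;;) {
            n = msgrcv(qid, &m, sizeof m.mtext, 0, 0);
            if (n >= 0)
                (void)write(pfd[1], m.mtext, (size_t)n);  /* fd event now */
        }
    }

    /* a real server would select()/aioread() on pfd[0] alongside its
       sockets; a plain read() stands in for that here */
    hello.mtype = 1;
    strcpy(hello.mtext, "hello");
    msgsnd(qid, &hello, 6, 0);
    n = read(pfd[0], buf, sizeof buf);
    printf("got %ld bytes via the pipe\n", (long)n);
    return 0;
}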