From owner-freebsd-hackers Tue May  9 18:48:13 1995
Return-Path: hackers-owner
Received: (from majordom@localhost) by freefall.cdrom.com (8.6.10/8.6.6) id SAA03548 for hackers-outgoing; Tue, 9 May 1995 18:48:13 -0700
Received: from cs.weber.edu (cs.weber.edu [137.190.16.16]) by freefall.cdrom.com (8.6.10/8.6.6) with SMTP id SAA03542 for ; Tue, 9 May 1995 18:48:11 -0700
Received: by cs.weber.edu (4.1/SMI-4.1.1) id AA20511; Tue, 9 May 95 19:41:28 MDT
From: terry@cs.weber.edu (Terry Lambert)
Message-Id: <9505100141.AA20511@cs.weber.edu>
Subject: Re: Apache + FreeBSD 2.0 benchmark results (fwd)
To: bakul@netcom.com (Bakul Shah)
Date: Tue, 9 May 95 19:41:27 MDT
Cc: hackers@FreeBSD.org
In-Reply-To: <199505100001.RAA29299@netcom14.netcom.com> from "Bakul Shah" at May 9, 95 05:01:38 pm
X-Mailer: ELM [version 2.4dev PL52]
Sender: hackers-owner@FreeBSD.org
Precedence: bulk

> > Nope; just generic CS-speak.
>
> `spawning' is approx. equal to fork + exec, not just fork.
> A system that provided spawn can do certain optimizations
> (with a loss of some flexibility).  I have never heard of
> pre-spawning.  Are you trying to, er.., spawn new words?

spawn-ahead, not "pre-spawn".  8-).

> > > What, have a specially-compiled kernel that can fork off httpd's
> > > in no time at all?
>
> FYI, this has been used in at least one system.  The kernel
> kept a number of server processes.  Under heavy load more
> processes were forked off, under light load extra idle
> processes were killed off.  A variation of `select' to do
> this may be something worth investigating.  Typically a
> server process does a select to wait for a request to
> arrive.  If you want to provide more threads of control
> dynamically, and these threads agree to cooperate, you would
> use this hypothetical `server-select'.  Once some parameters
> are set, the kernel would dynamically add or remove threads
> depending on traffic (but only in server-select).

This is called a "work to do" model.

Actually, the NetWare for UNIX (NWU) release that is upcoming or has
recently been released uses this model.  The actual dispatch is done to
the "hot" engine by a streams multiplexer called NEMUX (NetWare Engine
MUX) to save process context switch overhead on top of everything else.

Oracle on Sequent machines also uses "work to do", as does Native
NetWare (although Native NetWare is *not* SMP scalable, because it
can't handle the kernel preemption implicit in an MP environment -- in
effect, code reentrancy and cross-processor synchronization).

The main distinction of this model is that the server processes share
identical context; each incoming NCP request and its response from the
server is considered a "transaction" for the purposes of the model.

The main advantage of LIFO scheduling of engines is that it avoids the
process context switch overhead as much as possible -- the design is
SMP scalable on the basis of handling multiple "hot" engines.  The
point in doing this is that the write of the response packet and the
read for the next packet is a single operation, and on an active
server, it's possible for it to be non-blocking.  The result is that
"hot" processes run for their full quantum.

The disadvantage of user level spawn-ahead is that dispatch scheduling
(via pipes, UNIX domain sockets, or another IPC mechanism) requires two
context switches: one to run the dispatcher, one to run the dispatchee.
In addition, since there is a blocking operation called "wait for next
work element" (or equivalent) per operation, the processes never use
their entire quantum.
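For concreteness, here is a minimal user-level sketch of the spawn-ahead
idea.  This is not code from NWU, Apache, or anything else discussed
here; the pool size and the worker_main()/handle_request() names are
invented.  The parent pre-forks a fixed pool of workers before any
requests arrive, so the fork cost is paid ahead of demand:

/*
 * Spawn-ahead sketch: pre-forked worker pool, every worker blocking
 * in accept() on the shared listening socket.  Illustrative only.
 */
#include <sys/types.h>
#include <sys/socket.h>
#include <netinet/in.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <unistd.h>

#define NWORKERS 8              /* size of the spawn-ahead pool (invented) */
#define PORT     8080

static void handle_request(int fd)
{
    char buf[1024];
    ssize_t n = read(fd, buf, sizeof buf);      /* one request... */

    if (n > 0)
        (void)write(fd, buf, (size_t)n);        /* ...one response */
}

static void worker_main(int lsock)
{
    int fd;

    for (;;) {
        fd = accept(lsock, NULL, NULL);         /* every worker sleeps here */
        if (fd < 0)
            continue;
        handle_request(fd);
        close(fd);
    }
}

int main(void)
{
    struct sockaddr_in sin;
    int i, lsock;

    lsock = socket(AF_INET, SOCK_STREAM, 0);
    memset(&sin, 0, sizeof sin);
    sin.sin_family = AF_INET;
    sin.sin_addr.s_addr = htonl(INADDR_ANY);
    sin.sin_port = htons(PORT);
    if (lsock < 0 || bind(lsock, (struct sockaddr *)&sin, sizeof sin) < 0 ||
        listen(lsock, 64) < 0) {
        perror("socket/bind/listen");
        exit(1);
    }

    for (i = 0; i < NWORKERS; i++)              /* spawn ahead of demand */
        if (fork() == 0) {
            worker_main(lsock);                 /* children never return */
            _exit(0);
        }

    for (;;)
        pause();        /* a smarter parent would grow/shrink the pool here */
    /* NOTREACHED */
}

Every worker in this sketch sleeps in the same accept(); a dispatcher
variant would instead have the workers block on a semaphore or a pipe
and have one process hand the connections out.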
This is subject to the "thundering herd" problem of semaphore
acquisition unless you build your own semaphore system, get one built
for you by the OS vendor (i.e., Sequent), or are very, very careful
(and complex) in your dispatch code.

The SVR4/Solaris threading model is similarly broken; each kernel
thread exists to perform a single blocking operation, and the savings
over a separate-process implementation (beyond the avoided process
context switch) are minimal, because nothing eats its full quantum.
For something like a file or HTTP server, the net effect is needing as
many kernel threads as user space threads, to guarantee against all
kernel space threads being blocked while user space threads are waiting
only on kernel thread availability to run.

The trade-off on a 1:1 mapping of user/kernel threads vs. separate
processes is that the thread model shares the heap and the descriptor
table between threads.  To do this in a separate-process model, you
have to explicitly use shared memory for global data, and use a
mechanism like sfork (Sequent) or SFD (UnixWare -- say "Thank You
Terry", all you database nuts) to share the descriptor table between
processes.  On the down side, the thread model has problems with having
to statically pre-allocate each thread's stack, and potential load
balancing issues when there are more kernel threads than there are
processors and there is more than one processor.

> > If you don't convert the I/O requests, then you aren't really a
> > multithreaded server at all, since a blocking request in any thread
> > of control can block other threads of control that would otherwise
> > be runnable.
>
> Correct.  Select() in this context is suitable for servicing
> lots of `short term' requests, not long lasting ones.  So
> long requests should be handed to another thread.  But doing so
> in Unix eats up more time.

There is also the unaddressed issue of interleaved I/O.  Using this
model, you can not do a "team/ddd" style predictive I/O interleave to
shorten overall latency (in effect, team/ddd amortize a single latency
over the entire transaction, much in the same way a sliding window in
TCP or Zmodem can reduce latency).  You *will* get better performance
with async operations.

> > The other alternative is a Non-Blocking I/O Dispatch model, where you
> > guarantee that you will not attempt potentially blocking operations
> > in the context of a dispatched thread of control.  ...
> > ... Depending on what you do with
> > this model, you can actually end up with some very complex finite
> > state automatons to get the desired behaviour.
>
> Right.  I think `asynchronous' IO probably provides the best
> performance with a moderate increase in complexity.  Alas,
> we can't do that under Unix:-(  Any thoughts on a decent
> implementation of that?

An async I/O implementation is relatively simple; you can implement it
in one of two ways.

The first is via an alternate call gate or other flag mechanism that
would allow *all* blocking system calls to be made asynchronously; this
would be of much more general utility than the VMS AST facility, which
is limited to a subset of calls.  (The VMS native threading
implementation, MTS, uses this; I had to expand it somewhat to support
the process model in the Pathworks for VMS/NetWare code when we were
writing it, and the lack of the facility for some calls was a big
pain.)

The second is to use a VMS-limited-call approach, like SunOS did for
support of LWP (and which SVR4 adopted for some reason), using the
aioread, aiowrite, aiowait, and aiocancel system calls.
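As a rough illustration of that second approach, here is what those
SunOS-style calls look like in use.  The signatures are from memory of
aioread(3)/aiowait(3) and <sys/asynch.h> (link with -laio on Solaris);
check the man pages before trusting any of it:

/*
 * Queue a read without blocking, keep computing, then reap the result
 * with aiowait().  A user-space thread scheduler would call aiowait()
 * only when every thread has an operation outstanding.
 */
#include <sys/asynch.h>         /* aioread(), aiowait(), aio_result_t */
#include <sys/time.h>
#include <fcntl.h>
#include <stdio.h>
#include <unistd.h>

int main(void)
{
    char buf[8192];
    aio_result_t res;
    aio_result_t *done;
    int fd = open("/etc/motd", O_RDONLY);

    /* queue the read and keep running -- nothing blocks here */
    if (fd < 0 || aioread(fd, buf, sizeof buf, 0L, SEEK_SET, &res) < 0) {
        perror("open/aioread");
        return 1;
    }

    /* ... run other user space threads here ... */

    done = aiowait(NULL);       /* NULL timeout: sleep until a completion */
    if (done == &res && res.aio_return >= 0)
        printf("read %ld bytes asynchronously\n", (long)res.aio_return);
    close(fd);
    return 0;
}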
This sucks if you want to use, for instance, message queues or other
non-fd-based blocking operations.  You have to dedicate a separate
process and use a thread-safe IPC mechanism (like async reads on a
pipe) to convert the blocking operation into an fd operation that the
aio calls can understand.  The aiowait call is used by the scheduler
when all threads have outstanding blocking operations, while aiocancel
is generally reserved for signal delivery (another problem in an
LWP-style threads implementation) and process rundown.

The problem with a threading model composed entirely of sync operations
changed into async operations plus a context switch is that, while it
avoids the process context switch overhead, it has significantly fewer
quanta to divide between its threads.  This can increase overall
application latency, because it is easy for the server process to use
100% of its quantum under even moderate loading, with the result that
"less essential" tasks crowd the server out.  The SVR4/Solaris solution
to this would be to write a new scheduling class, assigning effectively
better-than-timesharing quanta priority to the process (a gross hack,
like the one they use to get move-mouse/wiggle-cursor behaviour out of
their X server in spite of VM/linker interaction problems caused by a
lack of working set limits per vnode or per process).

Of course, since an async I/O based implementation has a single kernel
scheduling entity ("process"), it is not SMP scalable.

Probably the *best* approach would involve a cooperative thread
scheduler that used async I/O to eat all of the quanta per kernel
scheduling entity (kernel thread as opposed to "process") bound to a
user space thread set, with multiple kernel scheduling entities for
competitive reasons relative to other processes on the system, and a
set of these "bound" to each processor (compute resource) on the
system.  This presumes that the server to run is either the most
important thing on the system, or that other processes of equal
importance are implemented using the same model.

Actually, it's disgusting the number of things that would get impacted
by even async I/O in a UP (uniprocessor) environment, simply because
the best timing granularity for event wakeup (like, oh, say, select or
itimer) would go down to the equivalent of the lbolt forced context
switch clock before the event was serviced.  You could probably bias
the scheduler using two-stage queue insertion, at the risk of having
processes scheduled as the result of a "time important event" starving
out those scheduled as a result of a "normal event" (like a disk buffer
being filled).  Then, of course, you've started down the primrose path
to kernel preemption.  8-).


					Terry Lambert
					terry@cs.weber.edu
---
Any opinions in this posting are my own and not those of my present
or previous employers.
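As an illustration of the "dedicate a separate process" trick described
above for non-fd blocking operations, here is a sketch in which a
helper process blocks in msgrcv() on a SysV message queue and relays
each message down a pipe, so the main loop can wait on the pipe
descriptor like any other.  The names are invented and none of this
comes from the systems discussed in the post:

/*
 * Convert a non-fd blocking operation (msgrcv) into an fd event by
 * parking a helper process on the queue and writing down a pipe.
 * Cleanup of the helper and the queue is omitted.
 */
#include <sys/types.h>
#include <sys/ipc.h>
#include <sys/msg.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <unistd.h>

struct qmsg {
    long mtype;
    char mtext[256];
};

int main(void)
{
    int pfd[2], qid;
    char buf[256];
    ssize_t n;
    struct qmsg hello;

    qid = msgget(IPC_PRIVATE, IPC_CREAT | 0600);
    if (qid < 0 || pipe(pfd) < 0) {
        perror("msgget/pipe");
        exit(1);
    }

    if (fork() == 0) {                          /* the dedicated helper */
        struct qmsg m;
        for (;;) {
            n = msgrcv(qid, &m, sizeof m.mtext, 0, 0);
            if (n >= 0)
                (void)write(pfd[1], m.mtext, (size_t)n);  /* fd event now */
        }
    }

    /* a real server would select()/aioread() on pfd[0] alongside its
       sockets; a plain read() stands in for that here */
    hello.mtype = 1;
    strcpy(hello.mtext, "hello");
    msgsnd(qid, &hello, 6, 0);
    n = read(pfd[0], buf, sizeof buf);
    printf("got %ld bytes via the pipe\n", (long)n);
    return 0;
}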