Date: Mon, 21 Apr 2003 23:51:23 -0700
From: Terry Lambert <tlambert2@mindspring.com>
To: Jeff Roberson <jroberson@chesapeake.net>
Cc: Daniel Eischen <eischen@pcnet1.pcnet.com>
Subject: Re: libkse -> libpthreads
Message-ID: <3EA4E66B.52980656@mindspring.com>
References: <20030422004950.R76635-100000@mail.chesapeake.net>

Jeff Roberson wrote:
> On Mon, 21 Apr 2003, Terry Lambert wrote:
> > It wouldn't. The main issue as far as performance went, and why
> > we (Novell USG) used processes instead of SVR4 threads, and did
> > file descriptor table sharing, and shared client context data in
> > a shared memory segment (8-)) is that SVR4-derived systems without
> > a unified VM and buffer cache do a lot of page thrashing.
>
> Please explain how using processes instead of threads improves page
> thrashing.

SVR4.0.2 (Dell UNIX) and SVR4.2 (UnixWare 2.x) have a separate VM and
buffer cache. Because of this, you tend to get page thrashing under
any overload condition, even for nominally shared code pages, if you
are doing a lot of data page work.

The problem is most easily seen in UnixWare 1.x, prior to the
introduction of the "fixed" scheduling class. In order to put memory
pressure on the system, in a UI-visible way, run X Windows, and then
perform a compilation on a large project. When the ld program is run,
it will mmap() all of the .o files, and then randomly access them in
quick succession, in order to perform symbol resolution for the large
project.

When this happens, the UI will "lock up", and you will effectively
lose the ability to move the mouse. As you attempt to move the mouse,
the pointer will not move; the attempt triggers paging in of the X
server, and then paging in of the application, both of whose code
pages were forced out of core (and will be forced back out of core
again, immediately) by ld's access to data pages. So once every one
or two seconds or so, the pointer will move, generate expose events,
and lock up again. The net effect is that the system appears to lock
up, either completely, or for multiple seconds at a time.

The UnixWare 2.x/SVR4.2 solution to this problem is to introduce a
"fixed" scheduling class, so that a fixed percentage of the CPU time
is dedicated to processing the X server code.
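
The lock-up described above can be reproduced in a toy model. The
sketch below is only an illustration under loud assumptions: it uses
a single pool of pages with plain LRU replacement, which is not the
SVR4 page replacement policy, and the function name and all of the
numbers (20 UI code pages, 50 data touches per UI event, 10,000 data
pages) are hypothetical.

```python
from collections import OrderedDict
import random


def count_ui_faults(core_pages, scanner_running, steps=2000, seed=1):
    """Count page faults taken by a small 'UI' working set in an LRU pool."""
    rng = random.Random(seed)
    core = OrderedDict()  # resident pages, least recently used first
    ui_faults = 0

    def touch(page):
        """Reference a page; return True if it had to be paged in."""
        if page in core:
            core.move_to_end(page)  # now the most recently used page
            return False
        if len(core) >= core_pages:
            core.popitem(last=False)  # evict the least recently used page
        core[page] = True
        return True

    for step in range(steps):
        if scanner_running:
            # Hypothetical linker-like job: 50 random touches over
            # 10,000 data pages between consecutive UI events.
            for _ in range(50):
                touch(("data", rng.randrange(10_000)))
        # One UI event touches one of 20 code pages, round-robin.
        if touch(("ui", step % 20)):
            ui_faults += 1
    return ui_faults
```

With 100 pages of core, count_ui_faults(100, False) takes only the 20
cold faults needed to load the UI pages. With the scanner running,
count_ui_faults(100, True) faults on essentially every UI event,
because each UI page is pushed out by data pages before it is touched
again.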

The "fixed" class doesn't stop the X server from being paged out, but
it does provide a fixed amount of CPU to spend paging it back in, and
then doing some processing on top of that (basically, I/O is accounted
on a per process basis). This was basically a lazy way of introducing
a "precious" working set low watermark for the X server pages. It
would have been much better to establish a per-file quota for the .o
files themselves, so ld might thrash, but the only program that would
get hurt by it would be ld.

Now consider the specific case we were dealing with, which is the
NetWare for UNIX (formerly Portable NetWare) problem. In this case,
suppose we were to use the SVR4/Solaris threading (this was after the
merge of the Solaris and SVR4.2 code bases, as part of a joint project
between Sun and USL, in which Sun got VM and FS code, and USL got the
threads and some other code, in trade).

The implementation paradigm for this code was as "anonymous work to do
engines" -- essentially, the server consisted of a number of specific
tasks (an intention mode transaction based long manager, a monitoring
daemon, some miscellaneous tasks), and a number of identical tasks
which implemented "work to do engines" -- all the latter tasks were
identical, in that the client context for any client session was known
to all of them. As a result, any of these tasks could service any
request.

Since the NCP packets are, with the sole exception of delayed lock
grants, which are reported async via a covert channel, request/response
in architecture, the number of concurrent client requests is limited by
the number of tasks that are available to service the requests. Our
intent was to be able to service a large number of clients.
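
As a rough illustration of the "work to do engine" arrangement just
described, the sketch below models a pool of interchangeable engines.
It is not the NetWare for UNIX code; the class and method names are
hypothetical, and the order in which idle engines are chosen is
deliberately not modelled here.

```python
from collections import deque


class EnginePool:
    """Identical engines; any idle engine may service any request."""

    def __init__(self, n_engines):
        self.idle = list(range(n_engines))  # engines with no work to do
        self.backlog = deque()              # requests waiting for an engine
        self.in_service = {}                # engine id -> request

    def submit(self, request):
        """Hand the request to an idle engine, or queue it if none is free."""
        if self.idle:
            engine = self.idle.pop()
            self.in_service[engine] = request
            return engine
        self.backlog.append(request)
        return None

    def complete(self, engine):
        """The engine has sent its response; give it the next request, if any."""
        del self.in_service[engine]
        if self.backlog:
            self.in_service[engine] = self.backlog.popleft()
        else:
            self.idle.append(engine)
```

Because each request occupies one engine until its response is sent,
the number of requests in service can never exceed the number of
engines; anything beyond that waits in the backlog.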

Now consider that, while maximum concurrency was an issue, so was
locality of file data sets, and locality of code pages, with the two
contending for the limited memory that was divided between the VM and
the buffer cache (effectively, there was a total set size, with a
reserve held back for each type of page, and the remaining pages were
contended).

Use of the "fixed" scheduling class was not an option. Using threads
would not allow preferential scheduling between the tasks, neither
would it have allowed sharing of all client context (though it would
have allowed descriptor sharing) without some form of marshalling and
locking. This is because a client that did not believe the server was
responding "fast enough" would repeat the request. It was necessary
to respond to these clients with a "server busy" message. The reason
behind this is that IPX is an unreliable datagram protocol, like UDP,
and does not have a retry mechanism built into it.

The upshot of this is that, with threads, the per process working set
would be very large, and would be fragmented across the process address
space. This increased contention well above what a process could
withstand without forcing VM pressure on the buffer cache. But the
reason for the existence of the software was a *file* server, so this
was unacceptable.

By separating the address spaces, this pressure was reduced, and the
amount of overall contention was reduced, thus reducing the buffer
cache pressure from the processes. In the limit, with all processes
fully utilized (i.e. a request backlog at the stream MUX), it equalled
out in performance. In the common case, however, not all tasks were
utilized all the time, and it was possible to allow them to be paged
out.

On top of this, there were a number of speed benefits to System V
shared memory for the client contexts; if you have read "The Magic
Garden Explained", these should be pretty obvious.
John Dyson made a number of similar changes to the FreeBSD
implementation for Oracle Corporation, when Oracle was using FreeBSD
as the basis of its "Network Computer" server. Basically, the pages
are VM pages only, with no write-through to the backing store; in the
SVR4 case, this would have been buffer cache pages, backed by swap, if
this were anonymous memory instead (the kind you get in a threads
heap).

Thus we come to part 2, which is that we modified the streams MUX to
ensure that requests were assigned to engines, as they entered the
stream MUX FD with a write+read request, in LIFO rather than FIFO
order. By doing this, we were able to ensure that, most likely, the
pages which were going to be requested were "in core" in the process
making the request (performing default FIFO ordering would have
resulted in a guarantee that the pages were not in core). I dubbed
this approach "hot engine scheduling".

Had we attempted a similar approach in the threads case, then, besides
the completely fragmented process memory that caused a much larger
number of pages to need to be resident to do the same work, the MUX
assignment of "work to do" would in fact have been effectively
"random". Anything less than total utilization of the system was
*worse* with random allocation of work units, and *better* with LIFO
allocation.

And that's why using processes instead of threads resulted in less
page thrashing.

There were, of course, other reasons for using processes instead of
threads, the primary among which was "better quantum utilization"
(SVR4.2/UnixWare 2.0 did not support thread group affinity in the
scheduler; as you have discovered, supporting that is NP-hard, unless
you get tricky, and make migration explicit and initial selection
intentional).

-- Terry
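
The difference between LIFO and FIFO engine assignment described in
the message above can be seen in a small user-space model. This is a
sketch only, not the STREAMS MUX code: the function name, the engine
count, and the fixed level of concurrency are all hypothetical, and a
request is assumed to complete whenever a new one arrives at the
concurrency limit.

```python
from collections import deque


def engines_touched(policy, n_engines=8, n_requests=100, concurrency=2):
    """Return the set of engines that serviced at least one request."""
    idle = deque(range(n_engines))  # engines waiting for work
    busy = deque()                  # engines servicing requests, oldest first
    used = set()
    for _ in range(n_requests):
        if len(busy) == concurrency:
            # The oldest outstanding request completes and its engine
            # rejoins the tail of the idle list.
            idle.append(busy.popleft())
        # LIFO hands work to the most recently idled engine; FIFO hands
        # it to the engine that has been idle the longest.
        engine = idle.pop() if policy == "lifo" else idle.popleft()
        busy.append(engine)
        used.add(engine)
    return used
```

With the defaults, engines_touched("lifo") keeps reusing the same two
engines, so only their pages need to stay in core and the other six
can remain paged out. engines_touched("fifo") cycles through all
eight, so each engine has been idle as long as possible by the time
it is given work again.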