From: Terry Lambert <tlambert2@mindspring.com>
Date: Fri, 20 Sep 2002 11:03:07 -0700
To: Daniel Eischen
Cc: Bill Huey, freebsd-arch@FreeBSD.ORG
Subject: Re: New Linux threading model

Daniel Eischen wrote:
> On Thu, 19 Sep 2002, Bill Huey wrote:
> > http://marc.theaimsgroup.com/?l=linux-kernel&m=103248252713576&w=2
> > http://people.redhat.com/drepper/nptl-design.pdf
>
> I read some of this and some of it is exactly opposite of why
> scheduler activations was made in the first place.  They are
> pushing all scheduling decisions and locking in to the kernel.
> One of the points of scheduler activations is that the library
> can make all scheduling decisions without need for having
> the kernel involved.

Please take these comments in the context of "what this paper means to FreeBSD", rather than as a criticism of the Linux implementation; I'll point out what I think are fallacies when I see them, but the main emphasis has to be "what does this mean to freebsd-arch"...

I find it incredibly amusing that, among their ten listed goals for the system, the original design goal that made threads widespread in the first place -- that there would be less *runtime* overhead in using threads instead of processes -- seems to have been lost.  Threads now serve other goals, but hardware scaling and NUMA support didn't even make the top five.  I also fail to see how they have addressed the NUMA goal, except tangentially, through the elimination of the "helper thread".

FreeBSD must first consider that its "top ten goals" for a threads implementation are different from those of Linux, and see the Linux design in that light.

Here are my comments; they are in order against the content of the paper, so it should be easy to read them linearly with the paper in hand.

The argument that collaborative scheduling is a reason to abandon an N:M model for a 1:1 model is, I think, specious.  It assumes certain things about the kernel implementation of kernel thread scheduling.  Specifically, it tries to duck the thread group affinity issue and to simplify the CPU affinity issue.  Perhaps this is the reason that the original goal of reduced runtime overhead vis-a-vis processes has been abandoned: trying to achieve it with explicit code in the scheduler is an NP-complete problem.

The thread group affinity can be addressed by the use of scheduler activations; but the CPU affinity for kernel threads -- and the CPU *negaffinity* for kernel threads which are members of the same thread group -- cannot really be adequately addressed in the context of intentional scheduling and CPU migration.  In other words, they cannot be addressed in the kernel scheduler; they *must* be addressed from a data-flow perspective (hence my occasional advocacy of per-CPU scheduling queues, with queue migration being a relatively rare, push-based event, rather than a relatively frequent, pull-based event).

By going to 1:1, they dodge this issue; but in so doing, they increase the overhead of threads to that of processes: threads become a mechanism for sharing some resources, the moral equivalent of the old rfork()/sfork() system calls for heap and descriptor space sharing, with separate stacks.
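To make that equivalence concrete, here is a minimal sketch (mine, not the paper's) of a "thread" built on FreeBSD's rfork_thread(3) wrapper: the child shares the address space and descriptor table via RFMEM, but runs on its own stack.  Error handling is omitted, and the stack setup assumes a downward-growing stack (i386):

/*
 * Sketch only: a "thread" with rfork(2) sharing semantics, using the
 * rfork_thread(3) wrapper so the child gets its own stack.
 */
#include <sys/types.h>
#include <sys/wait.h>
#include <stdio.h>
#include <stdlib.h>
#include <unistd.h>

#define THREAD_STACK_SIZE	(64 * 1024)

static int shared_counter;		/* visible to both sides, due to RFMEM */

static int
child_main(void *arg)
{
	shared_counter++;		/* writes the parent's memory directly */
	return (0);
}

int
main(void)
{
	char *stack = malloc(THREAD_STACK_SIZE);
	pid_t pid;

	/* Stacks grow down, so pass the top of the allocation. */
	pid = rfork_thread(RFPROC | RFMEM, stack + THREAD_STACK_SIZE,
	    child_main, NULL);
	waitpid(pid, NULL, 0);
	printf("shared_counter = %d\n", shared_counter);
	free(stack);
	return (0);
}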
The procfs issue is really a red herring; specifically, it should have been possible to change the procfs interface so that looking up one of 100,000 PIDs took on the order of log2(100,000)+1 operations.

The kernel-space overhead that they describe for the user-space threads scheduler context switch is predicated on the idea of a context switch involving a segment register, %FS or %GS, which cannot be set from user space, being a protected resource, and so requires a kernel call in order to provide access for the chosen implementation of thread-local storage.

I am not personally familiar with the FreeBSD scontext(2) call that has been discussed as an implementation detail in some of the recent postings by Daniel Eischen, in which he implied that Jonathan Mini will be implementing the call.  Perhaps I should have been watching this more closely, so that I could comment more intelligently on what it's supposed to do for the FreeBSD implementation that cannot be done without it: the Linux people may have a valid argument here which FreeBSD should heed.  However... such a call is necessary to deal with register windows on SPARC processors in any case, and the point is moot given that trading a user context switch plus a kernel call for a kernel context switch hardly reduces overhead.

I think the "the Linux scheduler is O(1), so we don't have to care" argument is invalid; O(1) means that each scheduling decision costs the same no matter how many entries there are, not that scheduling costs nothing overall.  As the number of schedulable entities goes up, the total time spent scheduling still goes up linearly, as opposed to going up exponentially; or, better, to *not* going up in the first place.

The signal handling argument is rather specious.  The first thing that should be pointed out in this regard is that the overhead of funneling signal processing through a single thread presumes an implementation with a monitor thread, which is not necessarily a requirement.  But even granting that argument for a naive implementation: in a threaded application, signals are exceptional conditions.  They are not intended to be used as IPC mechanisms, and, in fact, the POSIX definition of the interaction of signals and threads grants sufficient latitude that they are not in fact useful for this purpose in portable software.  Instead, POSIX defines a large number of new facilities to be used for IPC in a threaded environment, such that signals are not necessary.
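For what it's worth, a minimal sketch of the kind of facility meant (plain pthreads, not anything from either implementation): a condition variable doing the wakeup that might otherwise be pressed out of kill()/SIGUSR1.

/* Inter-thread wakeup via a condition variable instead of a signal. */
#include <pthread.h>
#include <stdio.h>

static pthread_mutex_t lock = PTHREAD_MUTEX_INITIALIZER;
static pthread_cond_t  wake = PTHREAD_COND_INITIALIZER;
static int work_ready = 0;

static void *
worker(void *arg)
{
	pthread_mutex_lock(&lock);
	while (!work_ready)		/* no signal handler, no races */
		pthread_cond_wait(&wake, &lock);
	pthread_mutex_unlock(&lock);
	printf("worker: woken up\n");
	return (NULL);
}

int
main(void)
{
	pthread_t tid;

	pthread_create(&tid, NULL, worker, NULL);

	pthread_mutex_lock(&lock);
	work_ready = 1;			/* the "event" a signal would carry */
	pthread_cond_signal(&wake);
	pthread_mutex_unlock(&lock);

	pthread_join(tid, NULL);
	return (0);
}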
The second thing that should be pointed out is that delivery of signals is to a particular *kernel* thread.  While this means delivery to user space, it does not mean delivery to a specific user space thread, which might then become a bottleneck; rather, it means delivery in order to communicate the signal to the user space scheduler, from which it may then be appropriately dispatched.  In the N:M case, if N >> M, the relative overhead may approach that of funneling everything through a single thread; but the common implementation is to have M roughly equal to the total number of simultaneous blocking contexts, plus one.  So while it's true that N != M, it *isn't* true that N ~= M, and it isn't true that N >> M.

I think that FreeBSD can learn from the Linux analysis of the signal handling, but that it should *not* copy the implementation detail of the 1:1 relationship between kernel and user threads.  The paper outlines almost all of the important issues in signal handling, but that does not necessarily mean that the solution chosen is the most appropriate solution for FreeBSD.

The "helper thread" is a concept which first appeared in the SVR4-derived Solaris ~2.4 era, and was echoed into the SVR4.2 threads implementation.  The need for this thread is an artifact of a limited kernel upcall capability, combined with the need for user space to execute code on behalf of kernel events.  I agree that a "helper thread" can be a significant bottleneck; but a design need not use a helper thread.  The main purpose of such a thread is to intermediate in the creation and completion of other threads.  The reason this is necessary is that the kernel is permitted to block operations in times of low resources, and one of the operations so blocked may be the creation of new threads, etc.

The problem this is solving is the lack of an asynchronous means of posting a request to the kernel, yet continuing to run in user space, despite the fact that the request posted may not complete immediately.  In other words, it is a call conversion mechanism intended to deal with the fact that certain calls are blocking calls which have no asynchronous counterparts for use in call conversion.

NB: For the uninitiated: call conversion is a means of implementing a threads system by trading a blocking call for a non-blocking call plus a context switch; obviously, this does not work if there is no non-blocking equivalent for the blocking call in question.  FreeBSD recently had to deal with this issue with the sendfile(2) system call, which has poor semantics relative to socket descriptors that have been marked non-blocking.

Alternative ways of dealing with this problem are an asynchronous call gate (which is the most general possible mechanism, but needs a new system call entry point) and scheduler activations.  In other words, getting rid of the "helper thread" does not require adopting the 1:1 model, as the paper implies.  However, the paper *also* implies that it is necessary to go to this model to avoid the serialization (though it does not come out and say so explicitly); this implication is incorrect.

Editorial Note: I will note for the record that FreeBSD is now considering an alternate system call entry point for future work, to permit the ABI to change over time for the 64 bit conversion, so that the prior ABI can be dropped at some point in the future.  The desire to *NOT do this* was what led to the choice of scheduler activations instead of an async call gate mechanism in FreeBSD's KSE implementation.

I would *STRONGLY* recommend that, should a new ABI system call entry point come to pass, it include an additional initial parameter for an async call context which, if NULL, is treated as a request for the call to be synchronous.
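To be concrete about what I mean, here is a purely hypothetical user-space mock-up of that calling convention; none of these names exist anywhere, they are invented for illustration, and a real implementation would live behind the new entry point rather than in a library:

/*
 * Hypothetical only: a read(2)-like call taking an async call context.
 * A NULL context degenerates to the ordinary synchronous call; a non-NULL
 * context would, in a real implementation, queue the request and return
 * immediately.  The mock just completes it inline.
 */
#include <stddef.h>
#include <unistd.h>

struct async_call_ctx {
	volatile int	acc_done;	/* set when the request completes */
	ssize_t		acc_result;	/* return value of the completed call */
};

static ssize_t
async_read(struct async_call_ctx *ctx, int fd, void *buf, size_t len)
{
	if (ctx == NULL)		/* NULL context: plain synchronous call */
		return (read(fd, buf, len));

	ctx->acc_result = read(fd, buf, len);	/* mock: complete inline */
	ctx->acc_done = 1;
	return (0);
}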
In section 5.4, they argue traversal overhead.  They argue that the list of threads must be traversed by the "helper thread" when a process exits, and that having the kernel do this "if process exist" [SIC - should be "exits"] is somehow less trouble -- but a traversal is a traversal.  Eventually, they conclude that the list must remain for fork(2) handling in any case.  FreeBSD could learn from a number of optimizations in this area; I personally do not like the generation count approach, and would want to use the space for a reverse pointer list linkage instead (as one example), but the point about where the overhead lies is a valid one.

The "futex" implementation is a good idea.  I have long been an advocate against the use of recursive locking, because, IMO, it makes programmers lazy and leads to extra locking which should be avoided.  It also has significantly higher overhead than non-reentrant versions.  I dislike locking overhead, particularly when it's gratuitous (IMO).  FreeBSD could learn a lot here.

I like the DTV optimization in the memory allocation arena.  I've advocated a similar approach in the VFS, where the vnode and the per-VFS data structure would be allocated at the same time, with the vnode ownership belonging to the VFS, rather than being a global system resource contended among all VFSs.

The freelist on deallocation is an obvious optimization.  They imply that a high watermark facility for the freelist is needed, but don't say whether or not it is actually implemented.  IMO, this is mostly an issue for test programs, and for changing load characteristics on a single machine (e.g. an HTTP server that gets hit very hard, starts 100,000 threads, doesn't free the memory back to the system, and is then expected to act as a database server instead, needing to reassign resources -- particularly memory -- to the new use).  In practice, this is probably not an issue, but it's probably a nice benchmark optimization that isn't otherwise harmful to performance on real loads, so it's something to consider.

As for the kernel improvements, FreeBSD could learn optimizations from them.  I don't think this is a critical area for 5.0, but it's probably something that needs to be addressed for 5.X, going forward.  The forking optimizations are particularly salient, I think.

The signal handling discussion is, I think, self-serving: it's a strawman that justifies the design decisions they've already made.  As I pointed out already, signals are not the preferred IPC mechanism, in any sense, in the POSIX threads environment.  I will go further: I claim that an interactive program which must deal with STOP/CONT will in fact almost never have more than 5-9 threads, based on the fact that it's interacting with a human who (on average) can only keep 5-9 things in short-term memory at a time... contexts above and beyond that in an interactive program are simply bad design, and even if we increase that number to 20, it's not a significant source of overhead.

Most of the other changes are what I would class as housekeeping tasks.  One exception is the use of "futex" wakeup in order to improve thread joins: FreeBSD should look closely at this.
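For those who haven't looked at futexes yet, a bare toy sketch of the Linux futex(2) wait/wake pattern involved (my own example, not NPTL code; the real join waits on the exiting thread's TID word, which the kernel clears and wakes): one thread sleeps in the kernel on a word, another changes the word and wakes it.

/* Toy futex wait/wake between two threads (Linux only). */
#define _GNU_SOURCE
#include <linux/futex.h>
#include <sys/syscall.h>
#include <pthread.h>
#include <stdio.h>
#include <unistd.h>

static volatile int flag = 0;		/* the word both sides agree on */

static long
futex(volatile int *uaddr, int op, int val)
{
	return (syscall(SYS_futex, uaddr, op, val, NULL, NULL, 0));
}

static void *
waiter(void *arg)
{
	/* Sleep in the kernel only while the word still holds the old value. */
	while (flag == 0)
		futex(&flag, FUTEX_WAIT, 0);
	printf("waiter: released\n");
	return (NULL);
}

int
main(void)
{
	pthread_t tid;

	pthread_create(&tid, NULL, waiter, NULL);
	sleep(1);			/* crude: give the waiter time to block */
	flag = 1;
	futex(&flag, FUTEX_WAKE, 1);	/* wake at most one waiter */
	pthread_join(tid, NULL);
	return (0);
}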
All in all, quite the interesting paper.

-- 
Terry