Date: Wed, 16 Dec 1998 03:04:31 +0000 (GMT)
From: Terry Lambert <tlambert@primenet.com>
To: vanmaren@fast.cs.utah.edu (Kevin Van Maren)
Cc: smp@FreeBSD.ORG
Subject: Re: Pthreads and SMP
Message-ID: <199812160304.UAA13431@usr05.primenet.com>
In-Reply-To: <199812151632.JAA26636@fast.cs.utah.edu> from "Kevin Van Maren" at Dec 15, 98 09:32:51 am

> That aside, it is most certainly desirable to be able to run
> multiple threads in parallel.  The extent to which user threads
> are mapped onto processors is best controlled by some provided
> mechanism (such as pthread_setconcurrency and pthread_getconcurrency)
> rather than an inflexible policy such as "I believe it may be
> slow to run multiple threads at the same time".

These two interfaces are optional in a conforming pthreads
implementation.

> As for Terry's beef about the page table, I don't know how often
> a typical app gets its page table updated, but I wouldn't think
> that would be common except when a) you are paging and other
> performance penalties are likely to be in the noise or b) more
> memory is being allocated/accessed by the process.  It is only
> necessary to do a TLB-shootdown when restricting the mappings.
> It isn't a problem if a processor takes a trap because its TLB
> was out of date and the page is really valid; it simply loads
> the new info and continues.

The problem is write faults on pages that are in the page tables
of both processors, because there is only one page table for all
of the threads in a single process.  This means that you need to
invalidate or update cache contents on CPUs that are, in fact, not
*using* the cache contents and couldn't care less.

This typically happens in copy on write faults and on stack growth
when you hit a guard page, especially if you are passing the
addresses of auto variables between threads.

It's my personal experience that people use threads because they
don't know how to program effectively, and using threads efficiently
requires a lot of safety harness code in the OS for these people to
actually gain the benefit they think they will gain.

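For concreteness, the auto variable case mentioned above looks
roughly like the following.  This is only a minimal sketch: the
names fill_result and run_shared_stack_demo are made up for
illustration, and nothing here measures the cost, it just shows one
thread writing into a page of another thread's stack.

```c
#include <assert.h>
#include <pthread.h>
#include <stddef.h>

/* Worker: writes through a pointer into the *creator's* stack frame. */
static void *fill_result(void *arg)
{
    int *out = arg;          /* address of an auto variable in another thread */

    assert(out != NULL);
    *out = 42;               /* the write lands on a page of the creator's stack */
    return NULL;
}

/*
 * Hypothetical helper (not from the original post): starts one worker,
 * hands it the address of a local, and returns what the worker stored.
 * Returns -1 if the thread could not be created or joined.
 */
int run_shared_stack_demo(void)
{
    int result = 0;          /* auto variable on this thread's stack */
    pthread_t tid;

    if (pthread_create(&tid, NULL, fill_result, &result) != 0)
        return -1;

    /* Join before 'result' goes out of scope, or the pointer dangles. */
    if (pthread_join(tid, NULL) != 0)
        return -1;

    return result;
}
```

If the two threads happen to run on different CPUs, the stack page
holding 'result' is touched by both of them, which is the sharing
pattern described above.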
That said, all I was trying to point out is that there are
constraints on the efficiency of kernel threads that no one has
addressed up to this point except Sun Microsystems, and even then,
I think they screwed up the quantum model pretty badly (if the
scheduler gives me a quantum, it's *my* damn quantum, and if the
scheduler will take it away from me for making a blocking system
call, then *screw* the scheduler, I won't make blocking calls).

The name of the game is to minimize context switch overhead.

> I believe we should add the necessary mechanisms to run threads
> in parallel and THEN look at the actual performance problems
> and address them.

These mechanisms already exist, as has been pointed out countless
times on this list.  They just aren't packaged up with their glue
code in a nice "pthread_create" routine some place, because doing
that without further kernel support would result in abysmal
performance, and would, in fact, be counter-productive.

The main reasons for the poor performance are the issues I've
outlined here.

You can very easily go to the -current list archives and search for
"John Dyson" and get a copy of the glue code.  Or you could directly
ask jmb@freebsd.org for the code.

> If that means the scheduler needs to be improved, fine, we improve
> the scheduler.  If some applications run slower on multiple
> processors, we just have them call pthread_setconcurrency(1).  Shoot,
> Terry can make the default to be 1 on his machines.  Personally, I
> would like to be able to use pthread_create() instead of fork() to
> handle computation-bound requests.

Then feel free to integrate John's vfork based kernel threading
into your libc_r, and to add the appropriate pthread_setconcurrency()
functions to bring the implementation up to some documented standard,
instead of the limbo between Draft 4 and Draft 10 where it currently
lives.

There's really nothing stopping you from using the code; it was
posted to the list.

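As a rough illustration of the usage being asked for in the quoted
text (pthread_create() for computation-bound work, with
pthread_setconcurrency() as the knob), here is a minimal sketch.
The names parallel_sum, sum_slice, struct slice and NWORKERS are
invented for the example.  It assumes a system that exposes the XSI
pthread_setconcurrency() interface; the call is treated purely as a
hint, since an implementation may ignore or refuse the requested
level.

```c
#define _XOPEN_SOURCE 700    /* ask for the XSI pthread_setconcurrency() prototype */
#include <assert.h>
#include <pthread.h>
#include <stddef.h>

#define NWORKERS 4           /* illustrative worker count, not from the post */

struct slice {
    const int *data;         /* first element of this worker's range */
    size_t len;              /* number of elements in the range */
    long sum;                /* filled in by the worker */
};

/* Worker: compute-bound loop over its own slice of the input. */
static void *sum_slice(void *arg)
{
    struct slice *s = arg;

    s->sum = 0;
    for (size_t i = 0; i < s->len; i++)
        s->sum += s->data[i];
    return NULL;
}

/*
 * Hypothetical helper: sums 'n' ints using NWORKERS threads.
 * 'level' is handed to pthread_setconcurrency() as a hint only.
 * Returns 0 on success, -1 if a thread could not be created or joined.
 */
int parallel_sum(const int *data, size_t n, int level, long *total)
{
    pthread_t tid[NWORKERS];
    struct slice part[NWORKERS];
    size_t chunk = n / NWORKERS;
    int started = 0;
    int rc = 0;

    assert(data != NULL && total != NULL);

    (void)pthread_setconcurrency(level);   /* may be ignored or refused */

    for (int i = 0; i < NWORKERS; i++) {
        part[i].data = data + (size_t)i * chunk;
        /* The last worker also takes the remainder. */
        part[i].len = (i == NWORKERS - 1) ? n - (size_t)i * chunk : chunk;
        part[i].sum = 0;
        if (pthread_create(&tid[i], NULL, sum_slice, &part[i]) != 0) {
            rc = -1;
            break;
        }
        started++;
    }

    *total = 0;
    for (int i = 0; i < started; i++) {
        if (pthread_join(tid[i], NULL) != 0)
            rc = -1;
        else
            *total += part[i].sum;
    }
    return rc;
}
```

Calling parallel_sum(v, n, 1, &total) requests a concurrency level
of 1, which is the "just set it to 1" case from the quoted text.
Whether the threads actually run in parallel at any other level
depends entirely on the threading implementation underneath.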
It's just that it would be real silly to abandon a working Draft 10
(standard) pthreads to chase after what some people in this thread
are claiming is a computational holy grail, and which others in this
thread have already had experience with on Solaris 2.3 and below
and SVR4.0.2 and UnixWare 2.x.  I can tell you: it's not even a
grail-shaped beacon.

> Terry has a point about wanting to design the system to be
> fast from the beginning.  That is almost certainly better
> than to design something, realize it is way too slow, and then
> hack on it forever.  However, having this working in the
> short term and rewriting it for the long term doesn't upset
> me too much -- I just want it working, and it will certainly
> be good enough for a large range of applications (even if
> it isn't large enough or good enough for Terry).

It won't be better than a user space call conversion scheduler
for the vast majority of threaded applications.  I've been through
the benchmarks on the code that was posted to -current and on the
similar SVR4 N:N kernel threading model and the SVR4 M:N, M>N
"let's starve all the user space threads from getting quantum"
Solaris 2.3 and UnixWare 2.x model.

It's not a question of "not good enough", it's a question of "if
you aren't intending on following the implementation through to
completion, there's no reason to bother starting down the road at
all".

I guess what I'm trying to point out is that there is a crisis of
commitment; merely having kernel threads won't make the code go
faster.  SMP scalability is not merely the ability to block between
threads waiting on each other's resources in the Big Giant Lock(tm)
in the kernel.  You can block on the User Space Call Conversion
Scheduler(tm) instead, and achieve exactly the same (lack of) effect.

If you are truly interested in pursuing SMP scalability via kernel
threads, the way to do it is to take the Dyson vfork() code, run
it on your own machine, and work up from there.

Insisting that FreeBSD abandon a non-SMP scalable call conversion
model in favor of a non-SMP scalable context switch and cache
busting kernel threading model is not the way to go.

To get anywhere with that argument, you are going to have to be
able to beat the user space threads with your kernel space threads
on a uniprocessor system, and show improvement, or at least no
degradation, on an SMP system relative to using the user space
scheduler.

Kernel thread context switches are *not* lighter weight than
process context switches.  The cost is about equal, unless you have
some way of assuring CPU <-> thread group affinity.  Even then, you
are talking about starving other processes in favor of the thread
group unless you are very, very careful implementing your code.

The only place this won't be true is a rigged benchmark on an
otherwise idle machine, such that your benchmark process never has
to compete with any other process, and so the page table is never
pushed out to make way for the page table for "init" or "syncd" or
"nfsiod", etc., etc.

This is not a trivial problem to solve, and hitting the currently
limping-but-functional user space threads over the head with a
shovel and dragging the body away to make room for a different and
limping-but-even-less-functional new body won't cut it.

Ugh.  I need to drag my IEEE SMP and parallel processing literature
out of my back bedroom when I get home tonight, I guess.  Then I'll
be able to quote you chapter and verse.  8-(.

					Terry Lambert
					terry@lambert.org
---
Any opinions in this posting are my own and not those of my present
or previous employers.

To Unsubscribe: send mail to majordomo@FreeBSD.org
with "unsubscribe freebsd-smp" in the body of the message