Date: Tue, 4 Jul 2006 18:19:04 -0700
From: Peter Wemm <peter@wemm.org>
To: freebsd-threads@freebsd.org
Cc: Daniel Eischen <deischen@freebsd.org>, threads@freebsd.org, Robert Watson <rwatson@freebsd.org>, Julian Elischer <julian@elischer.org>, David Xu <davidxu@freebsd.org>
Subject: Re: Strawman proposal: making libthr default thread implementation?
Message-ID: <200607041819.05510.peter@wemm.org>
In-Reply-To: <44AAC47F.2040508@elischer.org>
References: <20060703101554.Q26325@fledge.watson.org> <200607042204.52572.davidxu@freebsd.org> <44AAC47F.2040508@elischer.org>
On Tuesday 04 July 2006 12:41 pm, Julian Elischer wrote:
> David Xu wrote:
> > On Tuesday 04 July 2006 21:08, Daniel Eischen wrote:
> > > The question was what does libthr lack. The answer is priority
> > > inheritance and protect mutexes, and also SCHED_FIFO, SCHED_RR, and
> > > (in the future) SCHED_SPORADIC scheduling. That is what I stated
> > > earlier in this thread.
> >
> > As other people said, we need performance. These features are, as you
> > said, in the future, but I don't think they are more important than the
> > performance problem. You have to answer people what they should do when
> > they bought two CPUs but the system works as if they only have one. As
> > the major author of libpthread, in the past you decided to keep silent,
> > ignoring such requirements. Also, the signal queue may not work
> > reliably with libpthread; this nightmare appears again.
>
> As much as it pains me to say it, we could do with looking at using the
> simpler 1:1 mode as the default. M:N does work, but it turns out that
> many of the promised advantages are phantoms, due to the complexity of
> actually implementing it.

At BSDCan, I tinkered with a checkout of the cvs tree to see what the kernel side of things would look like if M:N support came out. The result is an amazing improvement in code clarity, and it enables a bunch of other optimizations to be done with greater ease. For example, the C code executed between an interrupt and ithread dispatch can easily be reduced by about 75%. This simplification enabled Kip to do a bunch of scalability work as well (per-cpu scheduling locks, per-cpu process lists, etc).

However, my objectives there were quite different from what Robert has raised. My objectives were a 'what if?'.
People have complained in the past that the complexity KSE adds to the kernel context-switching code gets in the way of other optimizations they'd like to try, so I figured this would be a good way to call them on that and see whether it really does help. I was hoping to be able to present a list of things we'd gain as a result, but unfortunately the cat is out of the bag a bit earlier than I'd have liked. I never really intended to bring it up until there was something to show for it. I know Kip has done some amazing work already, but I was hoping for other things as well before going public.

FWIW, my skunkworks project is in perforce: //depot/projects/bike_sched/... and there is a live diff: http://people.freebsd.org/~peter/bike_sched.diff (Yes, the name was picked long before this thread started.) It does NOT have any of Kip's optimization work in it; it was just meant as a baseline for other people to experiment with. I've tested it with 4BSD as the scheduler. ULE might work, but I have not tried it. SCHED_CORE will not compile in that tree because I haven't yet gone over the diffs from David Xu.

I run this code on my laptop with libmap.conf redirecting libpthread to libthr. It works very well for me, even with threaded apps like firefox.

Anyway, back to the subject at hand. The basic problem with the KSE/SA model as I see it (besides the kernel code complexity) is that it doesn't really suit the kind of threaded applications that people want to run on unix boxes. In a traditional 1:1 threading system (e.g. linuxthreads/nptl, libthr, etc.), mutex blocking is expensive, but system calls and blocking in kernel mode cost the same as for a regular process making system calls or blocking in kernel mode. Because Linux was the most widely and massively deployed threading system out there, people tended to write (or modify) their applications to work best with those assumptions.
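[Editor's note: the libmap.conf redirection mentioned a few paragraphs up can be sketched roughly as below. The shared-library version numbers here are illustrative only (they varied across FreeBSD releases), so check the installed .so versions before copying this.]

```
# /etc/libmap.conf -- sketch: map libpthread onto libthr process-wide.
# Version numbers are illustrative; match the ones on your system.
libpthread.so.2    libthr.so.2
libpthread.so      libthr.so
```

The same mapping can also be scoped to a single program by putting it under a `[/path/to/binary]` constraint block.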
That is: keep pthread mutex blocking to an absolute minimum, and don't worry about kernel blocking.

However, with the SA/KSE model, our tradeoffs are different. We make pthread mutex blocking cheaper (except for UTS bugs that can make it far slower), but we make blocking in kernel context significantly more expensive than in the 1:1 case, probably as much as double the cost. For applications that block in the kernel a lot instead of on mutexes, this is a big source of pain. Since most of the applications we're called on to run are written with the Linux behavior in mind, we usually come off worst when our performance is compared against Linux.

I'm sure there are threaded applications that benefit from cheap mutex operations, but I'm not personally aware of them. I do know that the ones we regularly get compared to Linux on are the likes of mysql, squid and threaded http servers. All of those depend on kernel blocking being as fast as possible, and faster mutexes don't seem to compensate for the extra cost of kernel blocking. I don't know where java fits into this picture.

We've proven that we can make KSE work, but it was far harder than we imagined, and unfortunately the real-world apps that matter most just don't seem to take advantage of it. Not to mention the complexity we have to work around for scalability work. Speaking of scalability, 16- and 32-way systems are here already and will be common within 7.0's lifetime. If we don't scale, we're sunk. My gut tells me that we HAVE to address the complexity that the KSE kernel code adds in order to improve this. We can barely work well on 4-cpu systems, let alone 32-cpu systems.

PS: I think it would be interesting to see a hybrid user-level M:N system, even one as simple as multiplexing user threads onto a group of kernel threads (without M:N kernel support) and doing libc_r-style syscall wrappers to intercept long-term blockable operations like socket/pipe IO etc.
For short-term blocking (disk IO), just wear the cost of letting one thread block for a moment. I suspect that large parts of libpthread could be reused and some bits brought back from libc_r. I think this would do a fairly decent job for things like computational threaded apps, because mutexes would be really fast.

PPS: My opinions are not meant as a criticism of the massive amount of work that has gone into making KSE work. It is more an attempt to step back and take an objective look at the ever-changing big picture.

-- 
Peter Wemm - peter@wemm.org; peter@FreeBSD.org; peter@yahoo-inc.com
"All of this is for nothing if we don't go to the stars" - JMS/B5