From: Robert Watson
Date: Wed, 5 Jul 2006 09:48:16 +0100 (BST)
To: Peter Wemm
Cc: Daniel Eischen, threads@freebsd.org, David Xu, Julian Elischer, freebsd-threads@freebsd.org
Subject: Re: Strawman proposal: making libthr default thread implementation?

On Tue, 4 Jul 2006, Peter Wemm wrote:

> Because Linux was the most widely and massively deployed threading system
> out there, people tended to write (or modify) their applications to work
> best with those assumptions. ie: keep pthread mutex blocking to an absolute
> minimum, and not care about kernel blocking.
>
> However, with the SA/KSE model, our tradeoffs are different.
> We implement pthread mutex blocking more quickly (except for UTS bugs that
> can make it far slower), but we make blocking in kernel context
> significantly higher cost than the 1:1 case, probably as much as double the
> cost. For applications that block in the kernel a lot instead of on
> mutexes, this is a big source of pain.
>
> When most of the applications that we're called to run are written with the
> linux behavior in mind, when our performance is compared against linux
> we're the ones that usually come off the worst.

The problem I've been running into is similar but different. The reason for
my asking about libthr being the default is that, in practice, our
performance optimization advice for a host of threaded applications has been
"Switch to libthr". This causes quite a bit of complexity from a network
stack optimization perspective, because the behavior of threaded network/IPC
applications changes enormously if the threading model is changed. As a
result, the optimization strategies differ greatly.

To motivate this, let me give you an example. Widely distributed MySQL
benchmarks are basically kernel IPC benchmarks, and on multi-processor
systems, this means they basically benchmark context switch, scheduling,
network stack overhead, and network stack parallelism. However, the locking
hot spots differ significantly based on the threading model used. There are
two easily identified reasons for this:

- Libpthread "rate limits" threads entering the kernel in the run/running
  state, resulting in less contention on per-process sleep mutexes.

- Libthr has greater locality of behavior, in that user thread activities
  map more directly onto kernel-visible threads.

Consider the case of an application that makes frequent short accesses to
file descriptors -- for example, by sending lots of short I/Os on a set of
UNIX domain sockets from various worker threads, each performing
transactions on behalf of a client via IPC.
This is, FYI, a widely deployed programming approach, and is not limited to
MySQL. The various user threads will be constantly looking up file
descriptor numbers in the file descriptor array; often, the same thread will
look up the same number several times (accept, i/o, i/o, i/o, ..., close).
This results in very high contention on the file descriptor array mutex,
even though individual uses are short. In practice, libpthread sees somewhat
lower contention, because in the presence of adaptive mutexes, kernel
threads spin rather than block, so libpthread does not push further threads
in to contend on the lock.

However, one of the more interesting optimizations to explore involves
"loaning" file descriptors to threads, in order to take advantage of
locality of reference: repeated access to the same fd becomes cheaper, but
revocation of the loan for use by another thread becomes more expensive. In
libthr, we have lots of locality of reference, because user threads map 1:1
to kernel threads; in libpthread, this is not the case, as user threads
float across kernel threads, and even if they do get mapped to the same
kernel thread repeatedly, their execution in the presence of blocking is
discontinuous in that kernel thread.

This makes things tricky for someone working on reducing contention in the
kernel as the number of threads increases: do I optimize for libpthread,
which offers little or no locality of reference with respect to mapping user
thread behavior to kernel threads, or do I optimize for libthr, which offers
high locality of reference? Since our stock advice is to run libthr for high
performance applications, the design choice should be clear: I should
optimize for libthr. However, in doing so, I would likely heavily pessimize
libpthread performance, as I would basically guarantee that heuristics based
on user thread locality would fail with moderate frequency, because the
per-kernel-thread working set of kernel objects is significantly larger.
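
The trade-off behind the fd "loan" idea can be made concrete with a toy
model. The sketch below is not kernel code: the loan rule, the function
names, and the round-robin scheduling of user threads in the M:N case are
all illustrative assumptions, chosen as a pessimistic bound rather than a
description of what libpthread actually does. It only counts how often an
fd lookup would have to fall back to the shared file descriptor array lock.

```python
# Toy model of the fd "loan" idea. All names and rules here are illustrative
# assumptions, not FreeBSD kernel code.

def count_shared_lookups(accesses):
    """Count accesses that must take the shared fd-table lock.

    accesses: iterable of (kernel_thread, fd) pairs in execution order.
    Assumed rule: an fd is "loaned" to the last kernel thread that looked it
    up; a repeat access by that thread is cheap, while any other access takes
    the shared lock and moves (revokes) the loan.
    """
    loan_holder = {}  # fd -> kernel thread currently holding the loan
    shared = 0
    for kernel_thread, fd in accesses:
        if loan_holder.get(fd) != kernel_thread:
            shared += 1
            loan_holder[fd] = kernel_thread
    return shared


def one_to_one(workers, ops_per_txn):
    """1:1 model: user thread i always runs on kernel thread i, using fd i."""
    return [(i, i) for i in range(workers) for _ in range(ops_per_txn)]


def m_to_n(workers, ops_per_txn, kernel_threads):
    """M:N model under a pessimistic assumption: each successive operation of
    user thread i resumes on the next kernel thread, round-robin."""
    return [((i + j) % kernel_threads, i)
            for i in range(workers) for j in range(ops_per_txn)]


if __name__ == "__main__":
    # 4 workers, 5 fd operations per transaction (accept, 3 x i/o, close).
    print("1:1 shared lookups:", count_shared_lookups(one_to_one(4, 5)))
    print("M:N shared lookups:", count_shared_lookups(m_to_n(4, 5, 2)))
```

Under these assumptions the 1:1 case takes the shared lock 4 times (once per
fd), while the M:N case takes it on all 20 accesses, which is the sense in
which a locality-based heuristic tuned for libthr could pessimize
libpthread.
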

FWIW, you can quite clearly measure the difference in file descriptor array
lock contention using the http/httpd micro-benchmarks in
src/tools/tools/netrate. If you run without threading, performance is
better, in significant part because there is much less contention. This is
an interesting, and apparently counter-intuitive, observation: many people
believe that the reduced context switching and greater cache locality of
threaded applications always result in improved performance. This is not
true for a number of important workloads -- by operating with more shared
data structures, contention on those shared data structures is increased,
reducing performance. Moving to the two threading models, you see markedly
better libpthread performance under extremely high load involving many
threads with small transactions, as libpthread provides heuristically better
management of kernel load. This advantage does not carry over to real-world
application loads, however, which tend to use smaller worker thread pools
with sequences of locality-rich transactions, which is why libthr performs
better as the workload approaches real-world conditions. This
micro-benchmark makes for quite an interesting study piece, as you can
easily vary the thread/proc model, the number of workers, and the
transaction size, giving pretty clear performance curves to compare.

Anyhow, my main point in raising this thread was actually entirely about the
initial observation, which is that in practice, we find ourselves telling
people who care about performance to use libthr. If our advice is always
"use libthr instead of the default", that suggests we have a problem with
the default. Switching the default requires an informed decision: what do we
lose, not just what do we gain. Dan has now answered this question -- we
lose support for a number of realtime scheduling primitives if we switch
today without further work.
I think the discussion of the future of M:N support is also critical,
though, as it has an immediate impact on kernel optimization strategies,
especially as the number of CPUs grows. In case anyone failed to notice,
it's now possible to buy hardware with 32 "threads" for <$10,000, and the
future appears relatively clear -- parallelism isn't just for high-end
servers, it now appears in off-the-shelf notebook hardware, and appears to
be the way that vendors are going to continue to improve performance. Having
spent the last five years working on threading and SMP, we're well-placed to
support this hardware, but it requires us to start consolidating our gains
now, which means deciding what the baseline is for optimization when it
comes to threaded applications.

Robert N M Watson
Computer Laboratory
University of Cambridge