Date: Fri, 20 Sep 2002 23:51:58 -0700
From: Terry Lambert
To: Rik van Riel
Cc: "Bill Huey (Hui)", Julian Elischer, freebsd-arch@freebsd.org
Subject: Re: New Linux threading model

Rik van Riel wrote:
> On Fri, 20 Sep 2002, Terry Lambert wrote:
> > In the design I'm talking about, there is only a single lock,
> > and that lock is on a migration queue, into which you *push*
> > entries.  This is different from the Linux scheduler, the current
> > FreeBSD scheduler, and the current FreeBSD scheduler with
> > Alfred's old (last year) affinity patches applied.
>
> Very nice design, IF the cost of contention is higher than
> the cost of having CPUs sit idle until the next rescheduling
> happens on one of the busier CPUs...
>
> Say that you have a moderately loaded box with, on average,
> about as many runnable tasks as you have CPUs.  These tasks
> will, due to statistics, not always be evenly distributed
> across CPUs.

This will never happen in the real world, where the steady state of
system load (as a measure of the number of processes, on average, in
the ready-to-run state) is > 1.  8-).

It also makes a really serious assumption about the number of CPUs, or
a seriously invalid one about where the average load of a desktop or
internet server sits.

But say we accept that, for the sake of argument, even though we know
that threads will drastically inflate the number of things in the
"ready to run" state, even if we decide to call those things something
other than "processes".

> Say that distribution is near-perfect and this results in
> only 1% CPU idle time.
>
> Your scheme would win ONLY if the lock contention in the
> pull model is responsible for more than that 1% CPU time.

Not true.

There is a common myth in the x86 world: that shared memory
multiprocessors based on the x86 architecture stop scaling after 4
CPUs.  In point of fact, Sequent built 32-processor 486, and later
Pentium, systems which were shared memory multiprocessors.  They
scaled significantly beyond that point, no problem.  I used to program
applications on these beasts, and on the 68K-based ones as well (i.e.
both Sequent Symmetry and Sequent Balance machines).

So what's the barrier in the SVR4 and Solaris -- and Linux, and
FreeBSD -- OS architectures that makes everyone believe and propagate
this myth?

It all comes down to contention for shared resources.
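To make the push model quoted at the top of this message concrete,
here is a minimal userland sketch.  It is mine, not code from any real
scheduler: the names (struct cpu_sched, migration_push(), and so on)
are invented, and a pthread mutex stands in for a kernel spinlock.
The point is that a CPU touches only its own, unlocked run queue in
the common path; the only lock anywhere is on the per-CPU migration
queue that *other* CPUs push into.

    #include <pthread.h>
    #include <stddef.h>

    struct task {
        struct task     *next;
        /* ... scheduling state ... */
    };

    /*
     * Per-CPU scheduler state.  The local run queue is touched only
     * by its owning CPU, so it needs no lock.  The migration queue is
     * the one shared object: remote CPUs push into it under its lock,
     * and the owner drains it at reschedule time.
     */
    struct cpu_sched {
        struct task     *runq;      /* owner-only: no lock */
        pthread_mutex_t  mig_lock;  /* the single point of contention */
        struct task     *migq;      /* tasks pushed here by other CPUs */
    };

    /* A remote CPU hands a task to "cpu"; the only locked path. */
    static void
    migration_push(struct cpu_sched *cpu, struct task *t)
    {
        pthread_mutex_lock(&cpu->mig_lock);
        t->next = cpu->migq;
        cpu->migq = t;
        pthread_mutex_unlock(&cpu->mig_lock);
    }

    /* The owning CPU drains its migration queue into its run queue. */
    static void
    migration_drain(struct cpu_sched *cpu)
    {
        struct task *t, *next;

        /*
         * Unlocked peek at the common, empty case; a push that races
         * with this check just waits for the next drain.  (A real
         * kernel would want an explicit memory barrier here.)
         */
        if (cpu->migq == NULL)
            return;

        pthread_mutex_lock(&cpu->mig_lock);
        t = cpu->migq;
        cpu->migq = NULL;
        pthread_mutex_unlock(&cpu->mig_lock);

        while (t != NULL) {         /* lock already dropped */
            next = t->next;
            t->next = cpu->runq;
            cpu->runq = t;
            t = next;
        }
    }

In the steady state the migration queue is empty, nothing but the
owner ever looks at the run queue, and the lock's cache line isn't
bouncing between CPUs; contention only shows up when a migration
actually has to happen.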
And these days, that contention boils down to, in increasing order of
severity:

o  TLB/L1 cache coherency
o  L2 cache memory
o  System memory
o  The I/O bus(ses)

...it's really, really important to remember that the "I" in "MESI"
stands for "Invalid".  8-).

> If we can keep the CPUs busier by pulling tasks onto idle
> CPUs and the locking overhead is less than 1%, it's a win.

With respect, most systems are rarely CPU-bound these days; when they
*are* CPU-bound, it's because of a stall introduced by an I/O or main
memory access.

Consider a very fast Intel system these days; it's common to find
machines that run at ~2.1GHz.  The fastest front-side memory bus in
common use is 433MHz -- roughly one fifth of that speed.  Now take a
simple lock -- a compare and exchange -- and you are talking a factor
of 10 or 15 on the access, vs. the instructions used in the access,
because there is a barrier that requires that MESI cache coherency be
maintained on the lock contents: 5 in, 5 out, and (maybe) another 5
in.

*Maybe*, if you limited yourself to non-shared-memory systems, you
could get the speed win that you wanted; that basically means "no
hyperthreading", or it means "NUMA".

Now let's "solve" the general SMP shared memory multiprocessor scaling
problem, in its entirety.

The normal way to solve this is: don't share the memory.  A different
way of putting this is "a resource which is not shared is a resource
which is not contended".

How do you achieve this virtual state, on real hardware, when the real
hardware *does* share resources?  The easy way to do it is to break up
ownership of the system resources so that contention is minimized.
This means:

o  Eliminate locking from common non-failure code paths.

o  Eliminate resource contention by assigning resources to
   per-processor pools, which can be refilled from, and flushed to,
   contended system-wide pools -- *only* when necessary (a sketch of
   this follows further down).

o  Eliminate locking from failure code paths.

o  Eliminate cache discard (TLB shootdown, L1 and L2 cache
   invalidation, reloading of data from devices if the data is
   already in memory).

o  Assign data channels to specific CPUs, rather than running in
   virtual wire mode for interrupt processing, or transiently
   assigning interrupts to a single CPU simply because it has
   acquired a global lock.

o  Enforce cache locality (common code path coherency, cache
   coloring, etc.; Ingo Molnar, one of the authors of the paper cited
   earlier, has done this for the Tux in-kernel HTTP server, so that
   the common code path all fits in cache: TCP stack, web server, and
   all).

...in short, pretty much everything Sequent did to Dynix, reaching a
peak between 1992 and 1994 or so, and what SGI has been arguing for
inclusion in Linux, based on the patch sets they've provided.

It's all fine and good to argue CPU time when there isn't a clock
multiplier, or when CPU time is your most contended resource.  But
this is only true, even for CPU-intensive applications, if those
applications are inherently parallelizable -- the kind of applications
where you can hook a bunch of toy computers together and get the same
results that you would get from a real supercomputer.

Even then, it only really works if there is isolated locality of
reference for the problem subsets between the CPUs: if everyone is
writing data into the same page in memory, then it's going to be a
contended resource, and it's going to bottleneck the computations in
contention for that resource.
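As an illustration of the per-processor pool item in the list above,
here is a minimal sketch.  Again, it is mine rather than code from any
real allocator: pcpu_pool, REFILL_BATCH and friends are invented
names, and a pthread mutex again stands in for the kernel lock on the
system-wide pool.  The lock is taken only when a CPU's private pool
runs dry; frees stay local, so in the steady state each CPU's
allocator is effectively private memory.

    #include <pthread.h>
    #include <stddef.h>

    struct obj {
        struct obj      *next;
    };

    /* The system-wide pool: the only locked, shared structure. */
    static pthread_mutex_t  global_lock = PTHREAD_MUTEX_INITIALIZER;
    static struct obj      *global_free;

    /* A per-CPU pool: touched only by its owning CPU, so no lock. */
    struct pcpu_pool {
        struct obj      *free;
        int              nfree;
    };

    #define REFILL_BATCH    32  /* arbitrary; tune against contention */

    /* Common path: no lock taken, no shared cache lines touched. */
    static struct obj *
    pcpu_alloc(struct pcpu_pool *p)
    {
        struct obj *o;
        int i;

        if (p->free == NULL) {
            /* Rare path: refill a batch from the contended pool. */
            pthread_mutex_lock(&global_lock);
            for (i = 0; i < REFILL_BATCH && global_free != NULL; i++) {
                o = global_free;
                global_free = o->next;
                o->next = p->free;
                p->free = o;
                p->nfree++;
            }
            pthread_mutex_unlock(&global_lock);
        }

        o = p->free;
        if (o != NULL) {            /* NULL only if the system is out */
            p->free = o->next;
            p->nfree--;
        }
        return (o);
    }

    /*
     * Frees stay local.  Nothing is handed back to the system-wide
     * pool until it signals a shortage (the low-watermark case
     * discussed below), so the global lock is never touched on free.
     */
    static void
    pcpu_free(struct pcpu_pool *p, struct obj *o)
    {
        o->next = p->free;
        p->free = o;
        p->nfree++;
    }

Whether the batch size is 32 or something adaptive doesn't matter
much; what matters is that the lock and the global free list are
touched once per batch instead of once per allocation.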
In the common case, main memory is generally a full order of magnitude
slower than L1 cache.  And disk is a full order of magnitude slower
than that.  And for any network application, any disk I/O you do is
going to contend with the network for I/O bus time.

So, back on the "Subject:" line...

The benefit of separating the scheduler queues -- without introducing
locks except in the exceptional code paths, and delaying even that, if
possible, so that it acts as a smoothing function for transient load
spikes (e.g. make the figure of merit for CPU load a weighted moving
average of the last N snapshots of the instantaneous load) -- is not
the be-all, end-all of optimizations.  On the other hand, it doesn't
suck, and it's not something you can so easily dismiss by claiming
that the CPU load is "only" 99%.

In other words, we are preventing objects in one ownership domain from
moving to another one, except under extraordinary circumstances,
because we know that that's where contention arises.

Another part of the answer is going to a memory allocator that uses
per-CPU pools of memory, which are only rarely refilled from the
system common pool, and which only rarely span the same locality as
that of another CPU... so that locks are not needed, TLB shootdown
doesn't occur, and the L2 and L1 cache contents are rarely invalidated
simply because the same memory has been touched by a different CPU.
Again: making changes in ownership extraordinary.

Yeah, there are some obvious optimizations we can get on this, too.
One is to allocate the memory and other resources, as necessary, to
each CPU (real or "hyperreal"), and not give them back when the free
list hits a high watermark.  Instead, leave them there until the
system pool hits a low watermark, and then signal the CPUs that there
is an insufficiency.  This basically eliminates all the thrashing
until there is a shortage -- if there ever is one -- of a given
contended resource.

If you want to argue this, the thing to do is to write the code to
allow tunable contention levels, and to gather statistics on the
real-world performance of the algorithm *under contention*; the
chances of memory contention are greater than 5 times those of CPU
contention -- the average over the last 10 years has been pretty
steady at 10 times, actually, and has fluctuated as high as 20 times.
And the memory bus isn't even the slowest point in the data flow
through a computer; I/O is.  So force the contention issue, and
measure how the algorithm you are defending degrades.

If you want to look at L2 contention directly, then a post-processor
that counts uses of the LOCK prefix would be useful -- it's most
useful if all your mutex code gets inlined, so that you count real
occurrences rather than referenced occurrences.

In most cases, if you are grabbing a lock, it's because you have
failed to nail the problem, or you are working around having to
rewrite all of the code.  Neither one of these is actually damning,
but it's good to know when you're doing it.

-- Terry
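P.S.: For the smoothing function mentioned above, a weighted moving
average over the last N instantaneous load snapshots is about all it
takes.  The fragment below is purely illustrative; the window size and
the linear weights are arbitrary choices of mine, not something from
any existing scheduler.

    #define NSAMPLES    8   /* arbitrary window size */

    /*
     * Figure of merit for a CPU: a weighted moving average of the
     * last NSAMPLES instantaneous run queue lengths, with the most
     * recent sample weighted heaviest.  Called only by the owning
     * CPU, so no locking is involved.
     */
    static unsigned int
    load_merit(const unsigned int samples[NSAMPLES])
    {
        unsigned int sum = 0, wsum = 0;
        int i;

        for (i = 0; i < NSAMPLES; i++) {
            /* samples[NSAMPLES - 1] is the newest snapshot */
            sum += (i + 1) * samples[i];
            wsum += i + 1;
        }
        return (sum / wsum);
    }

A transient spike in one snapshot moves the figure of merit only a
little, so migrations get triggered by sustained imbalance rather than
by noise.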