From owner-freebsd-hackers  Mon Feb 15 16:45:52 1999
Return-Path:
Received: (from majordom@localhost)
	by hub.freebsd.org (8.8.8/8.8.8) id QAA03426
	for freebsd-hackers-outgoing; Mon, 15 Feb 1999 16:45:52 -0800 (PST)
	(envelope-from owner-freebsd-hackers@FreeBSD.ORG)
Received: from smtp04.primenet.com (smtp04.primenet.com [206.165.6.134])
	by hub.freebsd.org (8.8.8/8.8.8) with ESMTP id QAA03409
	for ; Mon, 15 Feb 1999 16:45:43 -0800 (PST)
	(envelope-from tlambert@usr02.primenet.com)
Received: (from daemon@localhost)
	by smtp04.primenet.com (8.8.8/8.8.8) id RAA10185;
	Mon, 15 Feb 1999 17:55:49 -0700 (MST)
Received: from usr02.primenet.com(206.165.6.202) via SMTP
	by smtp04.primenet.com, id smtpd010120; Mon Feb 15 17:55:34 1999
Received: (from tlambert@localhost)
	by usr02.primenet.com (8.8.5/8.8.5) id RAA22056;
	Mon, 15 Feb 1999 17:45:20 -0700 (MST)
From: Terry Lambert
Message-Id: <199902160045.RAA22056@usr02.primenet.com>
Subject: Re: Processor affinity?
To: will@iki.fi (Ville-Pertti Keinonen)
Date: Tue, 16 Feb 1999 00:45:20 +0000 (GMT)
Cc: dyson@iquest.net, hackers@FreeBSD.ORG
In-Reply-To: <864sonmqvm.fsf@not.oeno.com> from "Ville-Pertti Keinonen" at Feb 15, 99 09:03:09 pm
X-Mailer: ELM [version 2.4 PL25]
MIME-Version: 1.0
Content-Type: text/plain; charset=US-ASCII
Content-Transfer-Encoding: 7bit
Sender: owner-freebsd-hackers@FreeBSD.ORG
Precedence: bulk
X-Loop: FreeBSD.ORG

> > # of active CPU cycles.  I do have affinity code for SMP, and it makes
> > a positive difference in performance, even with the big-lock FreeBSD
> > kernel.
>
> What does it do?
>
> In order to be more useful than a last-run-on hint (which is pretty
> much useless for time-sharing), it needs to be able to select a
> thread that doesn't have the highest priority to run when the best
> processor for the highest-priority runnable (but not active) thread
> is (momentarily) running an even higher-priority thread.
>
> Doing something like this in a BSD scheduler is a bit difficult
> because of the priority buckets.  It seems to me that you either
> give up O(1) thread selection (the Linux folks seem to be happy
> with O(n), but I don't like the idea) or have to do something
> moderately complex (such as per-processor run queues with load
> balancing, like DEC did with OSF/1).
>
> Or did you find a more elegant solution?

A very trivial affinity solution is to have per-CPU scheduler queues,
and to keep a "quantum count" per CPU as well.

Processes becoming "ready to run" are queued on the processor with the
highest quantum count (i.e., the highest process turnover rate,
indicating a relatively more I/O-bound load).

Hysteresis is introduced so that processes are "sticky"; that is,
unless the quantum count differential exceeds a certain amount, the
process "prefers" to remain on the local CPU queue.

This algorithm provides enough affinity that blatant cache busting can
be mostly avoided.  Note that one characteristic load turning over 15
quanta in N time units while another characteristic load turns over 7
quanta in the same N time units is normal, natural, and not a bad
thing.

A second trivial modification to the algorithm attempts to delay
migration.  If migration occurs from a processor with M processes
consuming quantum, then you will want to delay for M+1 quanta so that
the quantum count reflects the effect of the previous migration before
jumping into the next migration.  In reality, you want M(theta) -- the
M quantum clock ticks for the most loaded CPU.
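To make the quantum-count idea concrete, here is a rough sketch in C.
This is not actual FreeBSD code; the names (struct cpu_sched,
quantum_count, AFFINITY_HYSTERESIS, sched_pick_cpu) are invented for
illustration, and the real per-CPU run queues are elided.

	/*
	 * Sketch only: pick the CPU on whose run queue a newly runnable
	 * process should be placed, using per-CPU quantum counts with
	 * hysteresis so the process is "sticky" to its last CPU.
	 */
	#define NCPU			4
	#define AFFINITY_HYSTERESIS	8	/* differential needed to migrate */

	struct cpu_sched {
		int	quantum_count;	/* quanta turned over on this CPU */
		/* ... per-CPU run queues would live here ... */
	};

	struct cpu_sched cpu_sched[NCPU];

	int
	sched_pick_cpu(int last_cpu)
	{
		int cpu, best = last_cpu;

		/* Find the CPU with the highest turnover (most I/O bound). */
		for (cpu = 0; cpu < NCPU; cpu++)
			if (cpu_sched[cpu].quantum_count >
			    cpu_sched[best].quantum_count)
				best = cpu;

		/* Hysteresis: stay put unless the differential is large. */
		if (best != last_cpu &&
		    cpu_sched[best].quantum_count -
		    cpu_sched[last_cpu].quantum_count <= AFFINITY_HYSTERESIS)
			best = last_cpu;

		return (best);
	}

The hysteresis constant is the tunable: too small and you get the
blind migration described below, too large and you lose the load
balancing that the quantum counts are supposed to provide.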
> > The reason for diminishing gains on a 2 or 4 cpu system isn't bandwidth as
> > much as it is the availability of fewer CPUs to choose from.  By blindly
> > choosing the wrong cpu, there will be much more latency associated with
> > refilling the cache on context switch.
>
> And with affinity, particularly if it is too strong, you'll
> occasionally have far more latency associated with getting a thread
> to run again when the right cpu wasn't available when the thread
> would "naturally" have run.

The answer is not to "blindly" choose CPUs.  Process loading
characteristics are generally metastable over time.

A third affinity optimization is "negative affinity".  This modifies
scheduling behaviour for the threads within a process, preferring to
place them on different CPUs, in order to maximize effective
concurrency.  A base assumption of this technique is that the
programmer did the functional decomposition into threads correctly in
the first place, such that the threads have little mutex or IPC
interaction and/or heap sharing, which would otherwise result in
interprocessor synchronization hits.

It's basically all in how you implement it.

Note that schedulable entity affinity is only one possible win.  As
well as cache lines, there is per-CPU system resource affinity
(per-CPU pools) that can be leveraged in exactly the same way to
further reduce latency.


					Terry Lambert
					terry@lambert.org
---
Any opinions in this posting are my own and not those of my present
or previous employers.


To Unsubscribe: send mail to majordomo@FreeBSD.org
with "unsubscribe freebsd-hackers" in the body of the message
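As a rough illustration of the "negative affinity" idea above, the
quantum-count sketch could be extended roughly as follows.  Again,
this is invented for illustration, not FreeBSD code; the helper
cpu_queue_has_sibling() is hypothetical.

	/*
	 * Sketch of negative affinity: when placing a newly runnable
	 * thread, prefer a CPU whose run queue does not already hold a
	 * sibling thread of the same process, on the assumption that a
	 * correct decomposition left the threads little shared state.
	 */
	#define NCPU	4

	/* Hypothetical: does this CPU's queue hold a thread of "pid"? */
	extern int	cpu_queue_has_sibling(int cpu, long pid);
	extern int	sched_pick_cpu(int last_cpu);	/* earlier sketch */

	int
	sched_pick_cpu_negaff(int last_cpu, long pid)
	{
		int cpu, best;

		/* Start from the quantum-count/hysteresis choice. */
		best = sched_pick_cpu(last_cpu);
		if (!cpu_queue_has_sibling(best, pid))
			return (best);

		/* A sibling is already queued there; spread out instead. */
		for (cpu = 0; cpu < NCPU; cpu++)
			if (!cpu_queue_has_sibling(cpu, pid))
				return (cpu);

		return (best);	/* every CPU has a sibling; fall back */
	}

Per-CPU resource pools would be leveraged the same way: a thread that
stays on (or is deliberately spread to) a given CPU can allocate from
that CPU's pool without cross-processor synchronization.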