From: Andriy Gapon <avg@FreeBSD.org>
To: Alexander Motin
Cc: freebsd-hackers@FreeBSD.org
Date: Sat, 11 Feb 2012 15:35:11 +0200
Subject: Re: [RFT][patch] Scheduling for HTT and not only

on 06/02/2012 09:04 Alexander Motin said the following:
> Hi.
>
> I've analyzed scheduler behavior and I think I have found the problem with
> HTT.  SCHED_ULE knows about HTT and does the right things when doing load
> balancing once a second.  Unluckily, if some other thread gets in the way,
> a process can easily be pushed out to another CPU, where it will stay for
> another second because of CPU affinity, possibly sharing a physical core
> with something else without need.
>
> I've made a patch reworking the SCHED_ULE affinity code to fix that:
> http://people.freebsd.org/~mav/sched.htt.patch
>
> This patch does three things:
> - Disables the strict affinity optimization when HTT is detected, to let
> the more sophisticated code take the load of the other logical core(s)
> into account.
> - Adds affinity support to the sched_lowest() function to prefer the
> specified (last used) CPU (and the CPU groups it belongs to) in case of
> equal load.  The previous code always selected the first valid CPU among
> equals, which caused threads to migrate to lower-numbered CPUs without
> need.
> - If the current CPU group has no CPU where the process, with its
> priority, can run now, sequentially check the parent CPU groups before
> doing a global search.  That should improve affinity for the next cache
> levels.

Alexander,

I know that you are working on improving this patch and we have already
discussed some ideas via out-of-band channels.  Here are some additional
ideas, in part inspired by inspecting the OpenSolaris code.

Let's assume that one of the goals of a scheduler is to maximize system
performance / computational throughput[*].  I think that modern SMP-aware
schedulers try to employ the following two SMP-specific techniques to
achieve that:
- take advantage of thread-to-cache affinity to minimize "cold cache" time;
- distribute the threads over logical CPUs to optimize system resource
usage by minimizing[**] sharing of / contention over the resources, which
could be caches, instruction pipelines (for HTT threads), FPUs (for AMD
Bulldozer "cores"), etc.

1. Affinity.  It seems that on modern CPUs the caches are either inclusive
or some smart "as if inclusive" caches.  As a result, if two cores share a
cache at any level, then it should be relatively cheap to move a thread
from one core to the other.  E.g. if logical CPUs P0 and P1 have private L1
and L2 caches and a shared L3 cache, then on modern processors it should be
much cheaper to move a thread from P0 to P1 than to some processor P2 that
doesn't share the L3 cache.  If this assumption is really true, then we
only need to track a thread's affinity with relation to the top-level
shared cache.  E.g. if migration within an L3 cache is cheap, then we don't
have any reason to constrain the migration scope to an L2 cache, let alone
L1.
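
To make that a bit more concrete, here is a small userland toy model (not
sched_ule.c code; the structures and names are made up purely for
illustration).  It treats a thread as "warm" on every CPU that shares the
last-level cache with the CPU it last ran on, instead of tracking affinity
per cache level:

#include <stdio.h>

/* Toy topology: each logical CPU belongs to one last-level cache domain. */
#define NCPUS 8
static const int cpu_llc[NCPUS] = { 0, 0, 0, 0, 1, 1, 1, 1 };

/* Toy thread state: last CPU it ran on and when its affinity expires. */
struct toy_thread {
	int	last_cpu;
	int	affinity_until;	/* tick number until which affinity is valid */
};

/*
 * If moving a thread anywhere within a shared last-level cache is cheap,
 * the only affinity worth tracking is "does this CPU share the LLC with
 * last_cpu and is the affinity still fresh?" -- not per-L1/L2 affinity.
 */
static int
cpu_is_warm(const struct toy_thread *td, int cpu, int now)
{
	return (now <= td->affinity_until &&
	    cpu_llc[cpu] == cpu_llc[td->last_cpu]);
}

int
main(void)
{
	struct toy_thread td = { .last_cpu = 2, .affinity_until = 100 };
	int cpu;

	for (cpu = 0; cpu < NCPUS; cpu++)
		printf("cpu%d: %s\n", cpu,
		    cpu_is_warm(&td, cpu, 50) ? "warm (same LLC)" : "cold");
	return (0);
}

With a definition like that, the CPU selection code would only have to
answer one question: is the candidate CPU inside the thread's last
top-level cache group or not.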
2. Balancing.  I think that the current balancing code is pretty good, but
it can be augmented with the following:

A. In the longer term the SMP topology should include other important
shared resources, not only caches.  We already have this in some form via
CG_FLAG_THREAD, which implies instruction pipeline sharing.

B. Given the affinity assumptions, sched_pickcpu() can pick the best CPU
only among the CPUs sharing a top-level cache if a thread still has an
affinity to it, or among all CPUs otherwise.  This should reduce temporary
imbalances.

C. I think that we should eliminate the bias in the sched_lowest() family
of functions.  I like how your patch started addressing this.  For the
cases where the hint (cg_prefer) cannot be reasonably picked it should be a
pseudo-random value.  OpenSolaris does it the following way:
http://fxr.watson.org/fxr/ident?v=OPENSOLARIS;im=10;i=CPU_PSEUDO_RANDOM
(See the postscript below for a toy sketch of this tie-breaking.)

Footnotes:
[*] The goals of a scheduler could be controlled via policies.  E.g. there
could be a policy to reduce power usage.
[**] Given the possibility of different policies, a scheduler may instead
want to concentrate threads.  E.g. if a system has two packages with two
cores each and there are two CPU-hungry threads, then the system may place
them both on the same package to reduce power usage.  Another interesting
case is threads that share a VM space or otherwise share a non-trivial
amount of memory.  As you have suggested, it might make sense to
concentrate those threads so that they share a cache.

-- 
Andriy Gapon
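
P.S.  To illustrate point 2.C, here is a tiny userland toy model of the
tie-breaking problem (again, made-up names; this is neither the actual
sched_lowest() nor the OpenSolaris code).  When several CPUs are equally
loaded, always scanning from CPU 0 funnels every tie to the lowest-numbered
CPU; starting the scan at a rotating offset spreads the ties around:

#include <stdio.h>

#define NCPUS 4

/*
 * Pick the least loaded CPU, scanning from 'start' so that ties are not
 * always resolved in favor of the lowest-numbered CPU.
 */
static int
pick_lowest(const int load[NCPUS], int start)
{
	int best = start % NCPUS;
	int i, cpu;

	for (i = 1; i < NCPUS; i++) {
		cpu = (start + i) % NCPUS;
		if (load[cpu] < load[best])
			best = cpu;
	}
	return (best);
}

int
main(void)
{
	int load[NCPUS] = { 0, 0, 0, 0 };	/* all CPUs equally idle */
	unsigned rotor = 0;			/* stand-in for a pseudo-random hint */
	int i;

	/* Biased: every pick lands on CPU 0. */
	for (i = 0; i < 4; i++)
		printf("biased pick: cpu%d\n", pick_lowest(load, 0));

	/* Rotating hint: ties are spread over all CPUs. */
	for (i = 0; i < 4; i++)
		printf("rotor pick:  cpu%d\n", pick_lowest(load, rotor++));
	return (0);
}

If I read the OpenSolaris code correctly, CPU_PSEUDO_RANDOM amounts to such
a cheap rotating per-CPU value used exactly for this kind of hint.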