From: Alexander Motin
Date: Tue, 14 Feb 2012 00:38:34 +0200
To: Jeff Roberson
Cc: freebsd-hackers@FreeBSD.org, Florian Smeets, Andriy Gapon
Subject: Re: [RFT][patch] Scheduling for HTT and not only
Message-ID: <4F3990EA.1080002@FreeBSD.org>

On 13.02.2012 23:39, Jeff Roberson wrote:
> On Mon, 13 Feb 2012, Alexander Motin wrote:
>> On 02/13/12 22:23, Jeff Roberson wrote:
>>> On Mon, 13 Feb 2012, Alexander Motin wrote:
>>>
>>>> On 02/11/12 16:21, Alexander Motin wrote:
>>>>> I've heavily rewritten the patch already, so at least some of the
>>>>> ideas are already addressed. :) At this moment I am mostly satisfied
>>>>> with the results, and after final tests today I'll probably publish
>>>>> the new version.
>>>>
>>>> It took more time, but finally I think I've put the pieces together:
>>>> http://people.freebsd.org/~mav/sched.htt23.patch
>>>
>>> I need some time to read and digest this. However, at first glance, a
>>> global pickcpu lock will not be acceptable. Better to make a rarely
>>> imperfect decision than too often cause contention.
>>
>> In my tests it was the opposite. Imperfect decisions under 60K MySQL
>> requests per second on 8 cores quite often caused two threads to be
>> pushed to one CPU or to one physical core, causing up to 5-10%
>> performance penalties. I've tried it both with and without the lock,
>> and at least on the 8-core machine the difference was significant
>> enough to justify adding it. I understand that this is not good, but I
>> have no machine with hundreds of CPUs to tell how it will work there.
>> For really big systems it could be partitioned somehow, but that would
>> also increase load imbalance.
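
Just to illustrate roughly what is meant by serializing the pick (a toy
user-space sketch, not the actual sched_ule.c code -- the array, the
spin lock and the counts are all made up): the search for the least
loaded CPU and the accounting for the thread being placed happen under
one lock, so the next pick already sees the pending load.

#include <pthread.h>

#define NCPU 8

/* Toy model: one counter of runnable threads per CPU. */
static int cpu_load[NCPU];
static pthread_spinlock_t pick_lock;

static void
pick_cpu_init(void)
{
        pthread_spin_init(&pick_lock, PTHREAD_PROCESS_PRIVATE);
}

static int
pick_cpu_serialized(void)
{
        int cpu, best = 0;

        pthread_spin_lock(&pick_lock);
        for (cpu = 1; cpu < NCPU; cpu++)
                if (cpu_load[cpu] < cpu_load[best])
                        best = cpu;
        cpu_load[best]++;       /* the next picker already sees this thread */
        pthread_spin_unlock(&pick_lock);
        return (best);
}

The obvious cost is that the lock is global, which is exactly the
contention concern raised above.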

> It would be preferable to refetch the load on the target cpu and
> restart the selection if it has changed. Even this should have some
> maximum bound on the number of times it will spin and possibly be
> conditionally enabled. That two cpus are making the same decision
> indicates that the race window is occurring and contention will be
> guaranteed. As you have tested on only 8 cores, that's not a good sign.

The race window exists there by definition, as the code is not locked.
The fact that we are hitting it often may just mean that some other
locks (maybe in the application, maybe ours) cause that synchronization.
Using almost identical requests in the benchmark also did not add
randomness. As for rechecking the load -- that is done, as you may see,
to protect the fast paths, and that works. But there is still a time
window between the check and putting the thread on the queue. The lock
doesn't close it completely, but it significantly reduces the chances.

>>>> The patch is more complicated than the previous one, both logically
>>>> and computationally, but with growing CPU power and complexity I
>>>> think we can possibly spend some more time deciding how to spend
>>>> time. :)
>>>
>>> It is probably worth more cycles, but we need to evaluate this much
>>> more complex algorithm carefully to make sure that each of these new
>>> features provides an advantage.
>>
>> The problem is that doing only half of the things may not give the
>> full picture. Why bother with affinity, trying to save a few percent,
>> while the SMT effect is several times larger? At the same time, too
>> many unknown variables in application behavior can easily make all of
>> this pointless.
>>
>>>> The patch formalizes several ideas of the previous code about how to
>>>> select a CPU for running a thread and adds some new ones. Its main
>>>> idea is that I've moved from comparing raw integer queue lengths to
>>>> higher-resolution flexible values. The additional 8 bits of precision
>>>> allow many factors affecting performance to be taken into account at
>>>> the same time. Besides just choosing the best of equally-loaded CPUs,
>>>> with the new code it may even happen that, because of SMT, cache
>>>> affinity, etc., a CPU with more threads on its queue will be reported
>>>> as less loaded, and vice versa.
>>>>
>>>> The new code takes into account the following factors:
>>>> - SMT sharing penalty.
>>>> - Cache sharing penalty.
>>>> - Cache affinity (with separate coefficients for the last-level and
>>>>   other-level caches) to:
>>>
>>> We already used separate affinity values for different cache levels.
>>> Keep in mind that if something else has run on a core, the cache
>>> affinity is lost in very short order. Trying too hard to preserve it
>>> beyond a few ms never seems to pan out.
>>
>> Previously it was only about a timeout, which was IMHO pointless, as
>> it is impossible to predict when the cache will be purged. It could
>> happen a microsecond or a second later, depending on application
>> behavior.
>
> This was not pointless. Eliminate it and see. The point is that after
> some time has elapsed the cache is almost certainly useless and we
> should select the most appropriate cpu based on load, priority, etc.
> We don't have perfect information for any of these algorithms. But as
> an approximation it is useful to know whether affinity should even be
> considered. An improvement on this would be to look at the amount of
> time the core has been idle since the selecting thread last ran rather
> than just the current load. Tell me what the point of selecting for
> affinity is if so much time has passed that valid cache contents are
> almost guaranteed to be lost?
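
For illustration, one possible reading of that suggestion in code (a toy
sketch -- none of these structures or fields exist in sched_ule.c):
affinity is only worth honoring if the old CPU has done little besides
idling since the thread left it.

#include <stdint.h>

struct toy_cpu {
        uint64_t idle_ticks;    /* total ticks this CPU has spent idle */
};

struct toy_thread {
        uint64_t left_at;       /* tick count when the thread left the CPU */
        uint64_t idle_snap;     /* cpu->idle_ticks sampled at that moment */
};

/*
 * The cache is probably still warm only if the CPU spent almost all of
 * the elapsed time idle, i.e. ran other work for less than 'budget'.
 */
static int
affinity_still_useful(const struct toy_cpu *cpu, const struct toy_thread *td,
    uint64_t now, uint64_t budget)
{
        uint64_t elapsed = now - td->left_at;
        uint64_t idle = cpu->idle_ticks - td->idle_snap;

        return (elapsed - idle < budget);
}

Something along those lines could replace a fixed per-level timeout.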

I am not saying I am going to keep affinity forever. You may see that I
also set a limit on the affinity time. What is IMHO pointless is trying
to set expiration times of 1/2/3 ms for the L1/L2/L3 caches -- those
numbers don't mean anything real. What I was saying is that I've
differentiated the affinity _force_ for the last-level cache and the
closer levels. By "last" I don't mean L2 or L3 specifically, but the
last one before the bus/memory, the one that links several cores
together.

>>>> - the other running threads of its process,
>>>
>>> This is not really a great indicator of whether things should be
>>> scheduled together or not. What workload are you targeting here?
>>
>> When several threads are accessing/modifying the same shared memory,
>> like MySQL server threads. I've noticed that on an Atom CPU with no L3
>> it is cheaper to move two threads to one physical core to share the
>> cache than to handle coherency over the memory bus.
>
> It can definitely be cheaper. But there are an equal number of cases
> where it will be more expensive. Some applications will have a lot of
> contention and shared state and these will want to be co-located.
> Others will simply want to get as much cache and cpu time as they can.
> There are a number of papers that have been published on determining
> which is which based on cpu performance counters. I believe Sun does
> this in particular. Another option, which Apple has pursued, is to give
> the application the option to mark threads as wanting to be close
> together or far away.
>
> I think the particular heuristic you have here is too expensive and
> specific to go in. The potential negative consequences are very big.
> If you want to pursue Apple's or Sun's approach to this problem I would
> be interested in that.

I completely agree that using performance counters here is overkill.
But if applications could somehow tell us how they would like to be
scheduled, that would be much more realistic.

>>>> - the previous CPU where it was running,
>>>> - the current CPU (usually the one it was called from).
>>>
>>> These two were also already used. Additionally:
>>>
>>> + * Hide part of the current thread
>>> + * load, hoping it or the scheduled
>>> + * one complete soon.
>>> + * XXX: We need more stats for this.
>>>
>>> I had something like this before. Unfortunately interactive tasks are
>>> allowed fairly aggressive bursts of cpu to account for things like
>>> xorg and web browsers. Also, I tried this for ithreads, but they can
>>> be very expensive in some workloads, so other cpus will idle as you
>>> try to schedule behind an ithread.
>>
>> As I have noted, this needs more precise statistics about thread
>> behavior. The present sampled statistics are almost useless there.
>> The existing code always prefers to run a thread on the current CPU if
>> there is no other CPU with no load. That logic works very well when
>> 8 MySQL threads and 8 clients are working on 8 CPUs, but not so well
>> in other situations.
>
> You're speaking of the stathz-based accounting? Or do you want more
> precise stats about other things? We've talked for years about
> event-based accounting rather than sampling, but no one has implemented
> it. Please go ahead if you would like. Keep in mind that cores can
> change frequency and tsc values may not be stable.

Yes, I am speaking about thread run/idle times, which could be improved
with event-based accounting.
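
As a rough illustration of what event-based accounting would mean here
(a user-space sketch; CLOCK_MONOTONIC only stands in for whatever stable
per-CPU clock the kernel would really need, given the frequency/TSC
concerns above): run time is charged from context-switch timestamps
instead of being sampled at stathz.

#include <stdint.h>
#include <time.h>

struct toy_td_stats {
        uint64_t switched_in;   /* timestamp of the last switch-in, ns */
        uint64_t total_run;     /* accumulated run time, ns */
};

static uint64_t
now_ns(void)
{
        struct timespec ts;

        clock_gettime(CLOCK_MONOTONIC, &ts);
        return ((uint64_t)ts.tv_sec * 1000000000ULL + ts.tv_nsec);
}

/* Called when the thread is put on a CPU. */
static void
account_switch_in(struct toy_td_stats *st)
{
        st->switched_in = now_ns();
}

/* Called when the thread leaves the CPU: charge the exact run time. */
static void
account_switch_out(struct toy_td_stats *st)
{
        st->total_run += now_ns() - st->switched_in;
}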

> However, even with perfect stats, I'm not sure whether ignoring the
> current load will be the right thing. I had some changes that took the
> interactivity score into account to do this. If it is very, very low,
> then maybe it makes sense.

That is what I was talking about. I was trying to avoid the case where
very short events push some other heavy thread into a very
uncomfortable position, where it will stay for a long time, while they
themselves complete in a moment.

>>>> All of these factors are configurable via sysctls, but I think
>>>> reasonable defaults should fit most cases.
>>>>
>>>> Also, compared to the previous patch, I've resurrected the optimized
>>>> shortcut in CPU selection for the SMT case. Unlike the original
>>>> code, which had problems with this, I've added a check of the other
>>>> logical cores' load that should make it safe and still very fast
>>>> when there are fewer running threads than physical cores.
>>>>
>>>> I've tested it on Core i7 and Atom systems, but it would be more
>>>> interesting to test it on a multi-socket system with properly
>>>> detected topology to check the benefits from affinity.
>>>>
>>>> At this moment the main issue I see is that this patch affects only
>>>> the moment when a thread is started. If a thread runs continuously,
>>>> it will stay where it was, even if, due to a change in the
>>>> situation, that is no longer very effective (it causes SMT sharing,
>>>> etc.). I haven't looked much at the periodic load balancer yet, but
>>>> probably it could also be improved somehow.
>>>>
>>>> What is your opinion: is it too over-engineered, or is it the right
>>>> way to go?
>>>
>>> I think it's a little too much change all at once. I also believe
>>> that the changes that try very hard to preserve affinity likely help
>>> a much smaller number of cases than they hurt. I would prefer you do
>>> one piece at a time and validate each step. There are a lot of good
>>> ideas in here, but good ideas don't always turn into results.
>>
>> When each of these small steps can change everything and they are
>> related, the number of combinations to test grows rapidly. I am not
>> going to commit this tomorrow. It is more of a concept that needs
>> testing and evaluation.
>
> I say this because I have tried nearly all of these heuristics in
> different forms. I don't object to the general idea of using a
> weighted score to select the target cpu. However, I do think several
> of these heuristics are problematic. While the current algorithm is
> far from perfect, it is the product of an incredible amount of testing
> and experimentation. Significant changes to it are going to require an
> equal amount of effort to characterize and verify. And I do believe
> many pieces can be broken down and tested independently. For example,
> whether to ignore interactive load on the core, or whether to lock
> pickcpu, etc. can all easily be independently tested in a number of
> workloads.
>
> Do you intend to clean up and commit your last, simpler patch? I have
> no objections to that, and it simply fixes a bias in the load selection
> algorithm that shouldn't have existed.

I don't see much point in committing them sequentially, as they are
quite orthogonal. I need to make one decision. I am going on a small
vacation next week; that will give the thoughts time to settle. Maybe I
will indeed just clean up the previous patch a bit and commit it when I
get back. I've spent too much time trying to make these things formal,
and so far the results are not bad, but also not as brilliant as I
would like.
Maybe it is indeed time to step back and try a simpler solution.

-- 
Alexander Motin