Date:      Mon, 13 Feb 2012 22:55:24 +0200
From:      Alexander Motin <mav@FreeBSD.org>
To:        Jeff Roberson <jroberson@jroberson.net>
Cc:        freebsd-hackers@FreeBSD.org, Florian Smeets <flo@FreeBSD.org>, Andriy Gapon <avg@FreeBSD.org>
Subject:   Re: [RFT][patch] Scheduling for HTT and not only
Message-ID:  <4F3978BC.6090608@FreeBSD.org>
In-Reply-To: <alpine.BSF.2.00.1202131012270.2020@desktop>
References:  <4F2F7B7F.40508@FreeBSD.org> <4F366E8F.9060207@FreeBSD.org> <4F367965.6000602@FreeBSD.org> <4F396B24.5090602@FreeBSD.org> <alpine.BSF.2.00.1202131012270.2020@desktop>

On 02/13/12 22:23, Jeff Roberson wrote:
> On Mon, 13 Feb 2012, Alexander Motin wrote:
>
>> On 02/11/12 16:21, Alexander Motin wrote:
>>> I've heavily rewritten the patch already, so at least some of the ideas
>>> are already addressed. :) At this moment I am mostly satisfied with the
>>> results, and after final tests today I'll probably publish a new version.
>>
>> It took more time, but finally I think I've put the pieces together:
>> http://people.freebsd.org/~mav/sched.htt23.patch
>
> I need some time to read and digest this. However, at first glance, a
> global pickcpu lock will not be acceptable. Better to make a rarely
> imperfect decision than too often cause contention.

In my tests it was the opposite. At 60K MySQL requests per second on 8 
cores, imperfect decisions quite often pushed two threads onto one CPU 
or onto one physical core, costing up to 5-10% in performance. I've 
tried it both with and without the lock, and at least on an 8-core 
machine the difference was significant enough to justify adding it. I 
understand that this is not good, but I have no machine with hundreds 
of CPUs to tell how it will work there. For really big systems the lock 
could be partitioned somehow, but that would also increase load 
imbalance.
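
For illustration only, here is a minimal userland sketch of what such 
partitioning could look like: one pick-CPU lock per last-level-cache 
domain instead of a single global one. The names and the granularity 
are hypothetical, not taken from the patch.

#include <pthread.h>

#define PICKCPU_DOMAINS 8       /* e.g. one lock per last-level cache */

static pthread_mutex_t pickcpu_locks[PICKCPU_DOMAINS];

static void
pickcpu_locks_init(void)
{
        int i;

        for (i = 0; i < PICKCPU_DOMAINS; i++)
                pthread_mutex_init(&pickcpu_locks[i], NULL);
}

/*
 * Serialize CPU selection only among threads that target the same
 * topology domain, so unrelated domains do not contend on one lock.
 */
static void
pickcpu_enter(int domain)
{
        pthread_mutex_lock(&pickcpu_locks[domain % PICKCPU_DOMAINS]);
}

static void
pickcpu_exit(int domain)
{
        pthread_mutex_unlock(&pickcpu_locks[domain % PICKCPU_DOMAINS]);
}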

>> The patch is more complicated than the previous one, both logically and
>> computationally, but with growing CPU power and complexity I think we
>> can possibly spend some more time deciding how to spend time. :)
>
> It is probably worth more cycles but we need to evaluate this much more
> complex algorithm carefully to make sure that each of these new features
> provides an advantage.

The problem is that doing only half of these things may not give the 
full picture. What is the point of tuning affinity to save a few 
percent while the SMT effect is several times larger? At the same time, 
too many unknown variables in application behavior can easily make all 
of this pointless.

>> The patch formalizes several ideas of the previous code about how to
>> select a CPU for running a thread and adds some new ones. Its main idea
>> is that I've moved from comparing raw integer queue lengths to
>> higher-resolution flexible values. That additional 8-bit precision
>> allows taking many factors affecting performance into account at the
>> same time. Besides just choosing the best of equally-loaded CPUs, with
>> the new code it may even happen that, because of SMT, cache affinity,
>> etc., a CPU with more threads on its queue will be reported as less
>> loaded, and vice versa.
>>
>> New code takes into account such factors:
>> - SMT sharing penalty.
>> - Cache sharing penalty.
>> - Cache affinity (with separate coefficients for last-level and other
>> level caches) to the:
>
> We already used separate affinity values for different cache levels.
> Keep in mind that if something else has run on a core the cache affinity
> is lost in very short order. Trying too hard to preserve it beyond a few
> ms never seems to pan out.

Previously it was only a timeout, which was IMHO pointless, as it is 
impossible to predict when the cache will be purged. That may happen a 
microsecond or a second later, depending on application behavior.
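
Roughly, the timeout approach amounts to the check below. This is only 
a simplified illustration of the idea, not the actual sched_ule.c code, 
and the names are made up.

/*
 * Treat the old CPU's cache as still warm only if the thread ran there
 * within the last 'affinity_ms' milliseconds.  The real cache contents
 * may be gone long before that cutoff, or still present long after it.
 */
static int
cache_still_warm(unsigned int now_ticks, unsigned int last_run_ticks,
    unsigned int hz, unsigned int affinity_ms)
{

        return (now_ticks - last_run_ticks < affinity_ms * hz / 1000);
}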

>> - other running threads of its process,
>
> This is not really a great indicator of whether things should be
> scheduled together or not. What workload are you targeting here?

The case when several threads access/modify the same shared memory, 
like MySQL server threads. I've noticed that on an Atom CPU with no L3 
it is cheaper to move two such threads onto one physical core to share 
the cache than to handle coherency over the memory bus.
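
As a rough sketch of that heuristic (hypothetical names, not code from 
the patch): give a CPU a small "discount" on its load estimate when 
another runnable thread of the same process already sits on a CPU that 
shares a cache with it, so the comparison can favor co-scheduling.

struct cpu_est {
        int     load;           /* base load, in 1/256 units of a thread */
        int     shares_cache;   /* 1 if a cache is shared with the sibling */
};

#define SIBLING_BONUS   32      /* 32/256 = 1/8 of a thread, tunable */

/*
 * Bias the per-CPU load estimate downward when a sibling thread of the
 * same process already runs on a CPU sharing a cache, so shared data
 * stays in that cache instead of bouncing over the memory bus.
 */
static int
effective_load(const struct cpu_est *c, int sibling_running)
{
        int load = c->load;

        if (sibling_running && c->shares_cache)
                load -= SIBLING_BONUS;
        return (load);
}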

>> - previous CPU where it was running,
>> - current CPU (usually where it was called from).
>
> These two were also already used. Additionally:
>
> + * Hide part of the current thread
> + * load, hoping it or the scheduled
> + * one complete soon.
> + * XXX: We need more stats for this.
>
> I had something like this before. Unfortunately interactive tasks are
> allowed fairly aggressive bursts of cpu to account for things like xorg
> and web browsers. Also, I tried this for ithreads but they can be very
> expensive in some workloads so other cpus will idle as you try to
> schedule behind an ithread.

As I have noted, this needs more precise statistics about thread 
behavior; the present sampled statistics are almost useless there. The 
existing code always prefers to run a thread on the current CPU if 
there is no other CPU with no load. That logic works very well when 8 
MySQL threads and 8 clients are working on 8 CPUs, but not so well in 
other situations.
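
In simplified form, the shortcut described above behaves roughly like 
this (illustrative only, not the real sched_pickcpu()):

/*
 * If no CPU is completely idle, keep the woken thread on the CPU it is
 * being scheduled from.  With 8 busy MySQL threads on 8 CPUs this is
 * exactly right; with other mixes it can pile work onto one CPU.
 */
static int
pick_cpu_simple(int curcpu, const int *cpu_load, int ncpus)
{
        int i;

        for (i = 0; i < ncpus; i++)
                if (cpu_load[i] == 0)
                        return (i);     /* a fully idle CPU wins */
        return (curcpu);                /* otherwise stay on the current CPU */
}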

>> All of these factors are configurable via sysctls, but I think
>> reasonable defaults should fit most cases.
>>
>> Also, compared to the previous patch, I've resurrected the optimized
>> shortcut in CPU selection for the SMT case. Compared to the original
>> code, which had problems with this, I've added a check of the other
>> logical cores' load that should make it safe and still very fast when
>> there are fewer running threads than physical cores.
>>
>> I've tested it on Core i7 and Atom systems, but it would be more
>> interesting to test it on a multi-socket system with properly detected
>> topology to check the benefits from affinity.
>>
>> At this moment the main issue I see is that this patch affects only
>> the moment when a thread is started. If a thread runs continuously, it
>> will stay where it was, even if, due to a change of situation, that is
>> no longer very effective (it causes SMT sharing, etc.). I haven't
>> looked much at the periodic load balancer yet, but probably it could
>> also be improved somehow.
>>
>> What is your opinion: is it too over-engineered, or is it the right
>> way to go?
>
> I think it's a little too much change all at once. I also believe that
> the changes that try very hard to preserve affinity likely help a much
> smaller number of cases than they hurt. I would prefer you do one piece
> at a time and validate each step. There are a lot of good ideas in here
> but good ideas don't always turn into results.

When each of these small steps can change everything and they are all 
related, the number of combinations to test grows rapidly. I am not 
going to commit this tomorrow. It is more of a concept that needs 
testing and evaluation.

-- 
Alexander Motin


