From owner-freebsd-hackers@FreeBSD.ORG Mon Feb 6 19:18:33 2012
From: Alexander Motin <mavbsd@gmail.com>
Date: Mon, 06 Feb 2012 21:18:28 +0200
To: Florian Smeets
Cc: freebsd-hackers@FreeBSD.org, davidxu@FreeBSD.org
Subject: Re: [RFT][patch] Scheduling for HTT and not only
Message-ID: <4F302784.3090607@FreeBSD.org>
In-Reply-To: <4F302510.70106@FreeBSD.org>

On 02/06/12 21:08, Florian Smeets wrote:
> On 06.02.12 08:59, David Xu wrote:
>> On 2012/2/6 15:44, Alexander Motin wrote:
>>> On 06.02.2012 09:40, David Xu wrote:
>>>> On 2012/2/6 15:04, Alexander Motin wrote:
>>>>> Hi.
>>>>>
>>>>> I've analyzed the scheduler behavior and think I have found the
>>>>> problem with HTT. SCHED_ULE knows about HTT, and when doing load
>>>>> balancing once a second it does the right things. Unluckily, if some
>>>>> other thread gets in the way, a process can easily be pushed out to
>>>>> another CPU, where it will stay for another second because of CPU
>>>>> affinity, possibly sharing a physical core with something else
>>>>> without need.
>>>>>
>>>>> I've made a patch, reworking the SCHED_ULE affinity code, to fix that:
>>>>> http://people.freebsd.org/~mav/sched.htt.patch
>>>>>
>>>>> This patch does three things:
>>>>> - Disables the strict affinity optimization when HTT is detected, to
>>>>> let the more sophisticated code take into account the load of the
>>>>> other logical core(s).
>>>>
>>>> Yes, the HTT level should first be skipped, looking up in the upper
>>>> layer to find a more idle physical core. At least, if the system is a
>>>> dual-core, 4-thread CPU and there are two busy threads, they should be
>>>> run on different physical cores.
>>>>
>>>>> - Adds affinity support to the sched_lowest() function to prefer the
>>>>> specified (last used) CPU (and the CPU groups it belongs to) in case
>>>>> of equal load. The previous code always selected the first valid CPU
>>>>> among equals, which caused threads to migrate to lower-numbered CPUs
>>>>> without need.
>>>>
>>>> Even some level of imbalance can be tolerated, until it exceeds a
>>>> threshold; this at least does not thrash other CPUs' caches, while
>>>> pushing a new thread to another CPU thrashes its cache. The CPUs and
>>>> groups can be arranged in a circular list, so that searching for the
>>>> lowest-load CPU always starts from the right neighbor and goes to the
>>>> tail, then wraps from the head back to the left neighbor.
>>>>
>>>>> - If the current CPU group has no CPU where the process with its
>>>>> priority can run now, sequentially check the parent CPU groups before
>>>>> doing a global search. That should improve affinity for the next
>>>>> cache levels.
>>>>>
>>>>> I've run several different benchmarks to test it, and so far the
>>>>> results look promising:
>>>>> - On an Atom D525 (2 physical cores + HTT) I've tested HTTP receive
>>>>> with fetch and FTP transmit with ftpd. On receive I've got 103MB/s on
>>>>> the interface; on transmit somewhat less -- about 85MB/s. In both
>>>>> cases the scheduler kept the interrupt thread and the application on
>>>>> different physical cores. Without the patch, speed fluctuated between
>>>>> about 80 and 103MB/s on receive and was about 85MB/s on transmit.
>>>>> - On the same Atom I've tested TCP speed with iperf and got mostly
>>>>> the same results:
>>>>> - receive to Atom with patch -- 755-765Mbit/s, without patch --
>>>>> 531-765Mbit/s;
>>>>> - transmit from Atom in both cases 679Mbit/s.
>>>>> The fluctuating receive behavior in both tests can, I think, be
>>>>> explained by some heavy callout handled by the swi4:clock process,
>>>>> called on receive (seen in top and schedgraph), but not on transmit.
>>>>> It may be specific to the Realtek NIC driver.
>>>>>
>>>>> - On the same Atom I've tested 512-byte reads from an SSD with dd in
>>>>> 1 and 32 streams. I found no regressions, but also no benefits, as
>>>>> with one stream there is no congestion and with multiple streams all
>>>>> cores are congested.
>>>>>
>>>>> - On a Core i7-2600K (4 physical cores + HTT) I've run more than 20
>>>>> `make buildworld`s with different -j values (1,2,4,6,8,12,16) for
>>>>> both the original and the patched kernel. I've found no performance
>>>>> regressions, while for -j4 I've got a 10% improvement:
>>>>> # ministat -w 65 res4A res4B
>>>>> x res4A
>>>>> + res4B
>>>>> [ministat distribution plot elided]
>>>>>     N           Min           Max        Median           Avg        Stddev
>>>>> x   3       1554.86       1617.43       1571.62     1581.3033     32.389449
>>>>> +   3       1420.69        1423.1       1421.36     1421.7167     1.2439587
>>>>> Difference at 95.0% confidence
>>>>>         -159.587 ± 51.9496
>>>>>         -10.0921% ± 3.28524%
>>>>>         (Student's t, pooled s = 22.9197)
>>>>> and for -j6 a 3.6% improvement:
>>>>> # ministat -w 65 res6A res6B
>>>>> x res6A
>>>>> + res6B
>>>>> [ministat distribution plot elided]
>>>>>     N           Min           Max        Median           Avg        Stddev
>>>>> x   3       1381.17       1402.94        1400.3     1394.8033     11.880372
>>>>> +   3        1340.4       1349.34       1341.23     1343.6567     4.9393758
>>>>> Difference at 95.0% confidence
>>>>>         -51.1467 ± 20.6211
>>>>>         -3.66694% ± 1.47842%
>>>>>         (Student's t, pooled s = 9.09782)
>>>>>
>>>>> Who wants to do independent testing to verify my results or do some
>>>>> more interesting benchmarks? :)
>>>>>
>>>>> PS: Sponsored by iXsystems, Inc.
>>>>>
>>>> The benchmark is incomplete; a complete benchmark should at least
>>>> include CPU-intensive applications. Testing with real-world databases,
>>>> web servers and other important applications is also needed.
>>>
>>> I plan to do this, but you may help. ;)
>>>
>> Thanks, I need to find time. I have cc'ed hackers@; my first mail seems
>> to have forgotten to include it. I think designing an SMP scheduler is
>> dirty work: many tests and refinements, and still you may get an
>> imperfect result. ;-)
>>
> Here are my tests for PostgreSQL (I still use r229659, as the baseline
> was taken with that revision). This is on a 2x4-core box, no HTT. Max
> throughput is at 10 threads, so that is what I used for ministat.
>
> x 229659
> + 229659+mav-ule
> [ministat distribution plot elided]
>     N           Min           Max        Median           Avg        Stddev
> x  10     49647.932     50376.405     50194.668     50093.065     240.47236
> +  10     49482.234     50359.181     50159.422     49936.298     341.25592
> No difference proven at 95.0% confidence
>
> All the numbers are here:
> https://docs.google.com/spreadsheet/ccc?key=0Ai0N1xDe3uNAdDRxcVFiYjNMSnJWOTZhUWVWWlBlemc&hl=en_US#gid=4
>
> I'll update the pbzip2 tab in the document later today.

I'm sorry, but I think you can put this on pause for a moment. After some
tests with MySQL (where I've found a 3% regression), new feedback and more
thinking, I have a wish to rewrite the patch. I'll probably send a new one
to test in the next few days. Thank you for your help.
-- 
Alexander Motin