Date:      Mon, 06 Feb 2012 20:08:00 +0100
From:      Florian Smeets <flo@FreeBSD.org>
To:        davidxu@FreeBSD.org
Cc:        freebsd-hackers@FreeBSD.org, Alexander Motin <mav@FreeBSD.org>, David Xu <listlog2011@gmail.com>
Subject:   Re: [RFT][patch] Scheduling for HTT and not only
Message-ID:  <4F302510.70106@FreeBSD.org>
In-Reply-To: <4F2F886F.1070706@gmail.com>
References:  <4F2F7B7F.40508@FreeBSD.org> <4F2F8405.2040103@gmail.com> <4F2F84E3.60809@FreeBSD.org> <4F2F886F.1070706@gmail.com>

On 06.02.12 08:59, David Xu wrote:
> On 2012/2/6 15:44, Alexander Motin wrote:
>> On 06.02.2012 09:40, David Xu wrote:
>>> On 2012/2/6 15:04, Alexander Motin wrote:
>>>> Hi.
>>>>
>>>> I've analyzed scheduler behavior and think I found the problem with HTT.
>>>> SCHED_ULE knows about HTT and when doing load balancing once a second,
>>>> it does the right things. Unluckily, if some other thread gets in the way,
>>>> a process can easily be pushed out to another CPU, where it will stay
>>>> for another second because of CPU affinity, possibly sharing a physical
>>>> core with something else without need.
>>>>
>>>> I've made a patch, reworking SCHED_ULE affinity code, to fix that:
>>>> http://people.freebsd.org/~mav/sched.htt.patch
>>>>
>>>> This patch does three things:
>>>> - Disables the strict affinity optimization when HTT is detected, to let
>>>> more sophisticated code take into account the load of the other logical
>>>> core(s).
>>> Yes, the HTT level should be skipped first, looking in the upper layer to
>>> find a more idle physical core. At least, if the system is a dual-core,
>>> 4-thread CPU and there are two busy threads, they should run on different
>>> physical cores.
>>>
>>>> - Adds affinity support to the sched_lowest() function to prefer the
>>>> specified (last used) CPU (and the CPU groups it belongs to) in case of
>>>> equal load. The previous code always selected the first valid CPU among
>>>> equals. That caused thread migration to lower CPUs without need.
>>>
>>> Even some level of imbalance can be tolerated until it exceeds a threshold;
>>> that at least does not trash another CPU's cache, whereas pushing a new
>>> thread to another CPU trashes its cache. The CPUs and groups can be
>>> arranged in a circular list, so the search for the lowest-loaded CPU always
>>> starts at the right neighbor and runs to the tail, then wraps from the head
>>> to the left neighbor.
>>>
>>>> - If the current CPU group has no CPU where the process with its priority
>>>> can run now, sequentially check parent CPU groups before doing a global
>>>> search. That should improve affinity for the next cache levels.
>>>>
>>>> I've run several different benchmarks to test it, and so far the results
>>>> look promising:
>>>> - On Atom D525 (2 physical cores + HTT) I've tested HTTP receive with
>>>> fetch and FTP transmit with ftpd. On receive I got 103MB/s on the
>>>> interface; on transmit somewhat less -- about 85MB/s. In both cases the
>>>> scheduler kept the interrupt thread and the application on different
>>>> physical cores. Without the patch, speed fluctuates between about 80 and
>>>> 103MB/s on receive and is about 85MB/s on transmit.
>>>> - On the same Atom I've tested TCP speed with iperf and got mostly the
>>>> same results:
>>>> - receive to Atom with patch -- 755-765Mbit/s, without patch --
>>>> 531-765Mbit/s.
>>>> - transmit from Atom in both cases 679Mbit/s.
>>>> The fluctuating receive behavior in both tests can, I think, be explained
>>>> by some heavy callout handled by the swi4:clock process, called on
>>>> receive (seen in top and schedgraph), but not on transmit. Maybe it
>>>> is specific to the Realtek NIC driver.
>>>>
>>>> - On the same Atom I tested the number of 512-byte reads from an SSD with
>>>> dd in 1 and 32 streams. Found no regressions, but no benefits either, as
>>>> with one stream there is no congestion and with multiple streams all cores
>>>> are congested.
>>>>
>>>> - On Core i7-2600K (4 physical cores + HTT) I've run more than 20
>>>> `make buildworld`s with different -j values (1,2,4,6,8,12,16) for both
>>>> the original and the patched kernel. I've found no performance regressions,
>>>> while for -j4 I've got a 10% improvement:
>>>> # ministat -w 65 res4A res4B
>>>> x res4A
>>>> + res4B
>>>> +-----------------------------------------------------------------+
>>>> |+ |
>>>> |++ x x x|
>>>> |A| |______M__A__________| |
>>>> +-----------------------------------------------------------------+
>>>>     N           Min           Max        Median           Avg        Stddev
>>>> x   3       1554.86       1617.43       1571.62     1581.3033     32.389449
>>>> +   3       1420.69        1423.1       1421.36     1421.7167     1.2439587
>>>> Difference at 95.0% confidence
>>>>         -159.587 ± 51.9496
>>>>         -10.0921% ± 3.28524%
>>>>         (Student's t, pooled s = 22.9197)
>>>> , and for -j6 -- 3.6% improvement:
>>>> # ministat -w 65 res6A res6B
>>>> x res6A
>>>> + res6B
>>>> +-----------------------------------------------------------------+
>>>> | + |
>>>> | + + x x x |
>>>> ||_M__A___| |__________A____M_____||
>>>> +-----------------------------------------------------------------+
>>>>     N           Min           Max        Median           Avg        Stddev
>>>> x   3       1381.17       1402.94        1400.3     1394.8033     11.880372
>>>> +   3        1340.4       1349.34       1341.23     1343.6567     4.9393758
>>>> Difference at 95.0% confidence
>>>>         -51.1467 ± 20.6211
>>>>         -3.66694% ± 1.47842%
>>>>         (Student's t, pooled s = 9.09782)
>>>>
>>>> Who wants to do independent testing to verify my results or do some
>>>> more interesting benchmarks? :)
>>>>
>>>> PS: Sponsored by iXsystems, Inc.
>>>>
>>> The benchmark is incomplete; a complete benchmark should at least
>>> include CPU-intensive applications.
>>> Testing with real-world databases, web servers, and other important
>>> applications is needed.
>>
>> I plan to do this, but you may help. ;)
>>
> Thanks, I need to find time. I have cc'ed hackers@; my first mail seems to
> have left it out. I think designing an SMP scheduler is dirty work:
> many tests and refinements, and still you may get an imperfect result. ;-)
>
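
For readers following the quoted patch description, the tie-breaking idea in the second bullet (prefer the last used CPU when loads are equal, instead of always taking the first CPU among equals) can be sketched as a toy in Python. This is only an illustration under simplifying assumptions: the function names and the flat list of per-CPU loads are hypothetical, the CPU group hierarchy is ignored, and none of this is the actual sched_lowest() code.

```python
def pick_cpu_first(loads):
    """Behaviour described as the old one: first CPU among equally loaded ones."""
    return loads.index(min(loads))


def pick_cpu(loads, last_cpu):
    """Least-loaded CPU, but stay on last_cpu when it ties for the lowest load."""
    lowest = min(loads)
    if loads[last_cpu] == lowest:
        return last_cpu           # tie: no migration needed
    return loads.index(lowest)    # otherwise fall back to the first least-loaded CPU


# Hypothetical loads for a 4-CPU box; the thread last ran on CPU 2.
loads = [1, 0, 0, 0]
print(pick_cpu_first(loads), pick_cpu(loads, last_cpu=2))
```

With equal load on CPUs 1 to 3, the first variant moves the thread to CPU 1 while the second leaves it on CPU 2, which is the needless migration to lower CPUs that the patch description mentions.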

Here are my tests for PostgreSQL (I still use r229659, as the baseline
was taken with that revision). This is on a 2x4 core box without HTT. Max
throughput is at 10 threads, so that is what I used for ministat.

x 229659
+ 229659+mav-ule
+---------------------------------------------------------------------+
|                                                        +       x    |
|+  +     +   *                 x+xx          x     +  x + +x    x  +x|
|         |__________________|______A__________A____M__M_____|____|   |
+---------------------------------------------------------------------+
    N           Min           Max        Median           Avg        Stddev
x  10     49647.932     50376.405     50194.668     50093.065     240.47236
+  10     49482.234     50359.181     50159.422     49936.298     341.25592
No difference proven at 95.0% confidence
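
For anyone who wants to check the ministat verdicts by hand, here is a minimal Python sketch of a pooled-variance Student's t comparison on the summary rows (N, Avg, Stddev). It assumes that this is what the "Student's t, pooled s" line in ministat's output refers to; the function name is made up for illustration, and the two critical values are taken from a standard t-table rather than from ministat's source.

```python
import math

# Two-sided 95% critical values of Student's t from a standard t-table,
# keyed by degrees of freedom (n1 + n2 - 2). Only the cases in this thread.
T_CRIT_95 = {4: 2.776, 18: 2.101}


def compare(n1, avg1, sd1, n2, avg2, sd2):
    """Pooled-variance t comparison on summary statistics.

    Returns (difference, 95% margin, pooled s, significant?).
    """
    df = n1 + n2 - 2
    pooled = math.sqrt(((n1 - 1) * sd1 ** 2 + (n2 - 1) * sd2 ** 2) / df)
    diff = avg2 - avg1
    margin = T_CRIT_95[df] * pooled * math.sqrt(1.0 / n1 + 1.0 / n2)
    return diff, margin, pooled, abs(diff) > margin


# -j4 buildworld numbers quoted earlier in the thread (res4A vs res4B).
j4 = compare(3, 1581.3033, 32.389449, 3, 1421.7167, 1.2439587)
# PostgreSQL numbers from the table above (229659 vs 229659+mav-ule).
pg = compare(10, 50093.065, 240.47236, 10, 49936.298, 341.25592)

for name, (diff, margin, pooled, sig) in (("-j4", j4), ("pgsql", pg)):
    verdict = "difference" if sig else "no difference proven"
    print(f"{name}: {diff:.3f} +/- {margin:.3f} (pooled s = {pooled:.4f}): {verdict}")
```

For the -j4 case this comes out close to the quoted pooled s and margin. For the PostgreSQL rows the difference of the averages is smaller than the margin, which agrees with "No difference proven".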

All the numbers are here
https://docs.google.com/spreadsheet/ccc?key=0Ai0N1xDe3uNAdDRxcVFiYjNMSnJWOTZhUWVWWlBlemc&hl=en_US#gid=4

I'll update the pbzip2 tab in the document later today.

Florian

