Date: Mon, 06 Feb 2012 20:08:00 +0100
From: Florian Smeets <flo@FreeBSD.org>
To: davidxu@FreeBSD.org
Cc: freebsd-hackers@FreeBSD.org, Alexander Motin <mav@FreeBSD.org>, David Xu <listlog2011@gmail.com>
Subject: Re: [RFT][patch] Scheduling for HTT and not only
Message-ID: <4F302510.70106@FreeBSD.org>
In-Reply-To: <4F2F886F.1070706@gmail.com>
References: <4F2F7B7F.40508@FreeBSD.org> <4F2F8405.2040103@gmail.com> <4F2F84E3.60809@FreeBSD.org> <4F2F886F.1070706@gmail.com>

On 06.02.12 08:59, David Xu wrote:
> On 2012/2/6 15:44, Alexander Motin wrote:
>> On 06.02.2012 09:40, David Xu wrote:
>>> On 2012/2/6 15:04, Alexander Motin wrote:
>>>> Hi.
>>>>
>>>> I've analyzed scheduler behavior and think found the problem with HTT.
>>>> SCHED_ULE knows about HTT and when doing load balancing once a second,
>>>> it does right things. Unluckily, if some other thread gets in the way,
>>>> process can be easily pushed out to another CPU, where it will stay
>>>> for another second because of CPU affinity, possibly sharing physical
>>>> core with something else without need.
>>>>
>>>> I've made a patch, reworking SCHED_ULE affinity code, to fix that:
>>>> http://people.freebsd.org/~mav/sched.htt.patch
>>>>
>>>> This patch does three things:
>>>> - Disables strict affinity optimization when HTT detected to let more
>>>> sophisticated code to take into account load of other logical core(s).
>>> Yes, the HTT should first be skipped, looking up in upper layer to find
>>> a more idling physical core. At least, if system is a dual-core,
>>> 4-thread CPU,
>>> and if there are two busy threads, they should be run on different
>>> physical cores.
>>>
>>>> - Adds affinity support to the sched_lowest() function to prefer
>>>> specified (last used) CPU (and CPU groups it belongs to) in case of
>>>> equal load. Previous code always selected first valid CPU of evens. It
>>>> caused threads migration to lower CPUs without need.
>>>
>>> Even some level of imbalance can be borne, until it exceeds a threshold,
>>> this at least does not trash other cpu's cache, pushing a new thread
>>> to another cpu trashes its cache.
>>> The cpus and groups can be arranged in
>>> a circle list, so searching a lowest load cpu always starts from right
>>> neighborhood to tail, then circles from head to left neighborhood.
>>>
>>>> - If current CPU group has no CPU where the process with its priority
>>>> can run now, sequentially check parent CPU groups before doing global
>>>> search. That should improve affinity for the next cache levels.
>>>>
>>>> I've made several different benchmarks to test it, and so far results
>>>> look promising:
>>>> - On Atom D525 (2 physical cores + HTT) I've tested HTTP receive with
>>>> fetch and FTP transmit with ftpd. On receive I've got 103MB/s on
>>>> interface; on transmit somewhat less -- about 85MB/s. In both cases
>>>> scheduler kept interrupt thread and application on different physical
>>>> cores. Without patch speed fluctuating about 103-80MB/s on receive and
>>>> is about 85MB/s on transmit.
>>>> - On the same Atom I've tested TCP speed with iperf and got mostly the
>>>> same results:
>>>>   - receive to Atom with patch -- 755-765Mbit/s, without patch --
>>>> 531-765Mbit/s.
>>>>   - transmit from Atom in both cases 679Mbit/s.
>>>> Fluctuating receive behavior in both tests I think can be explained by
>>>> some heavy callout handled by the swi4:clock process, called on
>>>> receive (seen in top and schedgraph), but not on transmit. May be it
>>>> is specifics of the Realtek NIC driver.
>>>>
>>>> - On the same Atom tested number of 512 byte reads from SSD with dd in
>>>> 1 and 32 streams. Found no regressions, but no benefits also as with
>>>> one stream there is no congestion and with multiple streams all cores
>>>> congested.
>>>>
>>>> - On Core i7-2600K (4 physical cores + HTT) I've run more then 20
>>>> `make buildworld`s with different -j values (1,2,4,6,8,12,16) for both
>>>> original and patched kernel.
>>>> I've found no performance regressions,
>>>> while for -j4 I've got 10% improvement:
>>>> # ministat -w 65 res4A res4B
>>>> x res4A
>>>> + res4B
>>>> [ministat distribution plot; column alignment not recoverable]
>>>>     N           Min           Max        Median           Avg        Stddev
>>>> x   3       1554.86       1617.43       1571.62     1581.3033     32.389449
>>>> +   3       1420.69        1423.1       1421.36     1421.7167     1.2439587
>>>> Difference at 95.0% confidence
>>>>         -159.587 ± 51.9496
>>>>         -10.0921% ± 3.28524%
>>>>         (Student's t, pooled s = 22.9197)
>>>> , and for -j6 -- 3.6% improvement:
>>>> # ministat -w 65 res6A res6B
>>>> x res6A
>>>> + res6B
>>>> [ministat distribution plot; column alignment not recoverable]
>>>>     N           Min           Max        Median           Avg        Stddev
>>>> x   3       1381.17       1402.94        1400.3     1394.8033     11.880372
>>>> +   3        1340.4       1349.34       1341.23     1343.6567     4.9393758
>>>> Difference at 95.0% confidence
>>>>         -51.1467 ± 20.6211
>>>>         -3.66694% ± 1.47842%
>>>>         (Student's t, pooled s = 9.09782)
>>>>
>>>> Who wants to do independent testing to verify my results or do some
>>>> more interesting benchmarks? :)
>>>>
>>>> PS: Sponsored by iXsystems, Inc.
>>>>
>>> The benchmark is incomplete, a complete benchmark should at lease
>>> includes cpu intensive applications.
>>> Testing for release world databases and web servers and other importance
>>> applications is needed.
>>
>> I plan to do this, but you may help. ;)
>>
> Thanks, I need to find time. I have cc'ed hackers@, my first mail seems
> forgot to include it. I think designing a SMP scheduler is a dirty work,
> many test and refining and still, you may get imperfect result.
> ;-)
>

Here are my tests for PostgreSQL (I still use r229659, as the baseline was
taken with that revision). This is on a 2x4 core, no HTT box. Max
throughput is at 10 threads, so that is what I used for ministat.

x 229659
+ 229659+mav-ule
[ministat distribution plot; column alignment not recoverable]
    N           Min           Max        Median           Avg        Stddev
x  10     49647.932     50376.405     50194.668     50093.065     240.47236
+  10     49482.234     50359.181     50159.422     49936.298     341.25592
No difference proven at 95.0% confidence

All the numbers are here:
https://docs.google.com/spreadsheet/ccc?key=0Ai0N1xDe3uNAdDRxcVFiYjNMSnJWOTZhUWVWWlBlemc&hl=en_US#gid=4

I'll update the pbzip2 tab in the document later today.

Florian
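
The ministat summaries quoted in this thread can be re-derived from the reported N, Avg and Stddev columns. The sketch below is only an illustration, not ministat's source: it assumes the usual pooled-variance Student's t interval, and the function name `pooled_t_interval` and the small table of critical values are introduced here for the example.

```python
import math

# Two-sided 95% Student's t critical values from a standard t table.
# Assumption: only the degrees of freedom needed for the two runs below.
T_CRIT_95 = {4: 2.776, 18: 2.101}


def pooled_t_interval(n1, mean1, sd1, n2, mean2, sd2):
    """Return (difference, margin, pooled_sd) for mean2 - mean1.

    Hypothetical helper: pooled-variance interval for two independent
    samples, computed from summary statistics only.
    """
    df = n1 + n2 - 2
    pooled_var = ((n1 - 1) * sd1 ** 2 + (n2 - 1) * sd2 ** 2) / df
    pooled_sd = math.sqrt(pooled_var)
    margin = T_CRIT_95[df] * pooled_sd * math.sqrt(1.0 / n1 + 1.0 / n2)
    return mean2 - mean1, margin, pooled_sd


# buildworld -j4: res4A (original kernel) vs res4B (patched kernel)
diff4, margin4, s4 = pooled_t_interval(
    3, 1581.3033, 32.389449, 3, 1421.7167, 1.2439587)

# PostgreSQL at 10 threads: r229659 vs r229659+mav-ule
diff_pg, margin_pg, s_pg = pooled_t_interval(
    10, 50093.065, 240.47236, 10, 49936.298, 341.25592)

for name, diff, margin in (("buildworld -j4", diff4, margin4),
                           ("PostgreSQL", diff_pg, margin_pg)):
    # A difference is "proven" only if the interval excludes zero.
    proven = abs(diff) > margin
    print(f"{name}: {diff:.3f} +/- {margin:.3f}, difference proven: {proven}")
```

With the -j4 buildworld figures this gives a difference of about -159.59 ± 51.95 with a pooled s of about 22.92, in line with the ministat output quoted above. For the PostgreSQL run the margin is larger than the difference, which is consistent with "No difference proven at 95.0% confidence".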