Date: Mon, 06 Feb 2012 15:59:43 +0800
From: David Xu <listlog2011@gmail.com>
To: Alexander Motin
Cc: freebsd-hackers@FreeBSD.org, davidxu@FreeBSD.org
Reply-To: davidxu@FreeBSD.org
Subject: Re: [RFT][patch] Scheduling for HTT and not only
Message-ID: <4F2F886F.1070706@gmail.com>
In-Reply-To: <4F2F84E3.60809@FreeBSD.org>

On 2012/2/6 15:44, Alexander Motin wrote:
> On 06.02.2012 09:40, David Xu wrote:
>> On 2012/2/6 15:04, Alexander Motin wrote:
>>> Hi.
>>>
>>> I've analyzed scheduler behavior and I think I've found the problem
>>> with HTT. SCHED_ULE knows about HTT, and when doing load balancing
>>> once a second it does the right things. Unfortunately, if some other
>>> thread gets in the way, a process can easily be pushed out to another
>>> CPU, where it will stay for another second because of CPU affinity,
>>> possibly sharing a physical core with something else without need.
>>>
>>> I've made a patch reworking the SCHED_ULE affinity code to fix that:
>>> http://people.freebsd.org/~mav/sched.htt.patch
>>>
>>> This patch does three things:
>>> - Disables the strict affinity optimization when HTT is detected, to
>>> let the more sophisticated code take the load of the other logical
>>> core(s) into account.
>>
>> Yes, the HTT level should be skipped first, looking upward in the
>> topology for a less loaded physical core. At the least, if the system
>> is a dual-core, 4-thread CPU and there are two busy threads, they
>> should run on different physical cores.
>>
>>> - Adds affinity support to the sched_lowest() function to prefer the
>>> specified (last used) CPU (and the CPU groups it belongs to) in case
>>> of equal load. The previous code always selected the first valid CPU
>>> among equals, which caused needless thread migration toward
>>> lower-numbered CPUs.
>>
>> Some level of imbalance can be tolerated until it exceeds a threshold;
>> that at least does not trash another CPU's cache, whereas pushing a
>> thread to another CPU trashes its cache. The CPUs and groups could be
>> arranged in a circular list, so the search for the lowest-loaded CPU
>> always starts from the right neighbour and runs to the tail, then
>> wraps from the head around to the left neighbour.
>>
>>> - If the current CPU group has no CPU where the process can run now
>>> at its priority, sequentially check the parent CPU groups before
>>> doing a global search. That should improve affinity for the next
>>> cache levels.
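
To make my circular-list suggestion above a bit more concrete, here is a
rough sketch in plain C. It is not SCHED_ULE code; the load array,
last_cpu and IMBALANCE_THRESHOLD are made-up names, and the real
scheduler works on the cpu_group topology rather than a flat array. The
point is only the search order and the threshold: start at the right
neighbour of the CPU the thread last ran on, walk the circle once, and
migrate only when the win is big enough to be worth losing the cache.

#define	IMBALANCE_THRESHOLD	2	/* load difference we tolerate */

/* Pick a CPU for a thread that last ran on last_cpu (sketch only). */
static int
pick_cpu(const int *load, int ncpus, int last_cpu)
{
	int best, cand, i;

	best = last_cpu;
	for (i = 1; i < ncpus; i++) {
		/* Walk the circle starting at the right neighbour. */
		cand = (last_cpu + i) % ncpus;
		if (load[cand] < load[best])
			best = cand;
	}
	/* Stay put unless the imbalance is worth a migration. */
	if (load[last_cpu] - load[best] <= IMBALANCE_THRESHOLD)
		return (last_cpu);
	return (best);
}

With equal loads everywhere, the thread simply stays where it last ran,
which is the affinity behavior I have in mind.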
>>>
>>> I've made several different benchmarks to test it, and so far the
>>> results look promising:
>>>
>>> - On Atom D525 (2 physical cores + HTT) I've tested HTTP receive with
>>> fetch and FTP transmit with ftpd. On receive I've got 103MB/s on the
>>> interface; on transmit somewhat less -- about 85MB/s. In both cases
>>> the scheduler kept the interrupt thread and the application on
>>> different physical cores. Without the patch, speed fluctuates between
>>> about 80 and 103MB/s on receive and is about 85MB/s on transmit.
>>>
>>> - On the same Atom I've tested TCP speed with iperf and got mostly
>>> the same results:
>>> - receive to the Atom with the patch -- 755-765Mbit/s, without the
>>> patch -- 531-765Mbit/s.
>>> - transmit from the Atom in both cases 679Mbit/s.
>>> I think the fluctuating receive behavior in both tests can be
>>> explained by some heavy callout handled by the swi4:clock process,
>>> called on receive (seen in top and schedgraph) but not on transmit.
>>> Maybe it is specific to the Realtek NIC driver.
>>>
>>> - On the same Atom I've tested the number of 512-byte reads from an
>>> SSD with dd in 1 and 32 streams. Found no regressions, but no
>>> benefits either, as with one stream there is no congestion and with
>>> multiple streams all cores are congested.
>>>
>>> - On Core i7-2600K (4 physical cores + HTT) I've run more than 20
>>> `make buildworld`s with different -j values (1, 2, 4, 6, 8, 12, 16)
>>> for both the original and the patched kernel. I've found no
>>> performance regressions, while for -j4 I've got a 10% improvement:
>>>
>>> # ministat -w 65 res4A res4B
>>> x res4A
>>> + res4B
>>>     N        Min        Max     Median        Avg       Stddev
>>> x   3    1554.86    1617.43    1571.62  1581.3033    32.389449
>>> +   3    1420.69     1423.1    1421.36  1421.7167    1.2439587
>>> Difference at 95.0% confidence
>>>         -159.587 ± 51.9496
>>>         -10.0921% ± 3.28524%
>>>         (Student's t, pooled s = 22.9197)
>>>
>>> and for -j6 a 3.6% improvement:
>>>
>>> # ministat -w 65 res6A res6B
>>> x res6A
>>> + res6B
>>>     N        Min        Max     Median        Avg       Stddev
>>> x   3    1381.17    1402.94     1400.3  1394.8033    11.880372
>>> +   3     1340.4    1349.34    1341.23  1343.6567    4.9393758
>>> Difference at 95.0% confidence
>>>         -51.1467 ± 20.6211
>>>         -3.66694% ± 1.47842%
>>>         (Student's t, pooled s = 9.09782)
>>>
>>> Who wants to do independent testing to verify my results or do some
>>> more interesting benchmarks? :)
>>>
>>> PS: Sponsored by iXsystems, Inc.
>>>
>> The benchmark is incomplete; a complete benchmark should at least
>> include CPU-intensive applications. Testing with real-world databases,
>> web servers and other important applications is also needed.
>
> I plan to do this, but you may help. ;)

Thanks, I need to find time. I have cc'ed hackers@; my first mail seems
to have left it out. I think designing an SMP scheduler is dirty work:
lots of testing and refining, and still you may end up with an imperfect
result. ;-)

Regards,
David Xu
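
PS: To check that I read the third change correctly, here is roughly how
I picture the parent-group walk. It is a self-contained sketch with
made-up types and names (grp, can_run_now), not the code from the patch:
if no CPU in the thread's current group can run it now at its priority,
widen the search to the parent group (the next cache level), and only
fall back to a global search when the topology root has been tried.

struct grp {
	struct grp	*parent;	/* NULL at the topology root */
	const int	*cpus;		/* CPU ids covered by this group */
	int		ncpus;
};

/*
 * Return a CPU that can run the thread now, widening the search one
 * topology level at a time; -1 means "fall back to a global search".
 */
static int
find_cpu(struct grp *g, int prio, int (*can_run_now)(int cpu, int prio))
{
	int i;

	for (; g != NULL; g = g->parent)
		for (i = 0; i < g->ncpus; i++)
			if (can_run_now(g->cpus[i], prio))
				return (g->cpus[i]);
	return (-1);
}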