From owner-freebsd-hackers@FreeBSD.ORG Mon Feb 6 19:18:33 2012
From: Alexander Motin <mavbsd@gmail.com>
Date: Mon, 06 Feb 2012 21:18:28 +0200
To: Florian Smeets
Cc: freebsd-hackers@FreeBSD.org, davidxu@FreeBSD.org
Subject: Re: [RFT][patch] Scheduling for HTT and not only
Message-ID: <4F302784.3090607@FreeBSD.org>
In-Reply-To: <4F302510.70106@FreeBSD.org>

On 02/06/12 21:08, Florian Smeets wrote:
> On 06.02.12 08:59, David Xu wrote:
>> On 2012/2/6 15:44, Alexander Motin wrote:
>>> On 06.02.2012 09:40, David Xu wrote:
>>>> On 2012/2/6 15:04, Alexander Motin wrote:
>>>>> Hi.
>>>>>
>>>>> I've analyzed the scheduler behavior and think I have found the
>>>>> problem with HTT. SCHED_ULE knows about HTT, and when doing load
>>>>> balancing once a second it does the right things. Unluckily, if some
>>>>> other thread gets in the way, a process can easily be pushed out to
>>>>> another CPU, where it will stay for another second because of CPU
>>>>> affinity, possibly sharing a physical core with something else
>>>>> without need.
>>>>>
>>>>> I've made a patch, reworking the SCHED_ULE affinity code, to fix that:
>>>>> http://people.freebsd.org/~mav/sched.htt.patch
>>>>>
>>>>> This patch does three things:
>>>>> - Disables the strict affinity optimization when HTT is detected, to
>>>>> let the more sophisticated code take into account the load of the
>>>>> other logical core(s).
>>>>
>>>> Yes, the HTT level should first be skipped, looking up in the upper
>>>> layer to find a more idle physical core. At least, if the system is a
>>>> dual-core, 4-thread CPU and there are two busy threads, they should be
>>>> run on different physical cores.
>>>>
>>>>> - Adds affinity support to the sched_lowest() function to prefer the
>>>>> specified (last used) CPU (and the CPU groups it belongs to) in case
>>>>> of equal load. The previous code always selected the first valid CPU
>>>>> among equals, which caused threads to migrate to lower-numbered CPUs
>>>>> without need.
>>>>
>>>> Even some level of imbalance can be tolerated, until it exceeds a
>>>> threshold; this at least does not thrash other CPUs' caches, while
>>>> pushing a new thread to another CPU thrashes its cache. The CPUs and
>>>> groups can be arranged in a circular list, so that searching for the
>>>> lowest-load CPU always starts from the right neighbor and goes to the
>>>> tail, then wraps from the head back to the left neighbor.
>>>>
>>>>> - If the current CPU group has no CPU where the process with its
>>>>> priority can run now, sequentially check the parent CPU groups before
>>>>> doing a global search. That should improve affinity for the next
>>>>> cache levels.
>>>>>
>>>>> I've run several different benchmarks to test it, and so far the
>>>>> results look promising:
>>>>> - On an Atom D525 (2 physical cores + HTT) I've tested HTTP receive
>>>>> with fetch and FTP transmit with ftpd. On receive I've got 103MB/s on
>>>>> the interface; on transmit somewhat less -- about 85MB/s. In both
>>>>> cases the scheduler kept the interrupt thread and the application on
>>>>> different physical cores. Without the patch, speed fluctuated between
>>>>> about 80 and 103MB/s on receive and was about 85MB/s on transmit.
>>>>> - On the same Atom I've tested TCP speed with iperf and got mostly
>>>>> the same results:
>>>>> - receive to Atom with patch -- 755-765Mbit/s, without patch --
>>>>> 531-765Mbit/s;
>>>>> - transmit from Atom in both cases 679Mbit/s.
>>>>> The fluctuating receive behavior in both tests can, I think, be
>>>>> explained by some heavy callout handled by the swi4:clock process,
>>>>> called on receive (seen in top and schedgraph), but not on transmit.
>>>>> It may be specific to the Realtek NIC driver.
>>>>>
>>>>> - On the same Atom I've tested 512-byte reads from an SSD with dd in
>>>>> 1 and 32 streams. I found no regressions, but also no benefits, as
>>>>> with one stream there is no congestion and with multiple streams all
>>>>> cores are congested.
>>>>>
>>>>> - On a Core i7-2600K (4 physical cores + HTT) I've run more than 20
>>>>> `make buildworld`s with different -j values (1,2,4,6,8,12,16) for
>>>>> both the original and the patched kernel. I've found no performance
>>>>> regressions, while for -j4 I've got a 10% improvement:
>>>>> # ministat -w 65 res4A res4B
>>>>> x res4A
>>>>> + res4B
>>>>> [ministat distribution plot elided]
>>>>>     N           Min           Max        Median           Avg        Stddev
>>>>> x   3       1554.86       1617.43       1571.62     1581.3033     32.389449
>>>>> +   3       1420.69        1423.1       1421.36     1421.7167     1.2439587
>>>>> Difference at 95.0% confidence
>>>>>         -159.587 ± 51.9496
>>>>>         -10.0921% ± 3.28524%
>>>>>         (Student's t, pooled s = 22.9197)
>>>>> and for -j6 a 3.6% improvement:
>>>>> # ministat -w 65 res6A res6B
>>>>> x res6A
>>>>> + res6B
>>>>> [ministat distribution plot elided]
>>>>>     N           Min           Max        Median           Avg        Stddev
>>>>> x   3       1381.17       1402.94        1400.3     1394.8033     11.880372
>>>>> +   3        1340.4       1349.34       1341.23     1343.6567     4.9393758
>>>>> Difference at 95.0% confidence
>>>>>         -51.1467 ± 20.6211
>>>>>         -3.66694% ± 1.47842%
>>>>>         (Student's t, pooled s = 9.09782)
>>>>>
>>>>> Who wants to do independent testing to verify my results or do some
>>>>> more interesting benchmarks? :)
>>>>>
>>>>> PS: Sponsored by iXsystems, Inc.
>>>>>
>>>> The benchmark is incomplete; a complete benchmark should at least
>>>> include CPU-intensive applications. Testing with real-world databases,
>>>> web servers and other important applications is also needed.
>>>
>>> I plan to do this, but you may help. ;)
>>>
>> Thanks, I need to find time. I have cc'ed hackers@; my first mail seems
>> to have forgotten to include it. I think designing an SMP scheduler is
>> dirty work: many tests and refinements, and still you may get an
>> imperfect result. ;-)
>>
> Here are my tests for PostgreSQL (I still use r229659, as the baseline
> was taken with that revision). This is on a 2x4-core box, no HTT. Max
> throughput is at 10 threads, so that is what I used for ministat.
>
> x 229659
> + 229659+mav-ule
> [ministat distribution plot elided]
>     N           Min           Max        Median           Avg        Stddev
> x  10     49647.932     50376.405     50194.668     50093.065     240.47236
> +  10     49482.234     50359.181     50159.422     49936.298     341.25592
> No difference proven at 95.0% confidence
>
> All the numbers are here:
> https://docs.google.com/spreadsheet/ccc?key=0Ai0N1xDe3uNAdDRxcVFiYjNMSnJWOTZhUWVWWlBlemc&hl=en_US#gid=4
>
> I'll update the pbzip2 tab in the document later today.

I'm sorry, but I think you can put this on pause for a moment. After some
tests with MySQL (where I've found a 3% regression), new feedback and more
thinking, I have a wish to rewrite the patch. I'll probably send a new one
to test in the next few days. Thank you for your help.
-- 
Alexander Motin