From owner-freebsd-hackers@FreeBSD.ORG Mon Feb 6 16:01:36 2012
Date: Mon, 6 Feb 2012 16:01:36 +0000
From: Alexander Best
To: Alexander Motin
Cc: freebsd-hackers@freebsd.org
Message-ID: <20120206160136.GA35918@freebsd.org>
In-Reply-To: <4F2F7B7F.40508@FreeBSD.org>
References: <4F2F7B7F.40508@FreeBSD.org>
Subject: Re: [RFT][patch] Scheduling for HTT and not only
List-Id: Technical Discussions relating to FreeBSD

On Mon Feb 6 12, Alexander Motin wrote:
> Hi.
>
> I've analyzed scheduler behavior and think I've found the problem with
> HTT. SCHED_ULE knows about HTT, and when doing load balancing once a
> second it does the right things. Unluckily, if some other thread gets
> in the way, a process can easily be pushed out to another CPU, where
> it will stay for another second because of CPU affinity, possibly
> sharing a physical core with something else without need.
>
> I've made a patch reworking the SCHED_ULE affinity code to fix that:
> http://people.freebsd.org/~mav/sched.htt.patch
>
> This patch does three things:
> - Disables the strict affinity optimization when HTT is detected, to
> let more sophisticated code take the load of other logical core(s)
> into account.
> - Adds affinity support to the sched_lowest() function to prefer the
> specified (last used) CPU (and the CPU groups it belongs to) in case
> of equal load. The previous code always selected the first valid CPU
> among equals, which caused needless thread migration to
> lower-numbered CPUs.
> - If the current CPU group has no CPU where the process with its
> priority can run now, sequentially check the parent CPU groups before
> doing a global search. That should improve affinity for the next
> cache levels.
>
> I've made several different benchmarks to test it, and so far the
> results look promising:
> - On an Atom D525 (2 physical cores + HTT) I've tested HTTP receive
> with fetch and FTP transmit with ftpd. On receive I got 103MB/s on
> the interface; on transmit somewhat less -- about 85MB/s. In both
> cases the scheduler kept the interrupt thread and the application on
> different physical cores. Without the patch, receive speed fluctuated
> between 80 and 103MB/s, and transmit stayed at about 85MB/s.
> - On the same Atom I've tested TCP speed with iperf and got mostly
> the same results:
>   - receive to the Atom with the patch -- 755-765Mbit/s; without the
>     patch -- 531-765Mbit/s;
>   - transmit from the Atom -- 679Mbit/s in both cases.
> The fluctuating receive behavior in both tests can, I think, be
> explained by some heavy callout handled by the swi4:clock process,
> called on receive (seen in top and schedgraph) but not on transmit.
> Maybe it is specific to the Realtek NIC driver.
>
> - On the same Atom I tested 512-byte reads from an SSD with dd in 1
> and 32 streams. I found no regressions, but no benefits either: with
> one stream there is no congestion, and with multiple streams all
> cores are congested.
>
> - On a Core i7-2600K (4 physical cores + HTT) I've run more than 20
> `make buildworld`s with different -j values (1,2,4,6,8,12,16) for
> both the original and the patched kernel.
> I've found no performance regressions, while for -j4 I got a 10%
> improvement:
>
> # ministat -w 65 res4A res4B
> x res4A
> + res4B
> +-----------------------------------------------------------------+
> |+                                                                |
> |++                                       x         x           x|
> ||A|                              |______M__A__________|         |
> +-----------------------------------------------------------------+
>     N           Min           Max        Median           Avg        Stddev
> x   3       1554.86       1617.43       1571.62     1581.3033     32.389449
> +   3       1420.69        1423.1       1421.36     1421.7167     1.2439587
> Difference at 95.0% confidence
>         -159.587 ± 51.9496
>         -10.0921% ± 3.28524%
>         (Student's t, pooled s = 22.9197)
>
> , and for -j6 a 3.6% improvement:
>
> # ministat -w 65 res6A res6B
> x res6A
> + res6B
> +-----------------------------------------------------------------+
> |      +                                                          |
> |  +   +                      x                 x               x |
> ||_M__A___|                   |__________A____M_____|             |
> +-----------------------------------------------------------------+
>     N           Min           Max        Median           Avg        Stddev
> x   3       1381.17       1402.94        1400.3     1394.8033     11.880372
> +   3        1340.4       1349.34       1341.23     1343.6567     4.9393758
> Difference at 95.0% confidence
>         -51.1467 ± 20.6211
>         -3.66694% ± 1.47842%
>         (Student's t, pooled s = 9.09782)
>
> Who wants to do independent testing to verify my results or do some
> more interesting benchmarks? :)

i don't have any benchmarks to offer, but i'm seeing a massive increase
in responsiveness with your patch. with an unpatched kernel, opening
xterm while unrar'ing some huge archive could take up to 3 minutes!!!
with your patch the time it takes for xterm to start is never > 10
seconds!!!

well done. :) really looking forward to seeing this committed.

cheers.
alex

btw: i couldn't verify a decrease in my mouse's input rate. nothing was
lagging! however i'm not running moused(8). i can only advise anyone to
turn it off in connection with usb mice. i was having massive problems
with moused(8) and hald(8) (i.e. input rates < 1 Hz during heavy disk
i/o).
disabling moused(8) and relying on hald(8) completely (removing any
mouse-specific entry from my xorg.conf and disabling moused(8) in my
rc.conf) solved the issue entirely.

> PS: Sponsored by iXsystems, Inc.
>
> --
> Alexander Motin