Date: Thu, 28 May 2015 01:21:31 +1000 (EST)
From: Bruce Evans <brde@optusnet.com.au>
To: Kurt Lidl
cc: Bruce Evans, Eitan Adler, Adrian Chadd, src-committers@freebsd.org,
    svn-src-all@freebsd.org, svn-src-head@freebsd.org
Subject: Re: svn commit: r281103 - head/sys/amd64/amd64

On Wed, 27 May 2015, Kurt Lidl wrote:

> On 4/6/15 1:42 AM, Bruce Evans wrote:
>> On Mon, 6 Apr 2015, Eitan Adler wrote:
>>
>>> + a few people interested in the diff
>>>
>>> On 5 April 2015 at 02:55, Bruce Evans wrote:
>>>> On Sun, 5 Apr 2015, Eitan Adler wrote:
>>>
>>> I did not confirm the performance impact, but the submitter and others
>>> indicated they saw a difference.
>>>
>>> Do you have specific data that shows that there was an improvement?
>>
>> Only micro-benchmark output that indicates little difference.  This
>> is probably very MD (depending on write-combining hardware), so you
>> might only see a difference on some systems.
>>
>> I also have micro-benchmark output for network packets/second that
>> shows 10% differences for the change of adding 1 byte of padding
>> in code that is never executed.  This seems to be due to different
>> cache misses.  To eliminate differences from this (except ones
>> caused by actually running different code), create a reference
>> version by padding the functions or data to be changed so that
>> the change doesn't affect the address of anything except the
>> internals of the changed parts.
>>
>> I might try a makeworld run to see if changing the non-temporal
>> accesses in pagecopy and pagezero to cached ones makes a difference.
>
> I ran a few (total of 12) buildworld runs after this discussion.
> I finally got around to posting the results to the original bug.
>
> The data is here:
>
> https://bugs.freebsd.org/bugzilla/show_bug.cgi?id=199151#c3

I can't read that, but ran many related benchmarks on a new system.
Haswell CPUs have very fast "rep movsb" for large copies within the L1
cache.  These run at 32 bytes/cycle.  Nothing except copying through AVX
registers can get anywhere near this.  The next best is copying through
SSE registers at 16 bytes/cycle.  However, the L1 cache is not very
large, and "rep movsb" has a large setup overhead -- about 23 cycles.
A single 4K page is barely large enough for the setup overhead to not
dominate: it takes 23 cycles to set up, then only 128 more cycles at
32 bytes/cycle to do the work.  Page zeroing and copying is rarely
within the L1 cache.

Within the L2 cache, the speed of "rep movsb" drops to only about 8
bytes/cycle.  Copying through SSE registers can easily keep up with
this, perhaps even in a non-unrolled loop.  Copying through 64-bit
integer registers, as for page zeroing and copying on amd64, cannot
keep up so easily.  I think a non-unrolled loop runs at about 2
cycles/iteration.  That only does 4 bytes/cycle with 64-bit registers,
and only 2 bytes/cycle with 32-bit registers.  amd64 uses 4-way
unrolling.  Apparently, the load/store instructions run at at most 1
pair per cycle, giving a maximum of 8 bytes/cycle; any loop overhead
then reduces the throughput to less than 8 bytes/cycle, so more
unrolling helps a little.

Page zeroing and copying may often miss even the L2 or L3 cache.  Then
the speed on Haswell drops to that of main memory, which is about 1.25
bytes/cycle on my system.  Almost any method can keep up with this in
theory, but in practice nontemporal stores through SSE registers
(movntps is best for portability) are fastest (not counting their
advantage of not thrashing the caches), "rep movsb" is almost as good,
and cached 128-bit accesses through SSE registers are almost as good
as "rep movsb".

However, in tests of makeworld on i386 systems with non-bloated worlds,
it was better by 1-2% to not use nontemporal stores at all.  i386 only
uses them for pagezero, and only uses 32-bit movnti for them.
Replacing this with a simple bzero (which uses "rep stosl", and that
runs at the same speed as "rep movsb") gave the 1-2% improvement.  I
also tried using 32-bit movnti for pagecopy -- this gave a 1-2%
unimprovement.  Perhaps the 32-bit accesses are too small, but "rep
movsb" is so fast that it is hard to beat.  (This was with current
kernels and an old userland.  My version uses movntps for i386
pagecopy and pagezero, and this gives improvements in the 1-2% range
on older CPUs.)

For nontemporal stores to be a pessimization, the page zeroing and
copying must often give more hits later when nontemporal stores are
not used.  This is possible with cache sizes of several MB and my
non-bloated world, where the compiler's size is 5MB instead of 50MB as
in -current.  The cache size on my CPU is 8MB.  This is shared among 4
real cores and 4 HTT cores, so it is only 1MB per CPU and only about
512K per runnable thread with -j16.  Even 5MB is more than enough
bloat to thrash 512K.  However, there is apparently enough locality
for caching to help even for zeroed pages.  If the zeroing happens on
demand, then it is likely that the page will be accessed soon, so
caching it helps.  If the zeroing happens in advance, then under load
perhaps zeroed pages get used soon enough that any caching of them
helps.

Bruce
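
P.S.  To make the comparison concrete, here is a minimal userland
sketch of the copy loops discussed above.  This is an illustration,
not the kernel's pagecopy; it assumes x86-64 with SSE2 and gcc/clang,
the function names and the rdtsc harness are made up, and
_mm_stream_si128 generates movntdq rather than the movnti/movntps
mentioned above:

#include <stdint.h>
#include <stdio.h>
#include <string.h>
#include <emmintrin.h>          /* SSE2 intrinsics */
#include <x86intrin.h>          /* __rdtsc() */

#define PAGE_SIZE       4096

/* Large copy via "rep movsb" (ERMSB); ~23 cycles of setup overhead. */
static void
copy_rep_movsb(void *dst, const void *src, size_t len)
{
        __asm__ volatile("rep movsb"
            : "+D" (dst), "+S" (src), "+c" (len) : : "memory");
}

/* 4-way unrolled copy through 64-bit integer registers, amd64-style. */
static void
copy_unrolled64(void *dst, const void *src, size_t len)
{
        uint64_t *d = dst;
        const uint64_t *s = src;
        size_t i;

        for (i = 0; i < len / sizeof(uint64_t); i += 4) {
                d[i + 0] = s[i + 0];
                d[i + 1] = s[i + 1];
                d[i + 2] = s[i + 2];
                d[i + 3] = s[i + 3];
        }
}

/* Copy through SSE registers with nontemporal (cache-bypassing) stores. */
static void
copy_sse_nt(void *dst, const void *src, size_t len)
{
        __m128i *d = dst;
        const __m128i *s = src;
        size_t i;

        for (i = 0; i < len / sizeof(__m128i); i++)
                _mm_stream_si128(&d[i], _mm_load_si128(&s[i]));
        _mm_sfence();           /* order the nontemporal stores */
}

int
main(void)
{
        static char src[PAGE_SIZE] __attribute__((aligned(16)));
        static char dst[PAGE_SIZE] __attribute__((aligned(16)));
        uint64_t t0;

        memset(src, 0x5a, sizeof(src));

        /* Single cold runs; real measurements need warmup and averaging. */
        t0 = __rdtsc();
        copy_rep_movsb(dst, src, PAGE_SIZE);
        printf("rep movsb:  %ju cycles\n", (uintmax_t)(__rdtsc() - t0));

        t0 = __rdtsc();
        copy_unrolled64(dst, src, PAGE_SIZE);
        printf("unrolled64: %ju cycles\n", (uintmax_t)(__rdtsc() - t0));

        t0 = __rdtsc();
        copy_sse_nt(dst, src, PAGE_SIZE);
        printf("sse nt:     %ju cycles\n", (uintmax_t)(__rdtsc() - t0));
        return (0);
}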
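
The zeroing side can be added to the same file in the same style:
cached "rep stosq" (the 64-bit analogue of the "rep stosl" that bzero
uses on i386) against nontemporal SSE stores.  Call these from main()
the same way to time them.  Whether the nontemporal version wins
depends on whether the zeroed page is touched again while it is still
cached, which is the 1-2% effect described above.

/* Zero via "rep stosq"; stays in the cache, like bzero's "rep stosl". */
static void
zero_rep_stosq(void *dst, size_t len)
{
        uint64_t zero = 0;
        size_t cnt = len / sizeof(uint64_t);

        __asm__ volatile("rep stosq"
            : "+D" (dst), "+c" (cnt) : "a" (zero) : "memory");
}

/* Nontemporal zeroing through SSE registers (movntps-style stores). */
static void
zero_sse_nt(void *dst, size_t len)
{
        __m128i *d = dst;
        const __m128i zero = _mm_setzero_si128();
        size_t i;

        for (i = 0; i < len / sizeof(__m128i); i++)
                _mm_stream_si128(&d[i], zero);
        _mm_sfence();           /* order the nontemporal stores */
}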