Date: Sun, 31 Jul 2016 18:26:29 +0300
From: Slawa Olhovchenkov
To: Bruce Evans
Cc: Mateusz Guzik, svn-src-head@freebsd.org, svn-src-all@freebsd.org, src-committers@freebsd.org
Subject: Re: svn commit: r303583 - head/sys/amd64/amd64
Message-ID: <20160731152629.GA8192@zxy.spb.ru>
References: <201607311134.u6VBY81j031059@repo.freebsd.org> <20160731220407.Q3033@besplex.bde.org> <20160731135129.GA22212@zxy.spb.ru> <20160801000046.X3364@besplex.bde.org>
In-Reply-To: <20160801000046.X3364@besplex.bde.org>

On Mon, Aug 01, 2016 at 12:30:14AM +1000, Bruce Evans wrote:

> On Sun, 31 Jul 2016, Slawa Olhovchenkov wrote:
>
> > On Sun, Jul 31, 2016 at 11:11:25PM +1000, Bruce Evans wrote:
> >
> >> Misalignment of this loop made it almost twice as slow on old Turion2 with
> >> slow DDR2 memory. It made no difference on Haswell. I added an extra
> >> movnti, but that makes little or no differences. 2 more movnti's wouldn't
> >> fit in a 16-byte cache line so are slower unless even more care is taken
> >> with alignment (or with less care, 4 with misalignment are not less than
> >> twice as slow as 1 with alignment).
> >>
> >> I thought that alignment and unrolling didn't matter here, because movnti
> >> has to wait for memory and almost any loop runs fast enough to keep up.
> >> The timing on my old system is something like: CPUs at 2 GHz; main memory
> >> at 4 GB/sec; movnti is only 4 bytes wide on i386 (so this problem
> >> only affects i386, at least with slow memory). So sustaining 4 GB/sec
> >> requires 1 G movnti's/sec, so the loop needs to run at 2 cycles/iteration
> >> to keep up. But when it is misaligned, it runs at 3-4 cycles/iteration.
> >> Alignment makes it take about 2, and the extra movnti is for safety and
> >> to work with faster memory.
> >>
> >> On Haswell with CPUs at 4 GHz, 2 cycles/iteration gives 8 GB/sec on
> >> i386 and 16 GB/sec on amd64 with wider movnti. IIRC, 16 GB/sec is about
> >> the main memory speed so nothing better is possible but just 1 extra
> >> movnti gives more with faster memory. This is just worse than bzero()
> >
> > What about modern system with 120 GB/sec main memory speed?
>
> Is there such a system? It would have main memory almost twice as fast
> as Haswell L2 and almost half as fast as Haswell L1.

http://ark.intel.com/products/family/93797/Intel-Xeon-Processor-E7-v4-Family#@Server

102 GB/s (sorry, 120 was a typo)

> My fastest memory actually does 20001 MB/s according to old memtest
> and that is about right according to other tests.

For a short time I have a free E5-1650 v4
http://ark.intel.com/products/92994/Intel-Xeon-Processor-E5-1650-v4-15M-Cache-3_60-GHz
with up to 76.8 GB/s (by datasheet, at DDR4-2400).
With the installed DDR4-2133 -- up to 68.2 GB/s (theoretical).
The system will be put into production shortly.

I am unable to boot UEFI Memtest86 7.0; the old version (4.3.7) shows 15 GB/s.

# ramspeed -b 18 -p 4
RAMspeed/SMP (FreeBSD) v3.5.0 by Rhett M. Hollander and Paul V. Bolotoff, 2002-09

8Gb per pass mode, 4 processes

SSE (nt) Copy:      54176.91 MB/s [NTA prefetch]
SSE (nt) Scale:     54241.98 MB/s [NTA prefetch]
SSE (nt) Add:       48945.60 MB/s [T0 prefetch]
SSE (nt) Triad:     50102.80 MB/s [T0 prefetch]
---
SSE (nt) AVERAGE:   51866.82 MB/s

# ramspeed -b 16 -p 4
RAMspeed/SMP (FreeBSD) v3.5.0 by Rhett M. Hollander and Paul V. Bolotoff, 2002-09

8Gb per pass mode, 4 processes

SSE & WRITING (nt)     1 Kb block: 55913.18 MB/s
SSE & WRITING (nt)     2 Kb block: 60819.02 MB/s
SSE & WRITING (nt)     4 Kb block: 58662.37 MB/s
SSE & WRITING (nt)     8 Kb block: 57165.14 MB/s
SSE & WRITING (nt)    16 Kb block: 56310.22 MB/s
SSE & WRITING (nt)    32 Kb block: 56407.22 MB/s
SSE & WRITING (nt)    64 Kb block: 58200.44 MB/s
SSE & WRITING (nt)   128 Kb block: 59213.49 MB/s
SSE & WRITING (nt)   256 Kb block: 59047.57 MB/s
SSE & WRITING (nt)   512 Kb block: 59158.01 MB/s
SSE & WRITING (nt)  1024 Kb block: 59140.03 MB/s
SSE & WRITING (nt)  2048 Kb block: 59165.49 MB/s
SSE & WRITING (nt)  4096 Kb block: 59714.68 MB/s
SSE & WRITING (nt)  8192 Kb block: 59926.68 MB/s
SSE & WRITING (nt) 16384 Kb block: 59100.03 MB/s
SSE & WRITING (nt) 32768 Kb block: 58268.52 MB/s

# ramspeed -b 16 -p 2
RAMspeed/SMP (FreeBSD) v3.5.0 by Rhett M. Hollander and Paul V. Bolotoff, 2002-09

8Gb per pass mode, 2 processes

SSE & WRITING (nt)     1 Kb block: 32131.03 MB/s
SSE & WRITING (nt)     2 Kb block: 41851.23 MB/s
SSE & WRITING (nt)     4 Kb block: 41848.02 MB/s
SSE & WRITING (nt)     8 Kb block: 41640.80 MB/s
SSE & WRITING (nt)    16 Kb block: 41640.60 MB/s
SSE & WRITING (nt)    32 Kb block: 41639.89 MB/s
SSE & WRITING (nt)    64 Kb block: 41849.65 MB/s
SSE & WRITING (nt)   128 Kb block: 41848.74 MB/s
SSE & WRITING (nt)   256 Kb block: 41847.87 MB/s
SSE & WRITING (nt)   512 Kb block: 41846.14 MB/s
SSE & WRITING (nt)  1024 Kb block: 41835.69 MB/s
SSE & WRITING (nt)  2048 Kb block: 41815.94 MB/s
SSE & WRITING (nt)  4096 Kb block: 41717.39 MB/s
SSE & WRITING (nt)  8192 Kb block: 41575.85 MB/s
SSE & WRITING (nt) 16384 Kb block: 41295.03 MB/s
SSE & WRITING (nt) 32768 Kb block: 40735.83 MB/s
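
For reference, the bandwidth arithmetic used in this thread can be sketched
roughly as below. This is only an illustration: the function names are
hypothetical, the DDR4 figures assume 4 memory channels of 8 bytes each (which
reproduces the 76.8 GB/s datasheet number), and GB is taken as 10^9 bytes.

```python
# Rough sketch of the arithmetic in this thread; all names are hypothetical.

def cycles_per_iteration(cpu_hz, mem_bytes_per_sec, bytes_per_store,
                         stores_per_iter=1):
    """Cycle budget per loop iteration needed to keep up with memory."""
    iters_per_sec = mem_bytes_per_sec / (bytes_per_store * stores_per_iter)
    return cpu_hz / iters_per_sec


def loop_bandwidth(cpu_hz, cycles_per_iter, bytes_per_store,
                   stores_per_iter=1):
    """Bytes/sec a store loop can issue at a given cycles/iteration."""
    return cpu_hz / cycles_per_iter * bytes_per_store * stores_per_iter


def ddr_peak_bandwidth(mega_transfers_per_sec, channels=4,
                       bytes_per_transfer=8):
    """Theoretical peak bytes/sec (assumes 4 channels, 64-bit each)."""
    return mega_transfers_per_sec * 1e6 * bytes_per_transfer * channels


if __name__ == "__main__":
    # Old system: 2 GHz CPU, 4 GB/sec memory, 4-byte movnti on i386.
    print("cycles/iteration budget:", cycles_per_iteration(2e9, 4e9, 4))

    # Haswell at 4 GHz, 2 cycles/iteration: i386 (4-byte) vs amd64 (8-byte).
    print("i386  GB/sec:", loop_bandwidth(4e9, 2, 4) / 1e9)
    print("amd64 GB/sec:", loop_bandwidth(4e9, 2, 8) / 1e9)

    # E5-1650 v4: datasheet speed vs the installed DIMMs.
    print("DDR4-2400 GB/sec:", ddr_peak_bandwidth(2400) / 1e9)
    print("DDR4-2133 GB/sec:", ddr_peak_bandwidth(2133) / 1e9)
```

Under these assumptions the 4-process write results above (roughly 56-61 GB/s)
sit fairly close to the DDR4-2133 theoretical peak, provided ramspeed's MB/s
are comparable to the datasheet's units, which I have not checked.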