Date: Mon, 1 Aug 2016 12:35:15 +1000 (EST)
From: Bruce Evans <brde@optusnet.com.au>
To: Konstantin Belousov
cc: Bruce Evans, Mateusz Guzik, src-committers@freebsd.org, svn-src-all@freebsd.org, svn-src-head@freebsd.org
Subject: Re: svn commit: r303583 - head/sys/amd64/amd64
In-Reply-To: <20160731163527.GZ83214@kib.kiev.ua>
Message-ID: <20160801105417.C919@besplex.bde.org>
References: <201607311134.u6VBY81j031059@repo.freebsd.org> <20160731220407.Q3033@besplex.bde.org> <20160731163527.GZ83214@kib.kiev.ua>

On Sun, 31 Jul 2016, Konstantin Belousov
wrote:

> On Sun, Jul 31, 2016 at 11:11:25PM +1000, Bruce Evans wrote:

I said that I didn't replace (sse2) pagecopy() by bcopy() on amd64 for
Haswell.  Actually I do, for a small improvement on makeworld.  i386
doesn't have (sse*) pagecopy() except in some of my versions, so I don't
need to do anything to get the same improvement on the same Haswell.

>> On Haswell, "rep stos" takes about 25 cycles to start up, and the function
>> call overhead is in the noise.  25 cycles is a lot.  Haswell can move
>> 32 bytes/cycle from L2 to L2, so it misses moving 800 bytes or 1/5 of a
>> page in its startup overhead.  Oops, that is for "rep movs".  "rep stos"
>> is similar.

> The commit message contained a probable explanation of the reason why
> the change demonstrated measurable improvement in non-microbenchmark load.

Pagefaults give some locality, but I think not enough to explain much of
the improvement, or the larger negative improvements that I measure.

makeworld isn't a micro-benchmark.  For a tuned ~5.2 world it does about
32 million pagezero()s.  makeworld does only 2728 pagefaults with warm
(VMIO and buffer...) caches on i386, and 24866 with cold caches.  On
amd64, 15% fewer.  Page reclaims are about 17 million on i386 and 27
million on amd64.  Either pagefaults each touch a lot of pages (so that
nontemporal stores should help in theory, by avoiding busting L1 and
depleting L2 on every pagefault), or there is a lot of pre-zeroing (so
again nontemporal stores should help in theory).

In fact, nontemporal stores help in practice on Turion2.  Haswell has
better caches, and that is probably the main reason that nontemporal
stores are slower in practice there.  Turion2 also benefited from the old
implementation of pagezero in idle.  Clearly, zeroing in idle should use
nontemporal stores.  But when nontemporal stores are much slower, there
are less likely to be enough otherwise-idle cycles to do enough of them.
Zeroing in idle works poorly now, and is turned off.
On systems with HTT, idle CPUs aren't created equal, and aren't really
idle if using them would steal resources from another CPU.

> That said, the only thing I am answering and asking there is the above
> claim about 25 cycles overhead of rep;stosq on hsw.  I am curious how
> the overhead was measured.  Note: Agner Fog's tables state that fast mode
> takes <2n uops and has reciprocal throughput of 0.5n worst case, and do
> not demonstrate any setup overhead for hsw.

I think the target is 0.25n best case (32 bytes/cycle, but only 8 bytes
wide using integer instructions).  ISTR that Fog says something about the
latency.  He does for older CPUs.  I've never noticed latency for x86
string instructions being below about 15 cycles, and the fast string
operations have to do more setup, so it would be surprising if they had
lower latency.

To measure latency, just time bcopy() and bzero() with different sizes
in a loop and take differences.  Use small sizes to stay in L1 and avoid
cache misses (except for preemption).  I get the following times for
amd64 on Haswell @ 4.080 GHz.  (These times also disprove my claim that
bzero() is just as good as a specialized function -- latency makes it
significantly slower except for unusually large sizes.)

                                    size 4096  size 8192  (speeds in 1e9 B/s)
    0.25n throughput:                  130.56     130.56
    rep movsb alone in a function:       96.5      110.9
    45+0.25n:                            96.6      111.0
    memcpy (rep movsq in libc):          72.5       92.9
    102+0.25n:                           72.7       93.4
    rep stosq alone in a function:      105.8      116.7
    31+0.25n:                           105.1      116.9

25 is about right for rep stosq inline -- the function call adds about 5,
and that is in the fastest possible case with the call in a loop.  libc
memcpy must be doing something very stupid to take 102 cycles.

Note that Haswell can't get very near 0+0.25n: amortising the startup
overhead would need sizes much larger than 2*8192, and Haswell's L1 is
too small for that.  The fastest speed I could find for rep movsb in a
function was 115.4 for size 13K.
Larger sizes are slower because they don't fit in L1 (2 * 14K fits in
32K L1, but is still slower for some reason).

Latency for non-rep string instructions is also interesting.  I think it
is almost as high, making these instructions useless for all purposes
except saving space on all CPUs, and saving time on CPUs almost as old
as the 8088 (on the 8088, instruction fetch was very slow, so it was
faster to use 1-byte instructions if possible).

Bruce