Date: Sun, 22 Jan 2017 19:10:26 +0200
From: Konstantin Belousov <kostikbel@gmail.com>
To: Bruce Evans
Cc: Mateusz Guzik, Mateusz Guzik, src-committers@freebsd.org,
	svn-src-all@freebsd.org, svn-src-head@freebsd.org
Subject: Re: svn commit: r312600 - head/sys/kern
Message-ID: <20170122171026.GH2349@kib.kiev.ua>
References: <201701211838.v0LIcHIv072626@repo.freebsd.org>
	<20170121195114.GA2349@kib.kiev.ua>
	<20170122115716.T953@besplex.bde.org>
	<20170122092228.GC20930@dft-labs.eu>
	<20170122224849.E897@besplex.bde.org>
	<20170122125716.GD2349@kib.kiev.ua>
	<20170123015806.E1391@besplex.bde.org>
In-Reply-To: <20170123015806.E1391@besplex.bde.org>

On Mon, Jan 23, 2017 at 04:00:19AM +1100, Bruce Evans wrote:
> On Sun, 22 Jan 2017, Konstantin Belousov wrote:
>
> > On Sun, Jan 22, 2017 at 11:41:09PM +1100, Bruce Evans wrote:
> >> On Sun, 22 Jan 2017, Mateusz Guzik wrote:
> >>> ...
> >>> I have to disagree about the usefulness remark.  If you check the
> >>> generated assembly for amd64 (included below), you will see that the
> >>> uncommon code is moved out of the way and, in particular, there are
> >>> no forward jumps in the common case.
> >>
> >> Check benchmarks.  A few cycles here and there are in the noise.  Kernel
> >> code has very few possibilities for useful optimizations since it
> >> doesn't have many inner loops.
> >>
> >>> With __predict_false:
> >>>
> >>> [snip prologue]
> >>>    0xffffffff8084ecaf <+31>:	mov	0x24(%rbx),%eax
> >>>    0xffffffff8084ecb2 <+34>:	test	$0x40,%ah
> >>>    0xffffffff8084ecb5 <+37>:	jne	0xffffffff8084ece2
> >>
> >> All that this does is as you say -- invert the condition to jump to the
> >> uncommon code.  This made more of a difference on old CPUs (possibly
> >> still on low-end/non-x86).  Now branch predictors are too good for the
> >> slow case to be much slower.
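
For readers following the disassembly: the hint macros are one-line
wrappers around __builtin_expect() in sys/cdefs.h, and the pattern under
discussion looks like the sketch below.  slow_path() is a made-up stand-in
for the uncommon code; the 0x4000 mirrors the "test $0x40,%ah" above.

/* From sys/cdefs.h: */
#define	__predict_true(exp)	__builtin_expect((exp), 1)
#define	__predict_false(exp)	__builtin_expect((exp), 0)

int	slow_path(int);			/* hypothetical uncommon path */

int
f(int flags)
{

	if (__predict_false(flags & 0x4000))	/* rarely true */
		return (slow_path(flags));
	return (0);
}

With the hint, the common path falls through and only a forward jne to an
out-of-line block remains, as in the disassembly above; without it, the
compiler may lay the code out the other way around.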
> >>
> >> I think x86 branch predictors initially predict forward branches as not
> >> taken and backward branches as taken.  So this branch was initially
> >> mispredicted, and the change fixes this.  But the first branch doesn't
> >> really matter.  Starting up takes hundreds or thousands of cycles for
> >> cache misses.
> > This is only true if the branch predictor memory is large enough to keep
> > the state for the given branch between successive exercises of it.  Even
> > if the predictor state could be attached to every byte in the icache, or
> > more likely, every line in the icache or uop cache, it is still probably
> > too small to survive user->kernel transitions for syscalls.  There might
> > be a performance counter which shows branch predictor mis-predictions.
> >
> > In other words, I suspect that without the manual hint almost all cases
> > might be mis-predictions, and mis-predictions together with the full
> > pipeline flush on a VFS-intensive load might well account for tens of
> > percent of the total cycles on modern cores.
> >
> > Just speculation.
>
> Check benchmarks.
>
> I looked at the mis-prediction counts mainly for a networking
> micro-benchmark almost 10 years ago.  They seemed to be among the least
> of the performance problems (the main ones were general bloat and cache
> misses).  I think the branch-predictor caches on even 10-year-old x86
> are quite large, enough to hold tens or hundreds of syscalls.  Otherwise
> performance would be lower than it is.
>
> Testing shows that the cache size is about 2048 on Athlon-XP.  I might be
> measuring just the size of the L1 Icache interacting with the branch
> predictor:
>
> The program is for i386 and needs some editing:
>
> X int
> X main(void)
> X {
> X 	asm("					\n\
> X 		pushal				\n\
> X 		movl	$192105,%edi		\n\
>
> Set this to $(sysctl -n machdep.tsc_freq) / 10000 to count cycles easily.
>
> X 	1:					\n\
> X 		# beware of clobbering in padding \n\
> X 		pushal				\n\
> X 		xorl	%eax,%eax		\n\
> X 		# repeat next many times, e.g., 2047 times on Athlon-XP \n\
> X 		jz	2f; .p2align 3; 2:	\n\
>
> With up to 2048 branches, each branch takes 2 cycles on Athlon-XP.
> After that, each branch takes 10.8 cycles.
>
> I don't understand why the alignment is needed, but without it each branch
> takes 9 cycles instead of 2 starting with just 2 jz's.

My guess is that this is the predictor granularity issue I alluded to
earlier.  E.g., Agner Fog's manuals state that for K8/K10:
==============================
The branch prediction mechanism allows no more than three taken branches
for every aligned 16-byte block of code.
==============================
This would explain it: a short jz is only 2 bytes, so without alignment
there are eight taken branches in each 16-byte block, while .p2align 3
pads each jz out to 8 bytes, leaving only two per block.

The benchmark does not check the cost of misprediction: since the test
consists only of the thread of branches, there is no speculative state
to unwind.

> "jmp" branches are not predicted any better than the always-taken "jz"
> branches.  Alignment is needed similarly.
>
> Change "jz" to "jnz" to see the speed with branches never taken.  This
> takes 2 cycles for any number of branches up to 8K, when the L1 Icache
> runs out.  Now the default prediction of not-taken is correct, so there
> are no mispredictions.
>
> The alignment costs 0.5 cycles with a small number of jnz's and 0.03
> cycles with a large number of jz's or jmp's.  It helps with a large
> number of jnz's.
>
> X 		popal				\n\
> X 		decl	%edi			\n\
> X 		jne	1b			\n\
> X 		popal				\n\
> X 	");
> X 	return (0);
> X }
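
As for the mis-prediction counter I wondered about above: on FreeBSD this
can be counted with hwpmc(4), which provides generic event aliases, so
something like

	pmcstat -p branches -p branch-mispredicts ./branchtest

should report both counts for a test program.  ("./branchtest" is a
made-up name, and which aliases are actually available depends on the
CPU.)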
>
> Timing on Haswell:
> - Haswell only benefits slightly from the alignment and reaches full
>   speed with ".p2align 2"
> - 1 cycle instead of 2 for branch-not-taken
> - 2.1 cycles instead of 2 minimum for branch-taken
> - predictor cache size 4K instead of 2K
> - 12 cycles instead of 10.8 for branches mispredicted by the default for
>   more than 4K jz's.
>
> Bruce
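
P.S.  For anyone who wants to try this without the hand-editing, here is
my attempt at a self-contained version of the same measurement.  It is a
sketch, not Bruce's exact program: it assumes GNU as (.rept/.endr does
the repetition instead of manual duplication), reads the TSC directly
instead of relying on the tsc_freq trick, and declares clobbers instead
of using pushal, so it should also build on amd64.  Vary NBRANCH and the
.p2align to reproduce the different cases; change jz to jnz for the
never-taken numbers.

#include <stdint.h>
#include <stdio.h>

#define	NBRANCH	2048		/* number of jz's; vary this */
#define	NITER	100000
#define	STR1(x)	#x
#define	STR(x)	STR1(x)

/* Note: rdtsc is not serializing; good enough at these iteration counts. */
static inline uint64_t
rdtsc(void)
{
	uint32_t lo, hi;

	__asm__ __volatile__("rdtsc" : "=a" (lo), "=d" (hi));
	return ((uint64_t)hi << 32 | lo);
}

int
main(void)
{
	uint64_t t0, t1;
	int i;

	t0 = rdtsc();
	for (i = 0; i < NITER; i++) {
		__asm__ __volatile__(
		    "	xorl	%%eax,%%eax	\n\t"	/* ZF=1: jz taken */
		    "	.rept	" STR(NBRANCH) "\n\t"
		    "	jz 2f; .p2align 3; 2:	\n\t"	/* beware padding */
		    "	.endr			\n\t"
		    : : : "eax", "cc");
	}
	t1 = rdtsc();
	printf("%.2f cycles/branch\n",
	    (t1 - t0) / ((double)NITER * NBRANCH));
	return (0);
}

With the numbers quoted above, one would expect about 2 cycles per branch
while NBRANCH fits in the predictor (about 2K on Athlon-XP, 4K on Haswell)
and the mispredicted cost beyond that.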