Date: Sun, 22 Jan 2017 19:10:26 +0200
From: Konstantin Belousov <kostikbel@gmail.com>
To: Bruce Evans <brde@optusnet.com.au>
Cc: Mateusz Guzik <mjguzik@gmail.com>, Mateusz Guzik <mjg@freebsd.org>, src-committers@freebsd.org, svn-src-all@freebsd.org, svn-src-head@freebsd.org
Subject: Re: svn commit: r312600 - head/sys/kern
Message-ID: <20170122171026.GH2349@kib.kiev.ua>
In-Reply-To: <20170123015806.E1391@besplex.bde.org>
References: <201701211838.v0LIcHIv072626@repo.freebsd.org> <20170121195114.GA2349@kib.kiev.ua> <20170122115716.T953@besplex.bde.org> <20170122092228.GC20930@dft-labs.eu> <20170122224849.E897@besplex.bde.org> <20170122125716.GD2349@kib.kiev.ua> <20170123015806.E1391@besplex.bde.org>
On Mon, Jan 23, 2017 at 04:00:19AM +1100, Bruce Evans wrote:
> On Sun, 22 Jan 2017, Konstantin Belousov wrote:
>
> > On Sun, Jan 22, 2017 at 11:41:09PM +1100, Bruce Evans wrote:
> >> On Sun, 22 Jan 2017, Mateusz Guzik wrote:
> >>> ...
> >>> I have to disagree about the usefulness remark. If you check generated
> >>> assembly for amd64 (included below), you will see the uncommon code is
> >>> moved out of the way and in particular there are no forward jumps in the
> >>> common case.
> >>
> >> Check benchmarks.  A few cycles here and there are in the noise.  Kernel
> >> code has very few possibilities for useful optimizations since it doesn't
> >> have many inner loops.
> >>
> >>> With __predict_false:
> >>>
> >>> [snip prologue]
> >>>    0xffffffff8084ecaf <+31>:   mov    0x24(%rbx),%eax
> >>>    0xffffffff8084ecb2 <+34>:   test   $0x40,%ah
> >>>    0xffffffff8084ecb5 <+37>:   jne    0xffffffff8084ece2 <vn_closefile+82>
> >>
> >> All that this does is as you say -- invert the condition to jump to the
> >> uncommon code.  This made more of a difference on old CPUs (possibly
> >> still on low end/non-x86).  Now branch predictors are too good for the
> >> slow case to be much slower.
> >>
> >> I think x86 branch predictors initially predict forward branches as not
> >> taken and backward branches as taken.  So this branch was initially
> >> mispredicted, and the change fixes this.  But the first branch doesn't
> >> really matter.  Starting up takes hundreds or thousands of cycles for
> >> cache misses.
> > This is only true if the branch predictor memory is large enough to keep
> > the state for the given branch between exercises of it.  Even if the
> > predictor state could be attached to every byte in the icache, or more
> > likely, every line in the icache or uop cache, it is still probably too
> > small to survive between user->kernel transitions for syscalls.  There
> > might be a performance counter which shows branch predictor
> > mis-predictions.
> >
> > In other words, I suspect that almost all cases might be mis-predictions
> > without the manual hint, and mis-predictions together with the full
> > pipeline flush on a VFS-intensive load very well might give tens of
> > percent of the total cycles on modern cores.
> >
> > Just speculation.
>
> Check benchmarks.
>
> I looked at the mis-prediction counts mainly for a networking
> micro-benchmark almost 10 years ago.  They seemed to be among the least of
> the performance problems (the main ones were general bloat and cache
> misses).  I think the branch-predictor caches on even 10-year old x86 are
> quite large, enough to hold tens or hundreds of syscalls.  Otherwise
> performance would be lower than it is.
>
> Testing shows that the cache size is about 2048 on Athlon-XP.  I might be
> measuring just the size of the L1 Icache interacting with the branch
> predictor:
>
> The program is for i386 and needs some editing:
>
> X int
> X main(void)
> X {
> X     asm("                               \n\
> X     pushal                              \n\
> X     movl    $192105,%edi                \n\
>
> Set this to $(sysctl -n machdep.tsc_freq) / 10000 to count cycles easily.
>
> X 1:                                      \n\
> X     # beware of clobbering in padding   \n\
> X     pushal                              \n\
> X     xorl    %eax,%eax                   \n\
> X     # repeat next many times, e.g., 2047 times on Athlon-xp \n\
> X     jz      2f; .p2align 3; 2:          \n\
>
> With up to 2048 branches, each branch takes 2 cycles on Athlon-XP.
> After that, each branch takes 10.8 cycles.
>
> I don't understand why the alignment is needed, but without it each branch
> takes 9 cycles instead of 2 starting with just 2 jz's.
My guess is that this is the predictor graining issue I alluded to earlier.
E.g.
Agner Fog's manuals state that for K8/K10:
==============================
The branch prediction mechanism allows no more than three taken branches
for every aligned 16-byte block of code.
==============================
The benchmark does not check the cost of misprediction: since the test
consists only of the thread of branches, there is no speculative state to
unwind.

> "jmp" branches are not predicted any better than the always-taken "jz"
> branches.  Alignment is needed similarly.
>
> Change "jz" to "jnz" to see the speed with branches never taken.  This
> takes 2 cycles for any number of branches up to 8K when the L1 Icache
> runs out.  Now the default prediction of not-taken is correct, so there
> are no mispredictions.
>
> The alignment costs 0.5 cycles with a small number of jnz's and 0.03
> cycles with a large number of jz's or jmp's.  It helps with a large
> number of jnz's.
>
> X     popal                               \n\
> X     decl    %edi                        \n\
> X     jne     1b                          \n\
> X     popal                               \n\
> X     ");
> X     return (0);
> X }
>
> Timing on Haswell:
> - Haswell only benefits slightly from the alignment and reaches full
>   speed with ".p2align 2"
> - 1 cycle instead of 2 for branch-not-taken
> - 2.1 cycles instead of 2 minimum for branch-taken
> - predictor cache size 4K instead of 2K
> - 12 cycles instead of 10.8 for branches mispredicted by the default for
>   more than 4K jz's.
>
> Bruce
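For reference, a minimal sketch of the hint being discussed.  The macro
bodies below match what FreeBSD's <sys/cdefs.h> expands them to; the struct,
flag and function names are invented for illustration and are not the actual
vn_closefile() code.

    /*
     * Minimal, self-contained sketch of __predict_false (illustrative
     * names only; the real macros live in <sys/cdefs.h>).
     */
    #define __predict_true(exp)     __builtin_expect((exp), 1)
    #define __predict_false(exp)    __builtin_expect((exp), 0)

    #define EXAMPLE_RARE_FLAG       0x4000  /* hypothetical, rarely-set flag */

    struct example_file {
            unsigned int    f_flag;
    };

    int
    example_close(struct example_file *fp)
    {
            /*
             * With the hint, the compiler tends to place the rare block
             * out of line, so the common path is straight-line code whose
             * conditional branch falls through (not taken), as in the
             * amd64 disassembly quoted above.
             */
            if (__predict_false((fp->f_flag & EXAMPLE_RARE_FLAG) != 0))
                    return (1);     /* uncommon path */
            return (0);             /* common path */
    }

Compiling this at -O2 with and without the hint typically shows the
conditional jump's sense and the block placement flip, which is the only
effect being argued about here.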
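And for anyone who wants to reproduce the Athlon-XP/Haswell numbers, here is
one way the "needs some editing" parts of the quoted program could be filled
in.  This is only a sketch under assumptions: LOOPS must be hand-edited to
$(sysctl -n machdep.tsc_freq) / 10000 as Bruce notes, the .rept directive
stands in for manually repeating the jz line, and the program keeps the
i386 pushal/popal form, so it has to be built as a 32-bit binary.

    /*
     * Sketch of a filled-in version of the benchmark quoted above.
     * Build 32-bit, e.g.: cc -m32 -O0 -o bench bench.c
     * Then "time ./bench": user seconds * 10000 ~= cycles per iteration,
     * because the loop runs tsc_freq / 10000 times.
     */
    #define LOOPS   "192105"        /* sysctl -n machdep.tsc_freq / 10000 */
    #define NBRANCH "2047"          /* number of repeated branches */

    int
    main(void)
    {
            asm("                                           \n\
            pushal                                          \n\
            movl    $" LOOPS ",%edi                         \n\
    1:                                                      \n\
            # beware of clobbering in padding               \n\
            pushal                                          \n\
            xorl    %eax,%eax       # ZF=1: jz always taken \n\
            .rept " NBRANCH "                               \n\
            jz      2f; .p2align 3; 2:                      \n\
            .endr                                           \n\
            popal                                           \n\
            decl    %edi                                    \n\
            jne     1b                                      \n\
            popal                                           \n\
            ");
            return (0);
    }

Changing "jz" to "jnz" and varying NBRANCH and the .p2align value reproduces
the never-taken and alignment measurements described above.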