Date: Mon, 23 Jan 2017 04:00:19 +1100 (EST)
From: Bruce Evans <brde@optusnet.com.au>
To: Konstantin Belousov <kostikbel@gmail.com>
Cc: Bruce Evans <brde@optusnet.com.au>, Mateusz Guzik <mjguzik@gmail.com>,
    Mateusz Guzik <mjg@freebsd.org>, src-committers@freebsd.org,
    svn-src-all@freebsd.org, svn-src-head@freebsd.org
Subject: Re: svn commit: r312600 - head/sys/kern
Message-ID: <20170123015806.E1391@besplex.bde.org>
In-Reply-To: <20170122125716.GD2349@kib.kiev.ua>
References: <201701211838.v0LIcHIv072626@repo.freebsd.org>
    <20170121195114.GA2349@kib.kiev.ua>
    <20170122115716.T953@besplex.bde.org>
    <20170122092228.GC20930@dft-labs.eu>
    <20170122224849.E897@besplex.bde.org>
    <20170122125716.GD2349@kib.kiev.ua>
On Sun, 22 Jan 2017, Konstantin Belousov wrote:

> On Sun, Jan 22, 2017 at 11:41:09PM +1100, Bruce Evans wrote:
>> On Sun, 22 Jan 2017, Mateusz Guzik wrote:
>>> ...
>>> I have to disagree about the usefulness remark. If you check generated
>>> assembly for amd64 (included below), you will see the uncommon code is
>>> moved out of the way and in particular there are no forward jumps in the
>>> common case.
>>
>> Check benchmarks.  A few cycles here and there are in the noise.  Kernel
>> code has very few possibilities for useful optimizations since it doesn't
>> have many inner loops.
>>
>>> With __predict_false:
>>>
>>> [snip prologue]
>>>    0xffffffff8084ecaf <+31>:	mov    0x24(%rbx),%eax
>>>    0xffffffff8084ecb2 <+34>:	test   $0x40,%ah
>>>    0xffffffff8084ecb5 <+37>:	jne    0xffffffff8084ece2 <vn_closefile+82>
>>
>> All that this does is as you say -- invert the condition to jump to the
>> uncommon code.  This made more of a difference on old CPUs (possibly
>> still on low end/non-x86).  Now branch predictors are too good for the
>> slow case to be much slower.
>>
>> I think x86 branch predictors initially predict forward branches as not
>> taken and backward branches as taken.  So this branch was initially
>> mispredicted, and the change fixes this.  But the first branch doesn't
>> really matter.  Starting up takes hundreds or thousands of cycles for
>> cache misses.
> This is only true if the branch predictor memory is large enough to keep
> the state for the given branch between exercisings of it.  Even if the
> predictor state could be attached to every byte in the icache, or more
> likely, every line in the icache or uop cache, it is still probably too
> small to survive between user->kernel transitions for syscalls.  There
> might be a performance counter which shows branch predictor
> mis-predictions.
>
> In other words, I suspect that almost all cases might be mis-predictions
> without the manual hint, and mis-predictions together with the full
> pipeline flush on a VFS-intensive load very well might account for tens
> of percent of the total cycles on modern cores.
>
> Just speculation.  Check benchmarks.

I looked at the mis-prediction counts mainly for a networking
micro-benchmark almost 10 years ago.  They seemed to be among the least
of the performance problems (the main ones were general bloat and cache
misses).

I think the branch-predictor caches on even 10-year old x86 are quite
large, enough to hold tens or hundreds of syscalls.  Otherwise performance
would be lower than it is.

Testing shows that the cache size is about 2048 on Athlon-XP (or I might
be measuring just the size of the L1 Icache interacting with the branch
predictor).  The program is for i386 and needs some editing:

X int
X main(void)
X {
X 	asm("				\n\
X 	pushal				\n\
X 	movl	$192105,%edi		\n\

Set this to $(sysctl -n machdep.tsc_freq) / 10000 to count cycles easily.

X 1:					\n\
X 	# beware of clobbering in padding \n\
X 	pushal				\n\
X 	xorl	%eax,%eax		\n\
X 	# repeat next many times, e.g., 2047 times on Athlon-xp \n\
X 	jz	2f; .p2align 3; 2:	\n\

With up to 2048 branches, each branch takes 2 cycles on Athlon-XP.
After that, each branch takes 10.8 cycles.  I don't understand why the
alignment is needed, but without it each branch takes 9 cycles instead
of 2, starting with just 2 jz's.

"jmp" branches are not predicted any better than the always-taken "jz"
branches.  Alignment is needed similarly.

Change "jz" to "jnz" to see the speed with branches never taken.  This
takes 2 cycles for any number of branches up to 8K, when the L1 Icache
runs out.
Now the default prediction of not-taken is correct, so there are no
mispredictions.  The alignment costs 0.5 cycles with a small number of
jnz's and 0.03 cycles with a large number of jz's or jmp's.  It helps
with a large number of jnz's.

X 	popal				\n\
X 	decl	%edi			\n\
X 	jne	1b			\n\
X 	popal				\n\
X 	");
X 	return (0);
X }

Timing on Haswell:
- Haswell only benefits slightly from the alignment and reaches full
  speed with ".p2align 2"
- 1 cycle instead of 2 for branch-not-taken
- 2.1 cycles instead of 2 minimum for branch-taken
- predictor cache size 4K instead of 2K
- 12 cycles instead of 10.8 for branches mispredicted by the default for
  more than 4K jz's.

Bruce
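To turn the benchmark's run time into the cycle counts above: if %edi is
initialized to tsc_freq / 10000 as suggested, and assuming the core
actually runs at the TSC frequency (no throttling) and ignoring the
pushal/popal/decl loop overhead, the wall-clock time in seconds times
10000 is roughly the cycle count per loop iteration; dividing that by the
number of jz's gives cycles per branch.

For comparison with the vn_closefile disassembly quoted above, here is a
minimal userland sketch of the __predict_false() pattern.  The macro
definitions mirror the __builtin_expect() ones in sys/sys/cdefs.h so the
example builds outside the kernel; the FV_SLOWPATH flag and the
fast_path()/slow_path() helpers are made-up placeholders, not the actual
vn_closefile() code.

#include <stdio.h>

/* Same bodies as the kernel's macros in sys/sys/cdefs.h. */
#define	__predict_true(exp)	__builtin_expect((exp), 1)
#define	__predict_false(exp)	__builtin_expect((exp), 0)

/* Hypothetical flag, standing in for the bit tested at 0x24(%rbx). */
#define	FV_SLOWPATH	0x4000

static int
slow_path(int x)
{
	return (x * 3);		/* placeholder for the uncommon work */
}

static int
fast_path(int x)
{
	return (x + 1);		/* placeholder for the common work */
}

int
handle(int flags, int x)
{
	/*
	 * With the hint, compilers normally move the slow_path() call
	 * out of the fall-through path, so the common case takes the
	 * not-taken direction of a forward branch, as in the quoted
	 * vn_closefile output.
	 */
	if (__predict_false(flags & FV_SLOWPATH))
		return (slow_path(x));
	return (fast_path(x));
}

int
main(void)
{
	printf("%d %d\n", handle(0, 41), handle(FV_SLOWPATH, 14));
	return (0);
}

Compiling this with -O2 and disassembling handle() should show the
slow_path() call placed after the common return path, much like the jne
to vn_closefile+82 above; without the hint the layout is left to the
compiler's own heuristics, and whether the difference is measurable is
exactly what the benchmarks are for.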