Date: Sun, 22 Jan 2017 23:41:09 +1100 (EST)
From: Bruce Evans <brde@optusnet.com.au>
To: Mateusz Guzik <mjguzik@gmail.com>
Cc: Bruce Evans <brde@optusnet.com.au>, Konstantin Belousov <kostikbel@gmail.com>, Mateusz Guzik <mjg@freebsd.org>, src-committers@freebsd.org, svn-src-all@freebsd.org, svn-src-head@freebsd.org
Subject: Re: svn commit: r312600 - head/sys/kern
Message-ID: <20170122224849.E897@besplex.bde.org>
In-Reply-To: <20170122092228.GC20930@dft-labs.eu>
References: <201701211838.v0LIcHIv072626@repo.freebsd.org> <20170121195114.GA2349@kib.kiev.ua> <20170122115716.T953@besplex.bde.org> <20170122092228.GC20930@dft-labs.eu>
On Sun, 22 Jan 2017, Mateusz Guzik wrote:

> On Sun, Jan 22, 2017 at 12:07:56PM +1100, Bruce Evans wrote:
> ...
>> Later commits further unimproved style by adding __predict_ugly() and
>> long lines from blind substitution of that.  __predict_ugly() is almost
>> as useful as 'register', but more invasive.
>
> There were no __predict changes to this function.
>
> I did add some to vn_closefile in r312602.
>
> Indeed, one line is now 82 chars and arguably could be modified to fit
> the limit.
>
> I have to disagree about the usefulness remark. If you check generated
> assembly for amd64 (included below), you will see the uncommon code is
> moved out of the way and in particular there are no forward jumps in the
> common case.

Check benchmarks.  A few cycles here and there are in the noise.  Kernel
code has very few possibilities for useful optimizations since it doesn't
have many inner loops.

> With __predict_false:
>
> [snip prologue]
> 0xffffffff8084ecaf <+31>:   mov    0x24(%rbx),%eax
> 0xffffffff8084ecb2 <+34>:   test   $0x40,%ah
> 0xffffffff8084ecb5 <+37>:   jne    0xffffffff8084ece2 <vn_closefile+82>

All that this does is as you say -- invert the condition to jump to the
uncommon code.  This made more of a difference on old CPUs (possibly still
on low end/non-x86).  Now branch predictors are too good for the slow case
to be much slower.

I think x86 branch predictors initially predict forward branches as not
taken and backward branches as taken.  So this branch was initially
mispredicted, and the change fixes this.  But the first branch doesn't
really matter.  Starting up takes hundreds or thousands of cycles for
cache misses.

Moving the uncommon code out of the way reduces Icache pressure a bit.
I've never been able to measure a significant difference from this.
However, I usually compile with -Os to reduce Icache pressure.  This is
almost as good for speed as -O (typically at most 5% slower than -O2 for
kernels).  The other day, I tried compiling with -O2
-march=highest-supported and found that on a network microbenchmark which
was supposed to benefit from this, it was significantly slower.  It was
with a very old version of gcc.  I think this is because -O2 rearranges
the code a lot, and since at least the old version of gcc doesn't
understand caches at all, the arrangement accidentally ended up much worse
and caused more cache misses.  With another compiler and NIC, -O2 gave the
expected speedup.

> 0xffffffff8084ecb7 <+39>:   mov    0x24(%rbx),%esi
> 0xffffffff8084ecba <+42>:   mov    0x10(%rbx),%rdx
> 0xffffffff8084ecbe <+46>:   mov    %r14,%rdi
> 0xffffffff8084ecc1 <+49>:   mov    %r15,%rcx
> 0xffffffff8084ecc4 <+52>:   callq  0xffffffff80850000 <vn_close>
> 0xffffffff8084ecc9 <+57>:   mov    %eax,%r15d
> 0xffffffff8084eccc <+60>:   mov    0x24(%rbx),%eax
> 0xffffffff8084eccf <+63>:   test   $0x40,%ah
> 0xffffffff8084ecd2 <+66>:   jne    0xffffffff8084ecf5 <vn_closefile+101>
> [snip cleanup]
> 0xffffffff8084ece1 <+81>:   retq

The code for the uncommon case is now here (not shown).  It will be no
smaller (actually slightly larger for a branch back) unless it can be
shared.

> Without __predict_false:
> [snip prologue]
> 0xffffffff8084ecaf <+31>:   mov    0x24(%rbx),%eax
> 0xffffffff8084ecb2 <+34>:   test   $0x40,%ah
> 0xffffffff8084ecb5 <+37>:   je     0xffffffff8084ecc8 <vn_closefile+56>

The branch is not very far away, so perhaps the optimization from the
move is null.
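As an aside, the effect is easy to reproduce in userland.  Compiling a toy
like the one below with -O2 and disassembling it typically shows the same
kind of inversion: the annotated branch becomes a forward jump to an
out-of-line block.  The macro bodies match sys/cdefs.h; the flag and the
close function are only illustrative stand-ins, not the committed
vn_closefile() code.

	#include <stdio.h>

	/* Same definitions as in sys/cdefs.h. */
	#define	__predict_true(exp)	__builtin_expect((exp), 1)
	#define	__predict_false(exp)	__builtin_expect((exp), 0)

	/* Illustrative stand-in for a rarely set flag; not the real f_flag bit. */
	#define	DEMO_HASLOCK	0x4000

	static void
	demo_slow_path(void)
	{

		printf("uncommon case\n");
	}

	int
	demo_close(int flags)
	{

		/* With the hint, the call below is normally emitted out of line. */
		if (__predict_false(flags & DEMO_HASLOCK))
			demo_slow_path();
		return (0);
	}

	int
	main(void)
	{

		return (demo_close(0));
	}

Note the hint is purely advisory: it changes code layout and static
prediction, not semantics.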
> 0xffffffff8084ecb7 <+39>:   movzwl 0x20(%rbx),%eax
> 0xffffffff8084ecbb <+43>:   cmp    $0x1,%eax
> 0xffffffff8084ecbe <+46>:   jne    0xffffffff8084ecc8 <vn_closefile+56>
> 0xffffffff8084ecc0 <+48>:   mov    %r14,%rdi
> 0xffffffff8084ecc3 <+51>:   callq  0xffffffff8083e270 <vrefact>

The branch was to here <+56>.

The function is not very well aligned.  It has address 0x90 mod 0x100, so
it starts 0x10 into a 64-byte cache line (is that still the line size for
all modern x86?).  So 48 bytes are in that line, but this +56 is in the
next line.

But aligning functions and branches to cache line boundaries works poorly
in general.  It wastes Icache in another way.  -march affects this a lot,
and -march=higher is often worse since the compiler doesn't really
understand this and does more or less alignment than is good.  Here it
didn't bother aligning even 1 of the branch targets.

> 0xffffffff8084ecc8 <+56>:   mov    0x24(%rbx),%esi
> 0xffffffff8084eccb <+59>:   mov    0x10(%rbx),%rdx
> 0xffffffff8084eccf <+63>:   mov    %r14,%rdi
> 0xffffffff8084ecd2 <+66>:   mov    %r15,%rcx

The main code ends up just +17 later at <+69>.  Fitting it all below <+64>
would be better, but is not necessary if the instructions are slow enough
to allow the instruction fetch to run ahead.  This depends on good
prediction to not start too many instructions in paths that won't be
followed.

> 0xffffffff8084ecd5 <+69>:   callq  0xffffffff80850000 <vn_close>
> 0xffffffff8084ecda <+74>:   mov    %eax,%r15d
> 0xffffffff8084ecdd <+77>:   mov    0x24(%rbx),%eax
> 0xffffffff8084ece0 <+80>:   test   $0x40,%ah
> 0xffffffff8084ece3 <+83>:   je     0xffffffff8084ed45 <vn_closefile+181>
> 0xffffffff8084ece5 <+85>:   movzwl 0x20(%rbx),%eax
> 0xffffffff8084ece9 <+89>:   cmp    $0x1,%eax
> 0xffffffff8084ecec <+92>:   jne    0xffffffff8084ed45 <vn_closefile+181>
> 0xffffffff8084ecee <+94>:   movw   $0x0,-0x22(%rbp)
> [skip some parts, +181 marks the beginning of cleanup]
> 0xffffffff8084ed52 <+194>:  retq

Bruce
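P.S. The line-offset arithmetic above is easy to check from userland with
a throwaway like this.  It only prints how far a couple of entry points
sit into their cache line, and it assumes the 64-byte line size discussed
above; the function names are of course arbitrary.

	#include <stdint.h>
	#include <stdio.h>

	static void demo_f1(void) { }
	static void demo_f2(void) { }

	int
	main(void)
	{

		/* Entry point offset within an assumed 64-byte cache line. */
		printf("demo_f1: 0x%x into its line\n",
		    (unsigned)((uintptr_t)demo_f1 % 64));
		printf("demo_f2: 0x%x into its line\n",
		    (unsigned)((uintptr_t)demo_f2 % 64));
		return (0);
	}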