Date: Thu, 26 Mar 2015 12:50:48 +1100 (EST)
From: Bruce Evans <brde@optusnet.com.au>
To: Mateusz Guzik <mjguzik@gmail.com>
Cc: svn-src-head@freebsd.org, svn-src-all@freebsd.org, src-committers@freebsd.org, Bruce Evans <brde@optusnet.com.au>
Subject: Re: svn commit: r280407 - head/sys/kern
Message-ID: <20150326103908.G993@besplex.bde.org>
In-Reply-To: <20150325201524.GB14280@dft-labs.eu>
References: <201503240010.t2O0ACZb089906@svn.freebsd.org> <20150324135350.N1665@besplex.bde.org> <20150325201524.GB14280@dft-labs.eu>
On Wed, 25 Mar 2015, Mateusz Guzik wrote:

> On Tue, Mar 24, 2015 at 03:58:14PM +1100, Bruce Evans wrote:
>> On Tue, 24 Mar 2015, Mateusz Guzik wrote:
>>
>>> Log:
>>>   filedesc: microoptimize fget_unlocked by getting rid of fd < 0 branch
>>
>> This has no effect.  Compilers optimize to the equivalent of the
>> unsigned cast hack if this is good.  On x86, it is good since no
>> instructions are needed for the conversion, and the only difference
>> between the code generated by
>>
>>     if (fd >= fdt->fdt_nfiles)
>>
>> and
>>
>>     if (fd < 0 || fd >= fdt->fdt_nfiles)
>>
>> is to change from jl (jump if less than) to jb (jump if below) (both
>> jumps over the if clause).  Negative values satisfy jl but not jb.
>
> I would not commit the change if it did not affect generated assembly at
> least on amd64.

Interesting.  All compilers that I tested (gcc-4.2, gcc-4.8 and clang)
pessimize this.  clang generates especially fancy "optimizations" which
actually give pessimizations of 1-4 cycles for time and 15-25 bytes for
space (with larger time penalties from the space pessimization in real
use, where everything doesn't fit in the I-cache).  gcc-4.8 only gives a
1-cycle time pessimization plus the 5-10 bytes of space pessimization for
the extra instructions, with less fancy scheduling that is easier to
untangle to remove the 1-cycle pessimization.  Micro-benchmark program
below.

> if (fd < 0 || fd >= fdt->fdt_nfiles):
>    0xffffffff807d147d <fget_unlocked+13>:   sub    $0x38,%rsp
>    0xffffffff807d1481 <fget_unlocked+17>:   test   %esi,%esi
>    0xffffffff807d1483 <fget_unlocked+19>:   js     0xffffffff807d15b8 <fget_unlocked+328>

objdump output is of low quality, especially on 64-bit arches where the
0xff... numbers are twice as bloated.  This is the extra branch.

In the micro-benchmark, clang pre-loads a pointer to the volatile return
values and jumps to common code to load the value.  This is its main
pessimization relative to gcc.  clang also generates code that runs into
itself.  It generates 2 loads of one of the pointers.  This is mostly a
space pessimization, except it may be necessary for the rest of the fancy
scheduling to cost only 1 cycle.  Here the return value for the error case
is constant and the return value for the other case is variable, so this
pessimization is inhibited.  In the micro-benchmark with the return values
changed to non-volatile, clang does a fancier "optimization" involving
pre-loading the values, then conditional moves.  This gives an additional
pessimization of 1 cycle.

> 0xffffffff807d1489 <fget_unlocked+25>:   mov    (%rdi),%rbx
> 0xffffffff807d148c <fget_unlocked+28>:   cmp    %esi,(%rbx)

Here rbx is naturally a pointer, so this is the best comparison
instruction.  In the micro-benchmark, at least with the volatile limit,
both clang and gcc-4.8 load a pointer to the limit instead of doing a cmp
on the memory variable.  Old versions of gcc know that this is usually
bad, and have options -fforce-addr and -fforce-mem to change the default
of not doing this.  In the micro-benchmark, this just wastes space.  In
real use, loading the pointer to volatile data or loading non-volatile
data is best if the variable is accessed more than once and the register
pressure is not large.
> 0xffffffff807d148e <fget_unlocked+30>:   jle    0xffffffff807d15bf <fget_unlocked+335>
>
> if ((u_int)fd >= fdt->fdt_nfiles):
>    0xffffffff807d147d <fget_unlocked+13>:   sub    $0x38,%rsp
>    0xffffffff807d1481 <fget_unlocked+17>:   mov    (%rdi),%rbx
>    0xffffffff807d1484 <fget_unlocked+20>:   cmp    %esi,(%rbx)
>    0xffffffff807d1486 <fget_unlocked+22>:   jbe    0xffffffff807d15a8 <fget_unlocked+312>
>
> I did not check other archs prior to the commit.
>
> This is clang 3.6 as present in head.  Sources compiled with -O2.  Also
> see below for other compiler test.
>
>>> Casting fd to an unsigned type simplifies fd range comparison to mere
>>> checking if the result is bigger than the table.
>>
>> No, it obfuscates the range comparison.
>
> It is a standard hack which is hard to misread and which seems to add a
> slight benefit (see below).

Because the compiler doesn't actually do it.  I'm sure compilers sometimes
do it.  Apparently just when the top limit is constant.

> ...
> It affects assembly on all arm, powerpc64 and mips64 as well.  Both arm
> and powerpc just get rid of zero-test and use the same instructions to
> perform the other comparison.  I only found a difference on mips64 which
> used sltu instead of slt (but still got rid of the zero-check).

Getting rid of the zero test should give what I want.  The code might
differ due to register allocation.  Are arm and powerpc still using gcc?

> [..]
>> - fget_locked() seems to be unchanged recently, so it doesn't have the
>>   obfuscation.  fget_unlocked() is newer than the unobfuscated version
>>   of fget_locked().  It is in a different file, but may have copied the
>>   unobfuscated range check from fget_locked().  This commit restores the
>>   obfuscation to it alone.
>
> This definitely should be synced one way or the other, thanks for
> pointing this out.

To the unobfuscated version :-).

> I wrote a toy program and checked e.g. gcc5 and it still did not
> optimise zero-test away.
>
> In fact I would argue the optimisation in question is impossible unless
> upper limit check is against a constant in (0, INT_MAX) range or against
> a var whose range is known at compile time.

Argh.  That must be it, or possibly -fno-wrapv.  I get the same results
with and without wrapv.

> In particular this is problematic for negative values.
>
> Consider:
> int fd, limit;
> ............
> if (fd < 0 || fd >= limit)
>
> Let's have fd = -5 and limit = -4.
>
> Should the fd < 0 check be dropped by the compiler and the expression
> turned into (u_int)fd >= limit, the comparison would be false thus
> changing the outcome.

Wrapv doesn't affect this.  Compilers can use wrapping behaviour
internally for some things, but not for signed comparisons.  E.g., on x86
for (fd >= limit) they can check the flags after (fd - limit).  The flags
are the same as after cmp(fd, limit).  But the non-wrapping flag (ge) must
be used, not the wrapping flag (ae).

> As such, I prefer to keep the cast.

Now I dislike the cast even more.  The sign conversions are especially
delicate.  If the limit were u_int, then the sign conversions would happen
without the cast, and are even more confusing since they are implicit.
The limits for fd's are intentionally signed, the same as for fd's, since
otherwise there would be lots of confusing automatic conversions and
compiler warnings for some of them.  BTW, sysctl limits like maxfiles and
maxfilesperproc are signed.
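
To make the fd = -5, limit = -4 case concrete, here is a minimal
standalone sketch of the two range checks (this is not the micro-benchmark
below, and the helper names are only illustrative):

    #include <stdio.h>

    /* The unobfuscated check: rejects every negative fd explicitly. */
    static int
    range_check_signed(int fd, int limit)
    {
        return (fd < 0 || fd >= limit);    /* 1 == out of range */
    }

    /* The unsigned cast hack: only equivalent when limit >= 0. */
    static int
    range_check_unsigned(int fd, int limit)
    {
        return ((unsigned int)fd >= (unsigned int)limit);
    }

    int
    main(void)
    {
        /*
         * Prints "1 0": the signed form rejects fd = -5, but the
         * unsigned form accepts it once the limit is negative too.
         */
        printf("%d %d\n", range_check_signed(-5, -4),
            range_check_unsigned(-5, -4));
        return (0);
    }

So the compiler may only merge the two signed tests into the single
unsigned comparison when it can prove that the limit is non-negative,
e.g. a constant in (0, INT_MAX), as noted above.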
Foot-shooting is correctly allowed for these sysctl limits, like for most
sysctls, so they can be set to preposterous negative limits as well as to
preposterously large positive ones, unlike for the corresponding rlimit.
So the comparison (fd < (limit = -4)) may happen in practice.

In practice, other sign extension bugs usually occur first.  The first
reference to maxfiles* in kern_descrip.c has 3 and a half sign
extension/overflow bugs, including one for maxfilesperproc.  It uses min()
to convert maxfilesperproc to u_int.  It does the same for the other arg
after first truncating the arg's value from 64 bits signed to 32 bits
signed (not even to 32 bits unsigned to give defined behaviour and to
match min()'s type).  min() gives a 32-bit unsigned value.  This is
assigned to td_retval[0], which is 32 bits signed on 32-bit arches and 64
bits signed on 64-bit arches, so there is only a bug on half of the arches
for the final step.  Problems don't occur in practice because the rlimit
is clamped to well below 0x7fffffff.  Values near RLIM_INFINITY are
allowed for most limits but not the files limit.  Pseudo-infinite values
like 0x7fffffff00000001 would become 1 after truncation.

Micro-benchmark program:

X #include <stdio.h>
X
X volatile int err = 1;
X volatile int ok = 0;
X volatile int toplimit = 20;
X
X int __noinline
X foo(int x)
X {
X     if (x < 0 || x >= toplimit)
X         return (err);
X     return (ok);
X }
X
X int
X main(void)
X {
X     int errcount, i;
X
X     errcount = 0;
X     for (i = 0; i < 266681953; i++)    /* the magic number is tsc_freq/10 */
X         errcount += foo(i);
X     printf("errcount = %d\n", errcount);
X }

Since the optimization that I wanted is invalid, this is just a benchmark
of how far away the compiler is from generating optimal code.  Timing on
freefall:

clang:   7 cycles
clang:   7 cycles (with manual editing... didn't help, tended to harm)
clang:   6 cycles (with above changed to use unsigned hack)
gcc-4.8: 6 cycles
gcc-4.8: 5 cycles (with manual editing of asm output to get the invalid opt)
gcc-4.8: 5 cycles (with above changed to use unsigned hack)

This is with -O.  Other "optimizations" are mostly pessimizations:

clang -O2: no change
gcc-4.8 -O2: 1 cycle slower
clang with __predict_true(): 1 cycle slower
clang with __predict_false(): 2 cycles slower (slower with the correct pred!)
gcc-4.8 with __predict_true() or __predict_false() in above: no change

Adding __predict_*() to the above with the unsigned hacks makes no further
change.

The bad code generated by clang for the above is:

X 	movl	$err, %eax
X 	testl	%edi, %edi
X 	js	.LBB0_3
X # BB#1:                               # %lor.lhs.false
X 	movl	toplimit(%rip), %ecx
X 	movl	$ok, %eax
X 	cmpl	%edi, %ecx
X 	jg	.LBB0_3
X # BB#2:                               # %select.mid
X 	movl	$err, %eax
X .LBB0_3:                              # %return
X 	movl	(%rax), %eax
X 	popq	%rbp
X 	retq

This uses fancy scheduling of a bad method to lose by only 1 cycle
relative to gcc.  The loss is from pre-loading pointers so as to jump to a
common load at the end.  It seems stupid to load $err twice, but this
saves 1 cycle.

__predict_false() gives the following changes:

X --- z.s~	2015-03-26 01:11:17.825573000 +0000
X +++ z.s	2015-03-26 01:10:47.728912000 +0000
X @@ -19,10 +19,10 @@
X  	js	.LBB0_3
X -# BB#1:                               # %lor.lhs.false
X +# BB#1:                               # %lor.rhs
X  	movl	toplimit(%rip), %ecx
X -	movl	$ok, %eax
X +	movl	$err, %eax
X  	cmpl	%edi, %ecx
X -	jg	.LBB0_3
X +	jle	.LBB0_3
X  # BB#2:                               # %select.mid
X -	movl	$err, %eax
X -.LBB0_3:                              # %return
X +	movl	$ok, %eax
X +.LBB0_3:                              # %lor.end
X  	movl	(%rax), %eax

This costs 2 cycles.
It leaves a doubled load of $err which makes no sense now, since the code
no longer runs into itself and the second load is redundant.  The extra
instruction might be useful as padding, but is actually negatively useful.
Removing it reduces the loss to 1 cycle relative to the unpredicted
version, by giving essentially the same code as the unpredicted version
with my editing to avoid the apparently-unnecessary doubled load.

__predict_true() loses only 1 cycle, essentially by producing the above
without the extra instruction.  That is, it takes 8 cycles: 2 more than
gcc and 3 more than the best seen with the unsigned hack.

These cycle counts intentionally include many cycles of loop and function
call overhead.  I used a function and many volatile variables to prevent
most things being calculated at compile time.  With __inline instead of
__noinline, most or all variants take about 2.7 cycles with clang (I think
2 cycles is the minimum loop overhead, and other operations run in
parallel with the loop and themselves).  gcc-4.8 then takes 2.7 cycles
with the unsigned hack but 3.5 cycles without it.  gcc's not so fancy
scheduling loses instead of wins when it is inlined.

In real code, the optimizer has even more chances to do fancy scheduling
to hide the latency of extra branches, but 1 more branch may thrash branch
prediction caches.  Anyway, you won't be able to measure 1-cycle
optimizations in real code.

Bruce