Date: Sat, 5 May 2018 18:08:14 +0200
From: Mateusz Guzik <mjguzik@gmail.com>
To: Bruce Evans <brde@optusnet.com.au>
Cc: Conrad Meyer <cem@freebsd.org>, Brooks Davis <brooks@freebsd.org>, src-committers <src-committers@freebsd.org>, svn-src-all@freebsd.org, svn-src-head@freebsd.org
Subject: Re: svn commit: r333240 - in head/sys: powerpc/powerpc sys
Message-ID: <CAGudoHHKvPKiprG36i=JEWsL39JWB1fZ86QH2Y7DwaJK0=WdLQ@mail.gmail.com>
In-Reply-To: <20180505090954.X1307@besplex.bde.org>
References: <201805040400.w4440moH025057@repo.freebsd.org> <20180504155301.GA56280@spindle.one-eyed-alien.net> <CAG6CVpUr_2b-yT1-uExcY1Tvpg=-P3y_owNHQ0UPg604to8Y0Q@mail.gmail.com> <20180505090954.X1307@besplex.bde.org>

On Sat, May 5, 2018 at 2:38 AM, Bruce Evans <brde@optusnet.com.au> wrote:

> I don't believe the claimed speeding of using the optimized bcopy()
> in cpu_fetch_sycall_args() on amd64 (r333241).  This changes the copy
> size from a variable to 6*8 bytes.  It is claimed to speed up getppid()
> from 7.31M/sec to 10.65M/sec on Skylake.  But here it and other changes
> in the last week give a small pessimization for getpid() on Haswell
> at 4.08GHz: last week, 100M getpid()'s took 8.27 seconds (12.09M/sec).
> Now they take 8.30 seconds (12.05M/sec).  (This is with very old
> libraries, so there is no possibility of getpid() being done in
> userland.)  0.03 seconds is 122.4M cycles.  So the speed difference
> is 1.224 cycles.  Here the timing has a resolution of only 0.01 seconds,
> so most digits in this 1.224 are noise, but the difference is apparently
> a single cycle.  I would have expected more like the "rep movs*" setup
> overhead of 25 cycles.

The mail below only deals with performance claims on amd64. I'll see
about gcc 4.2.1 vs 32-bit archs and the other claims later.

It is unclear to me whether you actually benchmarked syscall performance
before and after the change, or how you measured a slowdown in getpid.
This mail outlines what was tested and how.

If you want, you can mail me off list and we can arrange for you to get
root access to the test box, so that you can measure things yourself,
boot your own kernel and whatnot.

My preferred way of measurement is with this suite:
https://github.com/antonblanchard/will-it-scale

Unfortunately it requires a little bit of patching to set up. The upside
is cheap processing: there is a dedicated process/thread which collects
results once a second; other than that, the processes/threads running
the test only do the test and bump the iteration counter.

getppid looks like this:

        while (1) {
                getppid();
                (*iterations)++;
        }

If you are interested in trying it out yourself without getting access
to the box in question, I'm happy to provide a bundle which should be
easily compilable.

Perhaps you are using the tool which can be found here:
tools/tools/syscall_timing

It reports significantly lower numbers (even 20% less) because the test
loop itself simply has more overhead.

For the first testing method the results are as I wrote in the commit
message, with one caveat: I disabled frequency scaling and they went
down a little bit (i.e. NOT boosted freq, but having it disabled makes
things a tad slower; *boosted* numbers are significantly bigger but
also not reliable).

The syscall path has a long-standing pessimization of using a function
pointer to get to the arguments. This fact was utilized to provide
different implementations switchable at runtime (thanks to kgdb -w and
the set command).

The variants include:

1. cpu_fetch_syscall_args

This is the routine as present in head, i.e. with inlined memcpy.

2. cpu_fetch_syscall_args_bcopy

State prior to my changes, i.e. a bcopy call with a variable size.

3. cpu_fetch_syscall_args_oolmemcpy

A memcpy with a constant size, forced to not be inlined; the code
itself is the same as for regular memcpy.

4. cpu_fetch_syscall_args_oolmemcpy_es

A memcpy with a constant size, forced to not be inlined; the code
itself was modified to utilize the 'Enhanced REP MOVSB/STOSB' bit
present on Intel cpus made past 2011 or so.

The code can be found here:
https://people.freebsd.org/~mjg/copyhacks.diff

(note: declarations were not added, so WERROR= or similar is needed to
get this to compile)

Variants were switched at runtime like this:

# kgdb -w
(kgdb) set elf64_freebsd_sysvec.sv_fetch_syscall_args=cpu_fetch_syscall_args_bcopy

The frequency is fixed with:

# sysctl dev.cpu.0.freq=2100

PTI was disabled (vm.pmap.pti=0 in loader.conf).

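For readers who want to picture the switching mechanism, here is a rough userland sketch. It is not the kernel code: `struct sysvec_sketch`, `fetch_args_variable`, `fetch_args_constant`, `fetch_args` and `NARGS` are hypothetical names invented for illustration. Only the idea comes from the text above: callers reach the copy routine through a function pointer, so the pointer can be repointed at a variable-size copy or at a constant 6*8-byte copy.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#define NARGS 6	/* assumption: six 8-byte arguments, i.e. the 6*8 bytes above */

/* Hypothetical stand-in for the function pointer slot in the sysvec. */
typedef void (*fetch_args_fn)(uint64_t *dst, const uint64_t *src, size_t narg);

struct sysvec_sketch {
	fetch_args_fn sv_fetch_args;
};

/* Variant with a run-time copy size, in the spirit of the old bcopy call. */
void
fetch_args_variable(uint64_t *dst, const uint64_t *src, size_t narg)
{
	assert(narg <= NARGS);
	memmove(dst, src, narg * sizeof(uint64_t));
}

/*
 * Variant with a compile-time constant size: narg is ignored and all
 * NARGS slots are copied, which lets a compiler expand the memcpy inline.
 */
void
fetch_args_constant(uint64_t *dst, const uint64_t *src, size_t narg)
{
	(void)narg;
	memcpy(dst, src, NARGS * sizeof(uint64_t));
}

/* Callers always go through the pointer, so variants can be swapped. */
void
fetch_args(const struct sysvec_sketch *sv, uint64_t *dst,
    const uint64_t *src, size_t narg)
{
	sv->sv_fetch_args(dst, src, narg);
}
```

Assigning a different function to `sv_fetch_args` plays the role of the kgdb `set` command in this sketch; the indirect call itself stays in place either way.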
Now, quick numbers from will-it-scale:

(kgdb) set elf64_freebsd_sysvec.sv_fetch_syscall_args=cpu_fetch_syscall_args_bcopy

min:7017152 max:7017152 total:7017152
min:7023115 max:7023115 total:7023115
min:7018879 max:7018879 total:7018879

(kgdb) set elf64_freebsd_sysvec.sv_fetch_syscall_args=cpu_fetch_syscall_args

min:9914511 max:9914511 total:9914511
min:9915234 max:9915234 total:9915234
min:9914583 max:9914583 total:9914583

But perhaps you don't trust this tool and prefer the in-base one. Note
the higher overhead of the test infra, thus lower numbers.

(kgdb) set elf64_freebsd_sysvec.sv_fetch_syscall_args=cpu_fetch_syscall_args_bcopy

getppid 20 1.061986919 6251542 0.000000169

(kgdb) set elf64_freebsd_sysvec.sv_fetch_syscall_args=cpu_fetch_syscall_args_oolmemcpy

getppid 79 1.062522431 6245666 0.000000170

(kgdb) set elf64_freebsd_sysvec.sv_fetch_syscall_args=cpu_fetch_syscall_args_oolmemcpy_es

getppid 107 1.059988384 7538473 0.000000140

(kgdb) set elf64_freebsd_sysvec.sv_fetch_syscall_args=cpu_fetch_syscall_args

getppid 130 1.059987532 8057928 0.000000131

As you can see, the original code (doing bcopy) is way slower than the
inlined variant.

Not taking advantage of the ERMS bit can now be fixed at runtime thanks
to the recently landed ifunc support.

The bcopy code can be further depessimized:

ENTRY(bcopy)
        PUSH_FRAME_POINTER
        xchgq   %rsi,%rdi

xchg is known to be slow; the preferred way is to swap registers "by
hand". The fact that the func always has to do it is a bug on its own.
Interestingly, memmove *does not have* to do it, so this provides even
more reasons to just remove this function in the first place.

        movq    %rdi,%rax
        subq    %rsi,%rax
        cmpq    %rcx,%rax       /* overlapping && src < dst? */
        jb      1f

Avoidable (for memcpy-compatible callers) branch, although statically
predicted as not taken (memcpy friendly).

        shrq    $3,%rcx         /* copy by 64-bit words */
        rep movsq
        movq    %rdx,%rcx
        andq    $7,%rcx         /* any bytes left? */
        rep movsb

The old string copy.

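As a reading aid, the annotated bcopy listing can be paraphrased in C roughly as follows. This is a sketch under stated assumptions, not the kernel routine: `needs_backward_copy` and `bcopy_sketch` are hypothetical names, and the backward path (label 1: in the assembly, not shown in the listing) is replaced by a plain memmove call.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/*
 * Mirrors "subq %rsi,%rax; cmpq %rcx,%rax; jb 1f". With unsigned
 * arithmetic, (dst - src) < len holds only when dst lies inside
 * [src, src + len), where a forward copy would clobber unread source.
 */
int
needs_backward_copy(const void *src, const void *dst, size_t len)
{
	return ((uintptr_t)dst - (uintptr_t)src < (uintptr_t)len);
}

/* Hypothetical C paraphrase of the forward path; argument order as bcopy. */
void
bcopy_sketch(const void *src, void *dst, size_t len)
{
	const unsigned char *s = src;
	unsigned char *d = dst;
	size_t words, tail;
	uint64_t w;

	assert(src != NULL && dst != NULL);
	if (needs_backward_copy(src, dst, len)) {
		memmove(dst, src, len);	/* stand-in for the backward loop */
		return;
	}
	words = len >> 3;	/* shrq $3,%rcx: number of 8-byte words */
	tail = len & 7;		/* andq $7,%rcx: leftover bytes */
	while (words-- > 0) {	/* rep movsq */
		memcpy(&w, s, sizeof(w));	/* via a temporary, so overlap is safe */
		memcpy(d, &w, sizeof(w));
		s += sizeof(w);
		d += sizeof(w);
	}
	while (tail-- > 0)	/* rep movsb */
		*d++ = *s++;
}
```

When the length is a compile-time multiple of 8, `tail` is always zero and the overlap test is unnecessary for memcpy-style callers, so both the branch and the byte loop are dead weight for such a call.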
Note that reordering andq prior to rep movsb can help some
microarchitectures.

The movsq -> movsb combo induces significant stalls in the cpu frontend.

If the target size is a multiple of 8 and is known at compilation time,
we can get away with calling a variant which only deals with this case
and thus avoid the extra penalty.

Either way, the win *is here* and is definitely not in the range of 3%.

-- 
Mateusz Guzik <mjguzik gmail.com>