Date: Sat, 7 Jul 2018 03:55:35 +1000 (EST)
From: Bruce Evans <brde@optusnet.com.au>
To: John Baldwin <jhb@freebsd.org>
Cc: rgrimes@freebsd.org, Warner Losh <imp@bsdimp.com>,
    Hans Petter Selasky <hselasky@freebsd.org>,
    src-committers <src-committers@freebsd.org>,
    svn-src-all@freebsd.org, svn-src-head@freebsd.org
Subject: Re: svn commit: r336025 - in head/sys: amd64/include i386/include
Message-ID: <20180707031245.J2611@besplex.bde.org>
In-Reply-To: <1f87b7ba-3b59-e710-00b0-91a4b0e4e5b4@FreeBSD.org>
References: <201807061552.w66Fq0FX052931@pdx.rh.CN85.dnsmgr.net>
    <1f87b7ba-3b59-e710-00b0-91a4b0e4e5b4@FreeBSD.org>
On Fri, 6 Jul 2018, John Baldwin wrote:

> On 7/6/18 8:52 AM, Rodney W. Grimes wrote:
>> ...
>> Trivial to fix this with
>> +#if defined(SMP) || !defined(_KERNEL) || defined(KLD_MODULE) || !defined(KLD_UP_MODULES)
>
> This is not worth it.  Note that we already use LOCK always in userland
> which is probably far more prevalent than the use in modules.
>
> Previously atomics in modules were _function calls_ just to avoid the LOCK.
> Having the LOCK prefix present even on UP is probably far more efficient
> than a function call.

No, the lock prefix is less efficient.  IIRC, on very old systems (~PPro),
lock prefixes cost 20 cycles in the UP case.  On AthlonXP, they cost about
19 cycles, but function calls (written in C) only cost about 6 cycles.
This depends on pipelining, and my test is perhaps too simple since it uses
a loop where the pipelining works especially well (it executes 2 or 3
function calls in parallel).

Actual timing on AthlonXP UP:
- asm loop: 2 cycles/iteration
- "incl mem" in asm loop: 5.85 cycles (but with less alignment, only 3.25
  cycles)
- "lock; incl mem" in asm loop: 18.9 cycles
- function call in C loop to C function doing "incl mem" in asm: 8.35 cycles
- function call in C loop to C function doing "lock; incl mem" in asm:
  24.95 cycles.

Newer CPUs have better pipelining.  On Haswell, this gives the strange
behaviour that the function call written in C is slightly faster than
inline code written in asm:

Actual timing on Haswell SMP:
- asm loop: 1.16 cycles/iteration
- "incl mem" in asm loop: 6.95 cycles
- "lock; incl mem" in asm loop: 19.00 cycles
- function call in C loop to C function doing "incl mem" in asm: 6 cycles
- function call in C loop to C function doing "lock; incl mem" in asm:
  26.00 cycles.
The C code with the function call executes:

loop:
    call    incl
incl:
    pushl   %ebp
    movl    %esp,%ebp
    [lock;] incl mem
    leave
    ret
    incl    %ebx
    cmpl    $4080000000-1,%ebx
    jbe     done

I didn't even compile with -fframe-pointer or try clang which would do
excessive unrolling.  -fframe-pointer takes 3 extra instructions in incl,
but these take no extra time.

In non-benchmark use, there would be more args for the function call and
the scheduling would be very different, so the timing might be very
different.  I expect the function call would be insignificantly slower
except in micro-benchmarks.

Bruce
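
For anyone wanting to repeat this kind of comparison, here is a minimal sketch and not the test program used for the numbers in this message. It makes several assumptions. It uses C11 atomics instead of hand-written asm, so the compiler chooses the exact instruction (on x86 typically a lock-prefixed add or xadd). It reports nanoseconds rather than cycles. It calls through a function pointer, which the compiler may or may not turn into a direct call. All names (inc_plain, inc_locked, time_calls, bench_report) are hypothetical, and noinline is a GCC/clang extension.

```c
#ifndef _POSIX_C_SOURCE
#define _POSIX_C_SOURCE 199309L /* for clock_gettime() */
#endif
#include <stdatomic.h>
#include <stdint.h>
#include <stdio.h>
#include <time.h>

/* volatile forces a real memory increment on every call. */
static volatile int plain_counter;
/* Atomic counter; on x86 the increment usually gets a lock prefix. */
static atomic_int locked_counter;

/* Non-inlined so that each loop iteration pays for a real call. */
__attribute__((noinline)) static void inc_plain(void)
{
    plain_counter++;
}

__attribute__((noinline)) static void inc_locked(void)
{
    atomic_fetch_add_explicit(&locked_counter, 1, memory_order_relaxed);
}

/* Monotonic clock reading in nanoseconds. */
static double now_ns(void)
{
    struct timespec ts;

    clock_gettime(CLOCK_MONOTONIC, &ts);
    return (double)ts.tv_sec * 1e9 + (double)ts.tv_nsec;
}

/* Call fn() iters times and return the average time per call in ns. */
static double time_calls(void (*fn)(void), uint32_t iters)
{
    double start = now_ns();

    for (uint32_t i = 0; i < iters; i++)
        fn();
    return (now_ns() - start) / (double)iters;
}

/* Print both averages; results vary with CPU, compiler and flags. */
void bench_report(uint32_t iters)
{
    printf("plain  increment via call: %.2f ns/iteration\n",
        time_calls(inc_plain, iters));
    printf("locked increment via call: %.2f ns/iteration\n",
        time_calls(inc_locked, iters));
}
```

Calling bench_report(100000000) from main() gives a rough per-call cost for each variant. As noted above, a tight loop like this pipelines unusually well, so the results say little about non-benchmark use.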