Date:      Sun, 6 May 2018 16:32:58 +1000 (EST)
From:      Bruce Evans <brde@optusnet.com.au>
To:        Mateusz Guzik <mjguzik@gmail.com>
Cc:        Bruce Evans <brde@optusnet.com.au>, Conrad Meyer <cem@freebsd.org>,  Brooks Davis <brooks@freebsd.org>,  src-committers <src-committers@freebsd.org>, svn-src-all@freebsd.org,  svn-src-head@freebsd.org
Subject:   Re: svn commit: r333240 - in head/sys: powerpc/powerpc sys
Message-ID:  <20180506123307.F1132@besplex.bde.org>
In-Reply-To: <CAGudoHHKvPKiprG36i=JEWsL39JWB1fZ86QH2Y7DwaJK0=WdLQ@mail.gmail.com>
References:  <201805040400.w4440moH025057@repo.freebsd.org> <20180504155301.GA56280@spindle.one-eyed-alien.net> <CAG6CVpUr_2b-yT1-uExcY1Tvpg=-P3y_owNHQ0UPg604to8Y0Q@mail.gmail.com> <20180505090954.X1307@besplex.bde.org> <CAGudoHHKvPKiprG36i=JEWsL39JWB1fZ86QH2Y7DwaJK0=WdLQ@mail.gmail.com>

On Sat, 5 May 2018, Mateusz Guzik wrote:

> On Sat, May 5, 2018 at 2:38 AM, Bruce Evans <brde@optusnet.com.au> wrote:
>
>> I don't believe the claimed speedup from using the optimized bcopy()
>> in cpu_fetch_syscall_args() on amd64 (r333241).  This changes the copy
>> size from a variable to 6*8 bytes.  It is claimed to speed up getppid()
>> from 7.31M/sec to 10.65M/sec on Skylake.  But here it and other changes
>> in the last week give a small pessimization for getpid() on Haswell
>> at 4.08GHz: last week, 100M getpid()'s took 8.27 seconds (12.09M/sec).
>> Now they take 8.30 seconds (12.05M/sec).  (This is with very old
>> libraries, so there is no possibility of getpid() being done in
>> userland.)  0.03 seconds is 122.4M cycles.  So the speed difference
>> is 1.224 cycles.  Here the timing has a resolution of only 0.01 seconds,
>> so most digits in this 1.224 are noise, but the difference is apparently
>> a single cycle.  I would have expected more like the "rep movs*" setup
>> overhead of 25 cycles.
>>
> The mail below only deals with performance claims on amd64. I'll see about
> gcc 4.2.1 vs 32-bit archs later and other claims later.

Well, I saw about gcc-4.2.1 before because that is what I use.  I now know
the full details.

r333240 (bcopy -> { __builtin_memmove or bcopy } in systm.h) had no significant
effect for this benchmark since the bcopy() is still used and still has a
variable size in cpu_fetch_syscall_args().

r333241 (regcnt -> 6 args in trap.c) gave a tiny pessimization for gcc on
amd64 (also clang on i386?), by using one of the bugs in r333240.  As
explained in a previous reply, __builtin_memmove() is never inlined for
gcc-4.2.1, but is converted to memmove().  memmove() is a wrapper for
bcopy(), so must be slightly slower.  It is less than 0.1% slower for
this benchmark according to the times below (but the timing is not done
carefully enough to distinguish this from noise).

r333265 (memcpy -> __builtin_memcpy in systm.h) has no effect for the
benchmark yet.

r333266 (bcopy -> memcpy in trap.c) gives a significant optimization for
gcc too (at least on amd64), since r333265 makes memcpy == __builtin_memcpy
and __builtin_memcpy() actually works for gcc-4.2.1, at least on amd64.
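The distinction is easy to show in miniature (a userland toy with
hypothetical names, not trap.c; in the kernel the names are mapped through
systm.h, but the compiler behaviour is the same):

#include <stdint.h>
#include <string.h>
#include <strings.h>

void
fetch_args_fixed(uint64_t *dst, const uint64_t *src)
{
	/* Constant size: gcc-4.2.1 expands this inline as a few movq's. */
	memcpy(dst, src, sizeof(dst[0]) * 6);
}

void
fetch_args_var(uint64_t *dst, const uint64_t *src, size_t n)
{
	/* Variable size: this remains an out-of-line bcopy() call. */
	bcopy(src, dst, n * sizeof(dst[0]));
}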

Timing for this: 10**8 getpid() calls (lowest of several samples with
"cpuset -l 6 time ./getpid-test"):

  r333291:     6.67 real         1.21 user         5.46 sys
 ~r333241:     8.04 real         1.21 user         6.82 sys
 ~r333240:     8.03 real         1.22 user         6.81 sys

where ~r33324[0-1] is r333291 with only trap.c changed to its r33324[0-1]
version.

This benchmark is sensitive enough to show scheduler problems.  I use
SCHED_4BSD, and have mostly fixed it to work better than SCHED_ULE for
makeworld and for micro-benchmarks like this, but not in this branch.
I used cpuset -l N here since I didn't want to see any scheduler problems,
but instead it showed them.  Kernel threads bound to (fixed) CPUs interfere
with scheduling of other threads.  The test system has 4 cores with 2
threads/core.  It has kernel threads bound to CPUs 0 and 4 taking about
1.5% of each of these CPUs (too much).  This is shown by the ~r333240
benchmark taking about 0.12 seconds longer when restricted to either one
of these CPUs.  Even SCHED_4BSD knows to mostly avoid using these CPUs
when not restricted (and there are other CPUs to spare).  But it doesn't
know to avoid using CPUs 1 and 5, which share resources with these CPUs.
The benchmark takes about 0.01 seconds longer when restricted to either
CPU 1 or 5, or less than that but with more variance when not restricted.
I used cpuset -l 6 to avoid the variance.

Further analysis of these times:

8.03 seconds at 4.08GHz is 327.6e8 cycles or 327.6 cycles/call
8.04 seconds at 4.08GHz is 328.0e8 cycles or 328.0 cycles/call
6.67 seconds at 4.08GHz is 272.1e8 cycles or 272.1 cycles/call

The difference between 327.6 and 328 cycles is in the noise.  This shows
that the cost of memmove() calling bcopy() is in the noise.

The difference between 328.0 and 272.1 cycles is 55.9 cycles.  This is
almost exactly the startup overhead of the 2 "rep movs"'s in bcopy().

> It is unclear to me whether you actually benchmarked syscall performance
> before and after the change nor how you measured a slowdown in getpid.

I just used my 30 year old 9-line benchmark for getpid() and edited it a bit
to increase the number of iterations from 10**6 to 10**8.  (I try to keep
this benchmark invariant down to the binary level, but only have i386
binaries.)
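The shape of it is no secret.  A minimal equivalent (a sketch only, not
the original binary-compatible program) is just:

#include <unistd.h>

int
main(void)
{
	/*
	 * 10**8 iterations; timed externally, e.g. as in
	 * "cpuset -l 6 time ./getpid-test" above.  Note the caveat about
	 * old libraries: a libc that caches getpid() in userland would
	 * not enter the kernel at all.
	 */
	for (int i = 0; i < 100000000; i++)
		getpid();
	return (0);
}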

> This mail outlines what was tested and how.
>
> If you want, you can mail me off list and we can arrange so that you get
> root access to the test box and can measure things yourself, boot your
> own kernel and whatnot.
> 
> My preferred way of measurement is with this suite:
> https://github.com/antonblanchard/will-it-scale
>
> Unfortunately it requires a little bit of patching to setup.

I prefer my own (mostly simpler) benchmarks.  The difficulty is controlling
the configuration, including remembering which system I'm running on when
modifying configurations, even with only 1 server and 3 test systems here.
>
> Upside is cheap processing: there is a dedicated process/thread which
> collects results once a second, other than that the processes/threads
> running the test only do the test and bump the iteration counter.
> getppid looks like this:
>        while (1) {
>                getppid();
>                (*iterations)++;
>        }
>
> If you are interested in trying it out yourself without getting access to
> the box in question I'm happy to provide a bundle which should be easily
> compilable.

Good for lots of different benchmarks, but I usually prefer deeper analysis
of a single benchmark.
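For completeness, the pattern quoted above can be approximated in a
self-contained way like this (a simplification of the idea only, not the
actual will-it-scale harness; with a single worker, min, max and total
coincide):

#include <sys/mman.h>

#include <stdio.h>
#include <unistd.h>

int
main(void)
{
	/* Counter in shared memory so the collector can sample it. */
	volatile unsigned long *iterations = mmap(NULL, sizeof(*iterations),
	    PROT_READ | PROT_WRITE, MAP_ANON | MAP_SHARED, -1, 0);

	if (iterations == MAP_FAILED)
		return (1);
	if (fork() == 0) {
		/* Worker: nothing but the syscall and the counter bump. */
		for (;;) {
			getppid();
			(*iterations)++;
		}
	}
	/* Collector: report the per-second rate. */
	for (;;) {
		unsigned long prev = *iterations, now;

		sleep(1);
		now = *iterations;
		printf("min:%lu max:%lu total:%lu\n",
		    now - prev, now - prev, now - prev);
	}
}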

> Perhaps you are using the tool which can be found here:
> tools/tools/syscall_timing
>
> It reports significantly lower numbers (even 20% less) because the test
> loop itself has just more overhead.

My results above show almost 25% overhead according to the user:sys ratio.
This ratio has a high variance, but I almost trust it.  For my makeworld
benchmark, I recently rewrote the timing to not trust any of the times
reported by getrusage(), but to use raw ticks.  getrusage() still doesn't
even report interrupt times, and a large amount of the sys time is now
hidden in kernel thread times (interrupt time in user threads is closer
to 0 than it used to be, thus less needed, since interrupt time is now
mostly hidden in kernel ithreads or unavailable for "fast" interrupt
handlers).  getrusage() is especially inaccurate for the short-lived
processes that are typical for compiling -- it has to scale tick counts up
into user and sys times, and for short-lived processes the tick counts
are usually 0 and 0 or 1 and 0 when it is called in exit().
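A tiny illustration of that granularity (a throwaway demo, not my makeworld
instrumentation): for a process this short, both values are built from 0 or
1 statclock ticks, so they are mostly noise.

#include <sys/types.h>
#include <sys/resource.h>

#include <stdint.h>
#include <stdio.h>

int
main(void)
{
	struct rusage ru;

	if (getrusage(RUSAGE_SELF, &ru) != 0)
		return (1);
	/* Typically prints values at or near 0 for a process this short. */
	printf("user %jd.%06ld sys %jd.%06ld\n",
	    (intmax_t)ru.ru_utime.tv_sec, (long)ru.ru_utime.tv_usec,
	    (intmax_t)ru.ru_stime.tv_sec, (long)ru.ru_stime.tv_usec);
	return (0);
}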

> For the first testing method results are as I wrote in the commit message,
> with one caveat that I disabled frequency scaling and they went down a
> little
> bit (i.e. NOT boosted freq, but having it disabled makes things tad bit
> slower; *boosted* numbers are significantly bigger but also not reliable).

I almost always do that here.  My Haswell system always throttles in Turbo
mode, so Turbo mode gives a speedup of about 1% with high variance instead
of the 10% that it is capable of for short bursts.

> The syscall path has long standing pessimization of using a function pointer
> to get to the arguments. This fact was utilized to provide different
> implementations switchable at runtime (thanks to kgdb -w and the set
> command).

That is actually not a very long-standing pessimization.  My server runs
FreeBSD-~5.2 and doesn't have it.  Its syscalls are about 30% faster on
i386 IIRC.  i386 syscalls have more fundamental (hardware) overhead
than amd64 syscalls, so other pessimizations are relatively larger.
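For readers not staring at trap.c, the indirection has roughly this shape
(paraphrased, not verbatim kernel source; only the struct and field names
visible in the kgdb command below are taken as given):

struct thread;

/* Per-ABI vector of syscall-handling hooks (heavily abridged). */
struct sysentvec_sketch {
	/* ... */
	int	(*sv_fetch_syscall_args)(struct thread *);
	/* ... */
};

/* The syscall path then does, approximately:
 *	error = (p->p_sysent->sv_fetch_syscall_args)(td);
 * instead of calling cpu_fetch_syscall_args() directly.
 */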

My i586(npx)-optimized bcopy() had problems with its function pointers.
They cost a lot according to some benchmarks.  Other benchmarks showed that
the problem is more with general cache misses or just branch target cache
misses.  A couple of cycles for the indirection is nothing compared with
"rep cmps" startup overhead.

> The variants include:
> 1. cpu_fetch_syscall_args
>
> This is the routine as present in head, i.e. with inlined memcpy
>
> 2. cpu_fetch_syscall_args_bcopy
>
> State prior to my changes, i.e. a bcopy call with a variable size
>
> 3. cpu_fetch_syscall_args_oolmemcpy
>
> Forced to not be inlined memcpy with a constant size, the code itself
> is the same as for regular memcpy
>
> 4. cpu_fetch_syscall_args_oolmemcpy_es

Some of these functions are not declared inline, but are automatically
inlined by clang whether you want this or not.  I don't want this or
clang, so I use gcc and use its options to turn off inlining
(-fno-inline, which is needed to unbreak -Os, and then -Os, and
-fno-inline-functions-called-once).  Also -mtune=i386 for i386, to
prevent generation of instructions that ddb can't disassemble.  So the
getpid() benchmark and some others run a bit slower for me.  I mostly
only care about the makeworld benchmark.  This runs less than 1% slower
without the fancy optimizations.  Then I speed it up by 20% sys time or
2% real time with my own fancy optimizations.

> Forced to not be inlined memcpy with a constant size, the code itself
> was modified to utilize the 'Enhanced REP MOVSB/STOSB' bit present on
> Intel cpus made past 2011 or so.

This is not so easy to do without pessimizing systems which don't have
fast-strings.  But x86 memcpy() almost gives this unintentionally.  It
was supposed to be a simple fallback for cases where memcpy() wasn't
inlined, back when memcpy() was automatically converted to
__builtin_memcpy().  memcpy() was supposed to be used instead of bcopy()
only for small fixed-size copying (other uses are style bugs).  But
the builtin is never used for -O0 and in some other cases (mostly when
memcpy() should not have been used).  The fallback handles these cases.
It was intentionally simple and not very optimized, and could have used
"rep movsb" to be even simpler and inhibit the style bugs by running
slower.  But with fast-strings, the simpler version is also faster.
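The fast-strings idea is easy to sketch (illustrative only, not the
committed routine): with ERMS, a bare "rep movsb" handles every size, so
there is no movsq/movsb split to stall on.

#include <stddef.h>

static inline void
copy_erms_sketch(void *dst, const void *src, size_t len)
{
	/* Forward copy only; no overlap handling, unlike bcopy()/memmove(). */
	__asm __volatile("rep movsb"
	    : "+D" (dst), "+S" (src), "+c" (len)
	    :
	    : "memory");
}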

> The code can be found here: https://people.freebsd.org/~mjg/copyhacks.diff
> (note: declarations were not added, so WERROR= or similar is needed to
> get this to compile)

I have the following versions of copying functions which you don't want
to find details of:

static struct func funcs[] =
{
     { copy0, "copy0", MOVS, },
     { copy1, "copy1", "unroll *4", },
     { copy2, "copy2", "unroll *4 prefetch", },
     { copy3, "copy3", "unroll *16 i586-opt", },
     { copy4, "copy4", "unroll *16 i586-opt prefetch", },
     { copy5, "copy5", "unroll *16 i586-opx prefetch", },
     { copy6, "copy6", "unroll *8 prefetch 4", },
     { copy7, "copy7", "unroll 64 fp++", },
     { copy8, "copy8", "unroll 128 fp i-prefetch", },
     { copy9, "copy9", "unroll 64 fp reordered", },
     { copyA, "copyA", "unroll 256 fp reordered++", },
     { copyB, "copyB", "unroll 512 fp reordered++", },
     { copyC, "copyC", "Terje cksum", },
#ifdef __i386__
     { copyD, "copyD", "kernel bcopy (unroll 64 fp i-prefetch)", },
#endif
     { copyE, "copyE", "unroll 64 fp i-prefetch++", },
#ifdef __i386__
     { copyF, "copyF", "new kernel bcopy (unroll 64 fp i-prefetch++)", },
#endif
     { copyG, "copyG", "memcpy ("MOVS")", },
     { copyH, "copyH", "movntps", V_ATHSSE | V_SSE2, },
     { copyI, "copyI", "movntps with prefetchnta", V_ATHSSE | V_SSE2, },
     { copyJ, "copyJ", "movntps with block prefetch", V_ATHSSE | V_SSE2, },
     { copyK, "copyK", "movq", V_MMX, },
     { copyL, "copyL", "movq with prefetchnta", V_SSE, },
     { copyM, "copyM", "movq with block prefetch", V_MMX, },
     { copyN, "copyN", "movntq", V_ATHSSE, },
     { copyO, "copyO", "movntq with prefetchnta", V_ATHSSE, },
     { copyP, "copyP", "movntq with block prefetch", V_ATHSSE, },
     { copyQ, "copyQ", "movdqa", V_SSE2, },
     { copyR, "copyR", "movdqa with prefetchnta", V_SSE2, },
     { copyS, "copyS", "movdqa with block prefetch", V_SSE2, },
     { copyT, "copyT", "unroll *8 a64-opt", },
     { copyU, "copyU", "unroll *8 a64-opt with prefetchnta", V_SSE, },
     { copyV, "copyV", "unroll *8 a64-opt with block prefetch", },
     { copyW, "copyW", "movnti", V_SSE2, },
     { copyX, "copyX", "movnti with prefetchnta", V_SSE2, },
     { copyY, "copyY", "movnti with block prefetch", V_SSE2, },
     { copyZ, "copyZ", "i686_memcpy( movdqa)", V_SSE2, },
     { copya, "copya", "~i686_memcpy (movaps)", V_SSE, },
};

These only attempt to be fast for large blocks (about 4K and up).  They
are all old and show the problem space for ~1996-2004:
- movs* and bcopy were mediocre for CPUs in that era.  They had too large
   a startup overhead for small sizes and didn't know enough about caches
   for large sizes.
- various unrolling and fp versions are mostly for i586's (P1's, where the
   best methods had to manage the U and V pipes)
- cksum while copying was once almost free, but was worst then and is even
   worse now
- various methods using MMX and SSE were slightly better than copying through
   even 64-bit integer registers then, but not enough better to be worth using.
   Now AVX is enough better to be worth using, but all of these methods are
   unavailable in the kernel, except by switching the "fpu" state, which has
   much worse startup overhead than "rep movs*".
- methods using nontemporal stores were barely worth using.  amd64 used them
   unconditionally for pagezero before you broke this for the systems that
   benefit from it and I restored conditional use.  I use them for pagecopy
   and pagezero on AthlonXP.  Now caches are better and these methods are
   not worth using (on the newer CPUs with better caches).
- methods using explicit prefetching were barely worth using on AthlonXP,
   but not worth using as early as Athlon64.

Things are much more complicated now, unless you drop support for all except
the newest high-end (x86) CPUs (assume AVX).

> Variants were switched at runtime like this:
> # kgdb -w
> (kgdb) set
> elf64_freebsd_sysvec.sv_fetch_syscall_args=cpu_fetch_syscall_args_bcopy
>
> The frequency is fixed with:
>
> # sysctl dev.cpu.0.freq=2100
>
> PTI was disabled (vm.pmap.pti=0 in loader.conf).

I also have PTI disabled for amd64, and didn't downgrade i386 to the slow
versions where PTI is not optional.

> Now, quick numbers from will it scale:
> (kgdb) set
> elf64_freebsd_sysvec.sv_fetch_syscall_args=cpu_fetch_syscall_args_bcopy
>
> min:7017152 max:7017152 total:7017152
> min:7023115 max:7023115 total:7023115
> min:7018879 max:7018879 total:7018879

Cycle counts would be easier to understand.  They scale fairly well between
all x86's newer than a P6 (the number of pipelines and the throughputs are
almost the same.  Caches are most different, but micro-benchmarks avoid
cache effects too much).
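One cheap way to get such numbers (a sketch; it ignores serialization, and
on newer x86 the TSC ticks at the nominal rather than the current core
frequency):

#include <stdio.h>
#include <unistd.h>
#include <x86intrin.h>

int
main(void)
{
	const int n = 100000000;
	unsigned long long t0, t1;

	t0 = __rdtsc();
	for (int i = 0; i < n; i++)
		getppid();
	t1 = __rdtsc();
	printf("%.1f TSC cycles/call\n", (double)(t1 - t0) / n);
	return (0);
}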

> ...
> As you can see the original code (doing bcopy) is way slower than the
> inlined variant.

The slowness seems to be almost all in the setup overhead, not quite as
I predicted.  With the following trivial fix for bcopy:

XX Index: support.S
XX ===================================================================
XX --- support.S	(revision 333291)
XX +++ support.S	(working copy)
XX @@ -135,8 +135,10 @@
XX  	movsq
XX  	movq	%rdx,%rcx
XX  	andq	$7,%rcx				/* any bytes left? */
XX +	je	666f
XX  	rep
XX  	movsb
XX +666:
XX  	POP_FRAME_POINTER
XX  	ret
XX

and r333266 modified back to using real bcopy:

XX Index: trap.c
XX ===================================================================
XX --- trap.c	(revision 333291)
XX +++ trap.c	(working copy)
XX @@ -908,7 +908,7 @@
XX  	error = 0;
XX  	argp = &frame->tf_rdi;
XX  	argp += reg;
XX -	memcpy(sa->args, argp, sizeof(sa->args[0]) * 6);
XX +	(bcopy)(argp, sa->args, sizeof(sa->args[0]) * 6);
XX  	if (sa->narg > regcnt) {
XX  		KASSERT(params != NULL, ("copyin args with no params!"));
XX  		error = copyin(params, &sa->args[regcnt],

the time improves by much more than the predicted 25 cycles (times repeated
from above):

  r333291:     6.67 real         1.21 user         5.46 sys (inline memcpy)
 ~r333241:     8.04 real         1.21 user         6.82 sys (extern memmove)
 ~r333240:     8.03 real         1.22 user         6.81 sys (extern bcopy)
  above:       6.86 real         1.11 user         5.74 sys (extern fixed bcopy)

That is, 280 cycles (above), which is only 8 cycles slower than the inlined
version.  Maybe there is more than 1 bcopy per getpid(), or the startup
overhead is only high for the null "rep movsb" because it depends on the
previous non-null one, so it stalls the pipelines, while dependencies on the
previous one are far enough away that it can run mostly in parallel.

> Not taking advantage of the ERMS bit can be now fixed at runtime thanks to
> recently landed ifunc support.

I think I won't like the complications from that.
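For reference, the userland shape of that mechanism (the kernel uses its
own wrapper macros, and all names here are hypothetical): the resolver runs
once when the symbol is bound, so afterwards the choice costs no per-call
branch or feature test.

#include <stddef.h>
#include <string.h>

static void
zero_portable(void *p, size_t n)
{
	memset(p, 0, n);
}

static void
zero_fancy(void *p, size_t n)
{
	memset(p, 0, n);	/* stand-in for an ERMS/SIMD variant */
}

static int
cpu_wants_fancy(void)
{
	return (0);		/* stand-in for a CPUID feature check */
}

/* Resolver: run once, when the dynamic linker binds zero_buf. */
static void (*resolve_zero_buf(void))(void *, size_t)
{
	return (cpu_wants_fancy() ? zero_fancy : zero_portable);
}

void zero_buf(void *, size_t) __attribute__((ifunc("resolve_zero_buf")));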

> bcopy code can be further depessimized:
>
> ENTRY(bcopy)
>        PUSH_FRAME_POINTER
>        xchgq   %rsi,%rdi
>
> xchg is known to be slow, the preferred way is to swap registers "by hand".
> the fact that the func always has to do this is a bug on its own.

Really?  I see that in the 2002 Athlon optimization manual, xchg of different
registers is vector path and takes 2 cycles.  3 mov's to exchange would take
about the same number of cycles since they have dependencies, but both
can be reordered, so the dependencies shouldn't matter much.  E.g., the
above xchg can be done before PUSH_FRAME_POINTER either directly or maybe
the CPU can move it.  Or just don't use either register immediately.

> Interestingly
> memmove *does not have* to do it, so this provides even more reasons to just
> remove this function in the first place.

Nah, register moves are almost free.  They can be done in parallel
with other operations (like the cld that used to be here).  Except on
in-order x86 like Atom I guess.  Almost all functions would be much
slower without this.  In compiled code, the compiler should generate
a good order.  This asm code is still "optimized" for
in-order-original-i386 (probably badly even for that), but CPUs now
do enough reordering to handle most function entries well.  We also
depend on this for PUSH_FRAME_POINTER to not be too slow.  i386 was
"optimised" by not doing this, and it really was an optimization before
P6, and it still doesn't do this.

Also, i386 bcopy() doesn't have this problem.  Before it was aggressively
pessimized by 4+4, it just loaded the arguments off the stack into the
right registers to begin with.

The long Linux thread in 2010 pointed to by danfe says to do the opposite
(for the fairly bogus reason that memcpy() was misused, but bcopy() doesn't
have the problem).

>
>        movq    %rdi,%rax
>        subq    %rsi,%rax
>        cmpq    %rcx,%rax                       /* overlapping && src <
> dst? */
>        jb      1f
>
> Avoidable (for memcpy-compatible callers) branch, although statically
> predicted
> as not taken (memcpy friendly).

Static prediction only helps much for the first call.  Another reason why
memmove() shouldn't exist is that when there are multiple alternative
functions that are not just aliases, each takes resources like branch
prediction cache entries, so the alternatives are less likely to all
stay cached, or if they do then they prevent the resources being used
elsewhere.

>
>        shrq    $3,%rcx                         /* copy by 64-bit words */
>        rep
>        movsq
>        movq    %rdx,%rcx
>        andq    $7,%rcx                         /* any bytes left? */
>        rep
>        movsb
>
> The old string copy. Note that reordering andq prior to rep movsb can help
> some microarchitectures.

Are there really any that stupid?  In-order ones are equally slow wherever
the instructions are, and out-of-order ones have an especially large amount
of time to reorder this andq.

> The movsq -> movsb combo induces significant stalls in the cpu frontend.

Indeed, as my last benchmark result shows.

I tried avoiding this "rep movsb" and other optimizations several years
ago (probably several times) but never noticed much effect from this before.
CPUs seemed to be too smart about reordering.

> If the target size is a multiple of 8 and is known at compilation time, we
> can get away with calling a variant which only deals with this case and
> thus avoid the extra penalty.

I think that is not a useful optimization, since the cost of handling this
case is approximately 0 cycles plus a branch target cache resource for
"jz 666" (1 cycle for a predicted branch, but hopefully that can be in
parallel).  One of the old i386 "optimizations" was to avoid branches like
this since i386's had no branch target cache.

> Either way, the win *is here* and is definitely not in the range of 3%.

Still 0.3% outside of benchmarks :-).

Bruce


