Date:      Mon, 24 Aug 2020 23:20:10 -0700
From:      Mark Millard <marklmi@yahoo.com>
To:        freebsd-arm <freebsd-arm@freebsd.org>
Subject:   An aarch64 4-core-SBC FreeBSD performance oddity: Rock64 and RPi4B examples
Message-ID:  <255626A7-8731-4849-A5FD-50CBFB1AA6DC@yahoo.com>
References:  <255626A7-8731-4849-A5FD-50CBFB1AA6DC.ref@yahoo.com>

The point here is the systematic bias, which might not be
expected and might indicate that something is not working as
intended. (The performance details that suggest this are not
directly the point; they are the suggestive evidence of some
sort of bias in the implementation.)

I have found that, under at least FreeBSD head -r363590, which
pair of cpus (cores) is used for a 2-thread activity in the
program I'm using affects the performance measurably and
systematically (though not necessarily by a large amount).

Basically, when the cpu pair involves cpu2, things go slower
than otherwise in contexts where memory caches help with
performance. (Otherwise RAM slowness or thread-creation time
dominates the measurement involved.) There are 3 cases for
this:

A) The cpu pair does not involve cpu2. These perform similar
   to each other [relative to (B) and (C) below].

B) cpu2 is in use with one of cpu1 or cpu3. These are different
   from (A) for performance: slower. But the two (B) cases are
   similar to each other [relative to (A) and (C)].

C) cpu2 is in use with cpu0. This is slower than both (A) and (B).
   This case also seems to have somewhat more variability in the
   performance compared to (A) and (B).

The Rock64 and RPi4B have very different memory-subsystem
performance behavior overall, but the above summary still
applies to both. I've not seen such differences in, say, an
RPi4B Ubuntu context. I've not tested other example contexts.

I limit the cpus/cores via cpuset use on the command line (an
illustrative invocation is shown below). I can build the
program involved either with each thread in the test locked
down to a distinct cpu/core within what cpuset is told to
allow, or without that, allowing migration to occur between
those cpus/cores. (No cpuset use would be needed for the
4-core case on the example SBCs.) The effect is measurable
both ways.
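
For example, on FreeBSD something like the following restricts
a run to cpu0 and cpu2 (the binary name/path is just a
placeholder here, not necessarily how I invoke my build):

    cpuset -l 0,2 ./acpphint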

I test both at the boot -s command prompt and at normal login
command prompts. (These contexts vary in competing activity,
including for RAM cache use.) The effect is measurable both
ways.

I have tested two distinct RPi4Bs but have access to only
one Rock64. All 3 of these boards show the general structure
reported.


As for graphs showing examples . . .

In the graphs of the results, the colored curves are the
cpu-pair curves (green, blue, red). I provide dark grey
for single-threaded and 4-core results as context for
comparison. Any 3-thread examples included for comparison
are light grey.

green: cpu pair does not involve cpu2 (fastest)
red:   cpu pair is cpu0 and cpu2 (slowest)
blue:  cpu pair involves cpu2 but not cpu0 (in between)

(The single-threaded curve(s) are the most different
from the others on each SBC so they stand out.)

I'll note that, for the multi-threaded curves, being toward
the left on the x-axis means thread creation is a larger
fraction of the overall time for that problem size, and that
limits the y-axis figure for the size. (For multi-threaded
runs, thread creations are part of what is measured for each
problem size.)

x-axis: logarithmic for "kernel vectors: total Bytes", base 4
        (a computer oriented indication of the size of the problem)

y-axis: linear for the type of speed figure


A Rock64 .png image of an example context's graph is at:

https://github.com/markmi/acpphint/blob/master/acpphint_example_data/Rock64-cpu-pairs-oddity.png


An RPi4B .png image of an example context's graph is at:
(The y-axis range is different from the Rock64's y-range.)

https://github.com/markmi/acpphint/blob/master/acpphint_example_data/RPi4B-cpu-pairs-oddity.png

For the RPi4B graph, there is a peak for each color and
both sides of the peak show the issue, but more so on
the left side.


Notes:

The program is a C++17 variant of some of the old
HINT benchmarks. For reference, the data types involved
in the graphed data are:

ull: unsigned long long (64 bits here)
ul:  unsigned long      (also 64 bits here)

So variations between the two give some idea of
the degree of other sources of variability in the
measurements (ull and ul are essentially equivalent).
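
For illustration only (a minimal sketch of my own, assuming an
LP64 aarch64 FreeBSD target; this is not the benchmark's
actual code):

    // Both graphed element types are 64 bits wide here:
    using ull = unsigned long long;
    using ul  = unsigned long;
    static_assert(sizeof(ull) == 8, "ull is 64 bits here");
    static_assert(sizeof(ul)  == 8, "ul is also 64 bits here");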

Without the cpu lock-down code being built, the program is
not system-specific C++17 code. But building with the cpu
lock-down code does add system-specific code (FreeBSD-specific
here); a sketch of the general kind of pinning involved is
shown below.
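
For illustration only, a minimal sketch of FreeBSD-style
per-thread pinning of the general sort involved (my example,
not acpphint's actual code; it assumes <pthread_np.h>'s
pthread_setaffinity_np and is built with -lpthread):

    #include <pthread.h>
    #include <pthread_np.h>   // FreeBSD: pthread_setaffinity_np
    #include <sys/cpuset.h>   // cpuset_t, CPU_ZERO, CPU_SET
    #include <cstdio>
    #include <thread>

    // Pin the calling thread to the single cpu given.
    static void pin_current_thread_to_cpu(int cpu)
    {
        cpuset_t mask;
        CPU_ZERO(&mask);
        CPU_SET(cpu, &mask);
        if (pthread_setaffinity_np(pthread_self(), sizeof(mask),
                                   &mask) != 0)
            std::perror("pthread_setaffinity_np");
    }

    int main()
    {
        // cpu0 with cpu2: the slowest pairing reported above.
        std::thread t0([] { pin_current_thread_to_cpu(0); /* work */ });
        std::thread t2([] { pin_current_thread_to_cpu(2); /* work */ });
        t0.join();
        t2.join();
    }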

I build with g++ (even when using the system libc++ and
such instead of g++'s libraries). This is because the
resulting program happens to be more performant in every
case that I've compared. Being more performant makes
oddities easier to notice when checking for them.

Other than when building for comparisons to Linux that use
g++'s libraries, I use the FreeBSD libc++ and such because
they are more performant at creating threads under FreeBSD
(for example). Again, being more performant makes oddities
easier to notice.

===
Mark Millard
marklmi at yahoo.com
( dsl-only.net went
away in early 2018-Mar)



