Date: Mon, 24 Aug 2020 23:20:10 -0700
From: Mark Millard <marklmi@yahoo.com>
To: freebsd-arm <freebsd-arm@freebsd.org>
Subject: An aarch64 4-core-SBC FreeBSD performance oddity: Rock64 and RPi4B examples
Message-ID: <255626A7-8731-4849-A5FD-50CBFB1AA6DC@yahoo.com>
References: <255626A7-8731-4849-A5FD-50CBFB1AA6DC.ref@yahoo.com>
The point here is more the systematic bias that might not be expected and might indicate something is not working as expected. (The performance details are not directly the point; they are the suggestive evidence of some sort of bias in the implementation.)

I have found that, under at least FreeBSD head -r363590, which pair of cpus (cores) is used for a 2-thread activity in the program I'm using affects the performance measurably and systematically (though not necessarily by a large amount). Basically, when the cpu pair involves cpu2 it goes slower than otherwise, in contexts where memory caches help with performance. (Otherwise RAM slowness or thread-creation time dominates the measurement involved.) There are 3 contexts for this:

A) The cpu pair does not involve cpu2. These perform similarly to each other [relative to (B) and (C) below].

B) cpu2 is in use with one of cpu1 or cpu3. These are slower than (A). But the two (B) cases are similar to each other [relative to (A) and (C)].

C) cpu2 is in use with cpu0. This is slower than both (A) and (B). This case also seems to have somewhat more variability in the performance compared to (A) and (B).

The Rock64 and RPi4B have very different memory-subsystem performance-related behavior overall, but the above summary still applies to both. I've not seen such differences for, say, a RPi4B Ubuntu context. I've not tested other example contexts.

I limit the cpus/cores via cpuset use on the command line. I can build the program involved with or without code that locks down each thread in the test to a distinct cpu/core within what cpuset is told to allow, vs. allowing migration to occur between those cpus/cores. (No cpuset use would be needed for 4 cores on the example SBCs.) The effect is measurable both ways.

I test both at the boot -s command prompt and at normal login command prompts. (Variations in competing activity, including for RAM cache use.)
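For concreteness, the cpuset restrictions described above can be sketched as command-line invocations like the following. This is only an illustration: `./acpphint` stands in for however the benchmark binary is actually named and invoked; the `-l` core lists are what vary across the three cases.

```shell
# Case (A): a cpu pair not involving cpu2, e.g. cpu0 and cpu1.
cpuset -l 0,1 ./acpphint

# Case (B): cpu2 paired with one of cpu1 or cpu3.
cpuset -l 1,2 ./acpphint
cpuset -l 2,3 ./acpphint

# Case (C): cpu2 paired with cpu0 (the slowest combination reported).
cpuset -l 0,2 ./acpphint
```

Note that cpuset(1) restricts the process (and all its threads) to the listed cpus; without the per-thread lock-down code built in, the two threads may still migrate between the two allowed cpus.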
I have tested two distinct RPi4B's but have access to only one Rock64. All 3 contexts show the general structure reported.

As for graphs showing examples . . .

In the graphs of the results, the colored curves are the cpu-pair curves (green, blue, red). I provide dark grey single-threaded and 4-core curves as context for comparison. If any 3-thread examples are included for comparison: light grey.

green: cpu pair does not involve cpu2 (fastest)
red: cpu pair is cpu0 and cpu2 (slowest)
blue: cpu pair involves cpu2 but not cpu0 (between)

(The single-threaded curve(s) are the most different from the others on each SBC, so they stand out.)

I'll note that, for the x-axis and multi-threaded, being to the left means thread creation is a larger fraction of the overall time for that size, and that limits the y-axis figure for the size. (For multi-threaded, thread creations are part of what is measured for each size of problem.)

x-axis: logarithmic for "kernel vectors: total Bytes", base 4 (a computer-oriented indication of the size of the problem)
y-axis: linear for the type of speed figure

A Rock64 .png image of an example context's graph is at:

https://github.com/markmi/acpphint/blob/master/acpphint_example_data/Rock64-cpu-pairs-oddity.png

A RPi4B .png image of an example context's graph is at (the y-axis range is different from the Rock64's y-range):

https://github.com/markmi/acpphint/blob/master/acpphint_example_data/RPi4B-cpu-pairs-oddity.png

For the RPi4B graph, there is a peak for each color, and both sides of the peak show the issue, but more so on the left side.

Notes:

The program is a C++17 variant of some of the old HINT benchmarks. For reference, the data types involved in the graphed data are:

ull: unsigned long long (64 bits here)
ul: unsigned long (also 64 bits here)

Variations between the two give some idea of the degree of other sources of variability in the measurements (ull and ul are essentially equivalent).
Without the cpu lock-down code being built, the program is not system-specific C++17 code. But building with the cpu lock-down code does add system-specific code (FreeBSD-specific here).

I build with g++ (even when using the system libc++ and such instead of g++'s libraries). This is because the resulting program happens to be more performant in every case that I've compared. Being more performant makes things easier to notice when checking for oddities.

Other than building for comparisons to Linux, which uses g++'s libraries, I use the FreeBSD libc++ and such because they are more performant at creating threads under FreeBSD (for example). Being more performant . . .

===
Mark Millard
marklmi at yahoo.com
( dsl-only.net went away in early 2018-Mar)