Date:      Sat, 13 May 2023 12:49:46 -0700
From:      Mark Millard <marklmi@yahoo.com>
To:        freebsd-arm <freebsd-arm@freebsd.org>
Subject:   Re: -mcpu= selections and the Windows Dev Kit 2023: example from-scratch buildkernel times (after kernel-toolchain)
Message-ID:  <049ED1F8-CA62-4564-8635-4EFCF008ED9D@yahoo.com>
In-Reply-To: <6196193E-4A75-464C-AB0B-AE2C3BC00D66@yahoo.com>
References:  <3B5EB0DD-E9CB-41BD-9BCC-6549BBF0C0DA@yahoo.com> <6196193E-4A75-464C-AB0B-AE2C3BC00D66@yahoo.com>

On May 13, 2023, at 01:50, Mark Millard <marklmi@yahoo.com> wrote:

> On May 13, 2023, at 01:28, Mark Millard <marklmi@yahoo.com> wrote:
>
>> While the selections were guided by some benchmark-like
>> explorations, the results for the Windows Dev Kit 2023
>> (WDK23 abbreviation) go like:
>>
>>
>> -mcpu=cortex-a72 code generation produced a (non-debug)
>> kernel/world that, in turn, got (from scratch buildkernel after
>> kernel-toolchain):
>>
>> Kernel(s)  GENERIC-NODBG-CA72 built in 597 seconds, ncpu: 8, make -j8
>>
>> (The rest of the aarch64 that I've access to is nearly-all cortex-a72
>> based, the others being cortex-a53 these days. So I was seeing how
>> code tailored for the cortex-a72 context performed on the WDK23.
>> cortex-a72 was my starting point with the WDK23.)
>>
>>
>> -mcpu=cortex-x1c+flagm code generation produced a (non-debug)
>> kernel/world that, in turn, got (from scratch buildkernel after
>> kernel-toolchain):
>>
>> Kernel(s)  GENERIC-NODBG-CA78C built in 584 seconds, ncpu: 8, make -j8
>>
>> NOTE: "+flagm" is there because various clang/gcc versions have an
>> inaccurate set of default features that omits flagm, so I'm making
>> sure I've got it enabled. -mcpu=cortex-a78c is even worse: some
>> toolchains enable +fp16fml by default for it, but neither of the
>> 2 types of core has support for such. (The cortex-x1c and cortex-a78c
>> actually have matching features for code generation purposes, at
>> least for all that I looked at. Toolchain mismatches for default
>> features are sufficient evidence of an error in at least one case
>> as far as I can tell.)
>>
>> This context is implicitly +lse+rcpc . At the time I was not being
>> explicit when defaults matched.
>>
>> Notes:
>> "lse" is the large system extension atomics, disabled below.
>> "rcpc" is the extension having load acquire and store release
>> instructions. (rcpc I was explicit about below, despite the
>> default matching.)
>>
>>
>> -mcpu=cortex-x1c+flagm+nolse+rcpc code generation produced a
>> (non-debug) kernel/world that, in turn, got (from scratch buildkernel
>> after kernel-toolchain):
>>
>> Kernel(s)  GENERIC-NODBG-CA78CnoLSE built in 415 seconds, ncpu: 8, make -j
>>
>> Note: My explorations so far have tried the world combinations of
>> lse and rcpc status but with a kernel that was based on
>> -mcpu=3Dcortex-x1c+flagm . I then updated the kernel to match the
>> -mcpu=3Dcortex-x1c+flagm+nolse+rcpc and used it to produce the above.
>> So there is more exploring that I've not done yet. But I'm not
>> expecting decreases to notably below the 415 sec.
>>
>> The benchmark-like activity had shown that +lse+rcpc for the
>> world/benchmark builds led to notable negative consequences for
>> cpus 0..3 compared to the other 3 combinations of status. For
>> cpus 4..7, it showed that +nolse+rcpc for the world/benchmark
>> builds had a noticeable gain compared to the other 3 combinations.
>> This guided the buildkernel testing selections done so far. The
>> buildkernel tests were, in part, to be sure that the apparent
>> consequences were not just artifacts of the time measurements
>> that could make the benchmark result comparisons unreliable.
>>
>>
>> For comparison to a standard FreeBSD non-debug build, I used a
>> snapshot download of:
>>
>> http://ftp3.freebsd.org/pub/FreeBSD/snapshots/ISO-IMAGES/13.2/FreeBSD-13.2-STABLE-arm64-aarch64-ROCK64-20230504-7dea7445ba44-255298.img.xz
>>
>> and dd'd it to media, replaced the EFI/*/* with ones that
>> work for the Windows Dev Kit 2023, booted the WDK23 with the media,
>> copied over my /usr/*-src/ to the media, did a "make -j8 kernel-toolchain",
>> from the /usr/main-src/ copy and finally did a "make -j8 buildkernel"
>> (so, from-scratch, given the toolchain materials are already in place):
>>
>> Kernel(s)  GENERIC built in 505 seconds, ncpu: 8, make -j8
>>
>> ( /usr/main-src/ has the source that the other buildkernel timings
>> were based on. )
>>
>>
>> Looks like -mcpu=cortex-a72 and -mcpu=cortex-x1c+flagm are far from
>> a good fit for buildkernel workloads to run under on the WDK23. FreeBSD
>> defaults and -mcpu=cortex-x1c+flagm+nolse+rcpc seem to be better fits
>> for such use.
>>
>>
>> Note: This testing was in a ZFS context, using bectl to advantage, in
>> case that somehow matters.
>>
>>
>> For reference:
>>
>> # grep mcpu= /usr/main-src/sys/arm64/conf/GENERIC-NODBG-CA78C
>> makeoptions CONF_CFLAGS="-mcpu=cortex-x1c+flagm+nolse+rcpc"
>>
>> # grep mcpu= ~/src.configs/*CA78C-nodbg*
>> XCFLAGS+= -mcpu=cortex-x1c+flagm+nolse+rcpc
>> XCXXFLAGS+= -mcpu=cortex-x1c+flagm+nolse+rcpc
>> ACFLAGS.arm64cpuid.S+=  -mcpu=cortex-x1c
>> ACFLAGS.aesv8-armx.S+=  -mcpu=cortex-x1c
>> ACFLAGS.ghashv8-armx.S+=        -mcpu=cortex-x1c
>>
>> # more /usr/local/etc/poudriere.d/main-CA78C-make.conf
>> CFLAGS+= -mcpu=cortex-x1c+flagm+nolse+rcpc
>> CXXFLAGS+= -mcpu=cortex-x1c+flagm+nolse+rcpc
>> CPPFLAGS+= -mcpu=cortex-x1c+flagm+nolse+rcpc
>> RUSTFLAGS_CPU_FEATURES= -C target-cpu=cortex-x1c -C target-feature=+x1c,+flagm,-lse,+rcpc
>
> Note: RUSTFLAGS_CPU_FEATURES is something that I added to my
> environment to allow the experiment:
>
> # git -C /usr/ports/ diff Mk/Uses/cargo.mk
> diff --git a/Mk/Uses/cargo.mk b/Mk/Uses/cargo.mk
> index 50146372fee1..2f21453fd02b 100644
> --- a/Mk/Uses/cargo.mk
> +++ b/Mk/Uses/cargo.mk
> @@ -145,7 +145,9 @@ WITH_LTO=   yes
> .  endif
>   # Adjust -C target-cpu if -march/-mcpu is set by bsd.cpu.mk
> -.  if ${ARCH} == amd64 || ${ARCH} == i386
> +.  if defined(RUSTFLAGS_CPU_FEATURES)
> +RUSTFLAGS+=    ${RUSTFLAGS_CPU_FEATURES}
> +.  elif ${ARCH} == amd64 || ${ARCH} == i386
> RUSTFLAGS+=    ${CFLAGS:M-march=*:S/-march=/-C target-cpu=/}
> .  elif ${ARCH:Mpowerpc*}
> RUSTFLAGS+=    ${CFLAGS:M-mcpu=*:S/-mcpu=/-C target-cpu=/:S/power/pwr/}
>
>> diff --git a/secure/lib/libcrypto/Makefile b/secure/lib/libcrypto/Makefile
>> index 8fde4f19d046..e13227d6450b 100644
>> --- a/secure/lib/libcrypto/Makefile
>> +++ b/secure/lib/libcrypto/Makefile
>> @@ -22,7 +22,7 @@ SRCS+=        mem.c mem_dbg.c mem_sec.c o_dir.c o_fips.c o_fopen.c o_init.c
>> SRCS+= o_str.c o_time.c threads_pthread.c uid.c
>> .if defined(ASM_aarch64)
>> SRCS+= arm64cpuid.S armcap.c
>> -ACFLAGS.arm64cpuid.S=  -march=armv8-a+crypto
>> +ACFLAGS.arm64cpuid.S+= -march=armv8-a+crypto
>> .elif defined(ASM_amd64)
>> SRCS+= x86_64cpuid.S
>> .elif defined(ASM_arm)
>> @@ -43,7 +43,7 @@ SRCS+=        mem_clr.c
>> SRCS+= aes_cbc.c aes_cfb.c aes_ecb.c aes_ige.c aes_misc.c aes_ofb.c aes_wrap.c
>> .if defined(ASM_aarch64)
>> SRCS+= aes_core.c aesv8-armx.S vpaes-armv8.S
>> -ACFLAGS.aesv8-armx.S=  -march=armv8-a+crypto
>> +ACFLAGS.aesv8-armx.S+= -march=armv8-a+crypto
>> .elif defined(ASM_amd64)
>> SRCS+= aes_core.c aesni-mb-x86_64.S aesni-sha1-x86_64.S aesni-sha256-x86_64.S
>> SRCS+= aesni-x86_64.S vpaes-x86_64.S
>> @@ -278,7 +278,7 @@ SRCS+=      cbc128.c ccm128.c cfb128.c ctr128.c cts128.c gcm128.c ocb128.c
>> SRCS+= ofb128.c wrap128.c xts128.c
>> .if defined(ASM_aarch64)
>> SRCS+= ghashv8-armx.S
>> -ACFLAGS.ghashv8-armx.S=        -march=armv8-a+crypto
>> +ACFLAGS.ghashv8-armx.S+=       -march=armv8-a+crypto


I'll probably not do any more exploring of kernel
vs. world cortex-x1c/cortex-a78c feature use vs.
not combinations.
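
For putting the quoted buildkernel figures side by side, here is a
minimal Python sketch. It only redoes arithmetic on the four timings
quoted above; the dictionary labels and the helper name
percent_vs_baseline are made up for illustration. The GENERIC figure
came from a snapshot-booted system rather than from my own world, so
treat the result as a rough comparison, not a controlled one.

```python
# Buildkernel wall-clock times (seconds) as quoted above, all with ncpu: 8.
# The labels are descriptive strings chosen for this sketch only.
build_seconds = {
    "snapshot GENERIC (FreeBSD defaults)": 505,
    "-mcpu=cortex-a72": 597,
    "-mcpu=cortex-x1c+flagm": 584,
    "-mcpu=cortex-x1c+flagm+nolse+rcpc": 415,
}


def percent_vs_baseline(seconds, baseline):
    """Signed percent change of a build time relative to the baseline."""
    return 100.0 * (seconds - baseline) / baseline


baseline = build_seconds["snapshot GENERIC (FreeBSD defaults)"]
for label, seconds in build_seconds.items():
    change = percent_vs_baseline(seconds, baseline)
    print(f"{label:38s} {seconds:4d} s  {change:+6.1f}%")
```

With these inputs the cortex-a72 and cortex-x1c+flagm contexts come out
roughly 18% and 16% slower than the snapshot figure, and the
+nolse+rcpc context roughly 18% faster.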

My "-mcpu=cortex-x1c+flagm context" based from-scratch
build of my ports took somewhat over 15 hrs on the WDK23:

[main-CA78C-default] [2023-05-10_01h26m04s] [committing:] Queued: 480 Built: 480 Failed: 0   Skipped: 0   Ignored: 0   Fetched: 0   Tobuild: 0    Time: 15:08:47

Beyond using a -mcpu=cortex-x1c+flagm+nolse+rcpc based
context now, I've also recently changed the build sequence
to use 2 stages, to help avoid a long tail of the build that
is largely one process at a time (single threaded):

poudriere bulk -jmain-CA78C -w -f ~/origins/build-first.txt
poudriere bulk -jmain-CA78C -w -f ~/origins/CA78C-origins.txt

# more ~/origins/build-first.txt
devel/binutils
devel/boost-jam
devel/llvm16
devel/llvm15
lang/rust

(Actually my test was without boost-jam being listed.
I added that after the test.  I also later added
PRIORITY_BOOST="boost-libs" to etc/poudriere.conf .
CA78C-origins.txt also lists those port origins, along
with the rest of the things I explicitly want built.)

The above, in my context, happens to lead to devel/boost-libs
building in parallel with other activity.

I use a high-load-average-allowed style of building ports
into packages: ALLOW_MAKE_JOBS=yes and the default number
of builders, so up to 8 on the WDK23. Also: USE_TMPFS=all
(based on about 118 GiBytes of swap, so RAM+SWAP approx= 150
GiBytes. Observed swap use got up to a little under 13
GiBytes but was not thrashing.)

(This style would not scale well at some point but works
for what I have access to, even the ThreadRipper 1950X
with its 128 GiBytes of RAM and 32 FreeBSD "cpus". It
has more swap configured.)

Those, combined with the -mcpu=cortex-x1c+flagm+nolse+rcpc
use, have from-scratch port builds down to slightly over
10 hours on the WDK23:

[main-CA78C-default] [2023-05-13_01h31m02s] [committing:] Queued: 99 Built: 99 Failed: 0  Skipped: 0  Ignored: 0  Fetched: 0  Tobuild: 0   Time: 05:53:58
[main-CA78C-default] [2023-05-13_07h25m03s] [committing:] Queued: 381 Built: 381 Failed: 0   Skipped: 0   Ignored: 0   Fetched: 0   Tobuild: 0    Time: 04:07:07
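
The "slightly over 10 hours" is just the sum of the two Time: fields
above. A small Python sketch of that arithmetic, compared against the
earlier 15:08:47 single-run figure, follows. The helper name parse_hms
is made up for this sketch, and the sum ignores the few seconds between
the end of the first bulk run and the start of the second.

```python
from datetime import timedelta


def parse_hms(text):
    """Convert an 'HH:MM:SS' poudriere Time: field into a timedelta."""
    hours, minutes, seconds = (int(part) for part in text.split(":"))
    return timedelta(hours=hours, minutes=minutes, seconds=seconds)


# Earlier single bulk run in the -mcpu=cortex-x1c+flagm context.
single_stage = parse_hms("15:08:47")
# The two bulk runs in the -mcpu=cortex-x1c+flagm+nolse+rcpc context.
two_stage = parse_hms("05:53:58") + parse_hms("04:07:07")

print("two-stage total:", two_stage)
print("difference:     ", single_stage - two_stage)
print(f"ratio:           {two_stage / single_stage:.2f}")
```

That gives 10:01:05 for the two stages together, 5:07:42 less than the
earlier run, or about two thirds of the earlier time. Note that the
-mcpu change and the 2-stage change were made together, so this does
not separate their individual contributions.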

This context was ZFS.  I've not done a UFS-context test yet.


===
Mark Millard
marklmi at yahoo.com



