Date:      Wed, 24 Nov 2021 13:19:16 -0800
From:      Mark Millard via arm <arm@freebsd.org>
To:        allanjude@freebsd.org, "freebsd-arm@freebsd.org" <arm@freebsd.org>
Subject:   Re: git: 32a2fed6e71f - stable/13 - openssl: Fix detection of ARMv7 and ARM64 CPU features
Message-ID:  <F68146E3-1FE6-4476-B72F-ACF3F317A038@yahoo.com>
In-Reply-To: <0CEA37B8-CE7F-4BAE-92B7-E71C5FD1BC22@yahoo.com>
References:  <0CEA37B8-CE7F-4BAE-92B7-E71C5FD1BC22@yahoo.com>



On 2021-Nov-24, at 01:51, Mark Millard <marklmi@yahoo.com> wrote:

> [Actually, the main [so: 14] equivalent.]
>
> All Cortex-A72 based . . .
>
> First, older system versions (before that update)
> then after the update:
>
>
> RPi4B 8 GiByte (older FreeBSD first, otherwise new),
> Cortex-A72's:
>
> # openssl speed -evp aes-256-gcm
> . . .
> type             16 bytes     64 bytes    256 bytes   1024 bytes   8192 bytes  16384 bytes
> aes-256-gcm      51925.92k    58449.46k    60430.32k    61050.13k    61180.98k    61482.75k
>
> type             16 bytes     64 bytes    256 bytes   1024 bytes   8192 bytes  16384 bytes
> aes-256-gcm      28880.07k    30837.33k    31630.29k    31855.62k    31921.54k    32034.53k
>
> So: slowed down, unlike the other examples below.
>
> # env OPENSSL_armcap=0 openssl speed -evp aes-256-gcm
> . . .
> type             16 bytes     64 bytes    256 bytes   1024 bytes   8192 bytes  16384 bytes
> aes-256-gcm      51894.33k    58540.45k    60815.22k    61534.47k    61906.84k    62042.10k
>
> So: back to the prior speed.
>
> But all these are based on config.txt containing:
>
> over_voltage=6
> arm_freq=2000
> sdram_freq_min=3200
> force_turbo=1
>
> (The RPi4B has a heat-sink and a fan.)
>
> Note: See later about the RPi4B CPU features.
>
>
> MACCHIATObin Double Shot (older first), Cortex-A72's:
>
> # openssl speed -evp aes-256-gcm
> . . .
> type             16 bytes     64 bytes    256 bytes   1024 bytes   8192 bytes  16384 bytes
> aes-256-gcm      50808.49k    58466.08k    60769.11k    61444.92k    61767.94k    61707.61k
>
> type             16 bytes     64 bytes    256 bytes   1024 bytes   8192 bytes  16384 bytes
> aes-256-gcm     163579.14k   456319.27k   786544.01k   940234.41k  1003230.55k  1005671.31k
>
>
> HoneyComb (older first), Cortex-A72's:
>
> # openssl speed -evp aes-256-gcm
> . . .
> type             16 bytes     64 bytes    256 bytes   1024 bytes   8192 bytes  16384 bytes
> aes-256-gcm      57659.60k    64599.05k    67719.81k    68373.74k    68724.24k    68793.80k
>
> type             16 bytes     64 bytes    256 bytes   1024 bytes   8192 bytes  16384 bytes
> aes-256-gcm     177925.57k   502311.65k   866287.95k  1036500.35k  1106598.06k  1106721.91k
>
> Rock64 (older first), Cortex-A53's:
>
> # openssl speed -evp aes-256-gcm
> . . .
> type             16 bytes     64 bytes    256 bytes   1024 bytes   8192 bytes  16384 bytes
> aes-256-gcm      18378.23k    23401.45k    24834.99k    25206.10k    25337.86k    25258.19k
>
> type             16 bytes     64 bytes    256 bytes   1024 bytes   8192 bytes  16384 bytes
> aes-256-gcm      52711.29k   163586.49k   318738.69k   420277.93k   461373.44k   463192.06k
>
>
> OPi+2E (older first), Cortex-A7's (so armv7):
>
> # openssl speed -evp aes-256-gcm
> . . .
> type             16 bytes     64 bytes    256 bytes   1024 bytes   8192 bytes  16384 bytes
> aes-256-gcm       9343.10k    11156.39k    11827.64k    11995.30k    12025.86k    12031.32k
>
> type             16 bytes     64 bytes    256 bytes   1024 bytes   8192 bytes  16384 bytes
> aes-256-gcm      11013.41k    13598.44k    14034.26k    15045.97k    15262.90k    15302.66k
>
>
>
> For reference:
>
> For the RPi4B examples (2 notes added):
>
> CPU  0: ARM Cortex-A72 r0p3 affinity:  0
>                   Cache Type = <64 byte D-cacheline,64 byte I-cacheline,PIPT ICache,64 byte ERG,64 byte CWG>
> Instruction Set Attributes 0 = <CRC32>
> *** NOTE the lack of ",SHA2,SHA1,AES+PMULL" above ***
> Instruction Set Attributes 1 = <>
>         Processor Features 0 = <AdvSIMD,FP,EL3 32,EL2 32,EL1 32,EL0 32>
>         Processor Features 1 = <>
>      Memory Model Features 0 = <TGran4,TGran64,SNSMem,BigEnd,16bit ASID,16TB PA>
>      Memory Model Features 1 = <8bit VMID>
>      Memory Model Features 2 = <32bit CCIDX,48bit VA>
>             Debug Features 0 = <DoubleLock,2 CTX BKPTs,4 Watchpoints,6 Breakpoints,PMUv3,Debugv8>
>             Debug Features 1 = <>
>         Auxiliary Features 0 = <>
>         Auxiliary Features 1 = <>
> AArch32 Instruction Set Attributes 5 = <CRC32,SEVL>
> *** NOTE the lack of ",SHA2,SHA1,AES+VMULL" above ***
> AArch32 Media and VFP Features 0 = <FPRound,FPSqrt,FPDivide,DP VFPv3+v4,SP VFPv3+v4,AdvSIMD>
> AArch32 Media and VFP Features 1 = <SIMDFMAC,FPHP DP Conv,SIMDHP SP Conv,SIMDSP,SIMDInt,SIMDLS,FPDNaN,FPFtZ>
>
> For the MACCHIATObin Double Shot examples:
>
> CPU  0: ARM Cortex-A72 r0p1 affinity:  0  0
>                   Cache Type = <64 byte D-cacheline,64 byte I-cacheline,PIPT ICache,64 byte ERG,64 byte CWG>
> Instruction Set Attributes 0 = <CRC32,SHA2,SHA1,AES+PMULL>
> Instruction Set Attributes 1 = <>
>         Processor Features 0 = <AdvSIMD,FP,EL3 32,EL2 32,EL1 32,EL0 32>
>         Processor Features 1 = <>
>      Memory Model Features 0 = <TGran4,TGran64,SNSMem,BigEnd,16bit ASID,16TB PA>
>      Memory Model Features 1 = <8bit VMID>
>      Memory Model Features 2 = <32bit CCIDX,48bit VA>
>             Debug Features 0 = <DoubleLock,2 CTX BKPTs,4 Watchpoints,6 Breakpoints,PMUv3,Debugv8>
>             Debug Features 1 = <>
>         Auxiliary Features 0 = <>
>         Auxiliary Features 1 = <>
> AArch32 Instruction Set Attributes 5 = <CRC32,SHA2,SHA1,AES+VMULL,SEVL>
> AArch32 Media and VFP Features 0 = <FPRound,FPSqrt,FPDivide,DP VFPv3+v4,SP VFPv3+v4,AdvSIMD>
> AArch32 Media and VFP Features 1 = <SIMDFMAC,FPHP DP Conv,SIMDHP SP Conv,SIMDSP,SIMDInt,SIMDLS,FPDNaN,FPFtZ>
>
>
> For the HoneyComb examples:
>
> CPU  0: ARM Cortex-A72 r0p3 affinity:  0  0
>                   Cache Type = <64 byte D-cacheline,64 byte I-cacheline,PIPT ICache,64 byte ERG,64 byte CWG>
> Instruction Set Attributes 0 = <CRC32,SHA2,SHA1,AES+PMULL>
> Instruction Set Attributes 1 = <>
>         Processor Features 0 = <GIC,AdvSIMD,FP,EL3 32,EL2 32,EL1 32,EL0 32>
>         Processor Features 1 = <>
>      Memory Model Features 0 = <TGran4,TGran64,SNSMem,BigEnd,16bit ASID,16TB PA>
>      Memory Model Features 1 = <8bit VMID>
>      Memory Model Features 2 = <32bit CCIDX,48bit VA>
>             Debug Features 0 = <DoubleLock,2 CTX BKPTs,4 Watchpoints,6 Breakpoints,PMUv3,Debugv8>
>             Debug Features 1 = <>
>         Auxiliary Features 0 = <>
>         Auxiliary Features 1 = <>
> AArch32 Instruction Set Attributes 5 = <CRC32,SHA2,SHA1,AES+VMULL,SEVL>
> AArch32 Media and VFP Features 0 = <FPRound,FPSqrt,FPDivide,DP VFPv3+v4,SP VFPv3+v4,AdvSIMD>
> AArch32 Media and VFP Features 1 = <SIMDFMAC,FPHP DP Conv,SIMDHP SP Conv,SIMDSP,SIMDInt,SIMDLS,FPDNaN,FPFtZ>
>
>
> For the Rock64 examples:
>
> CPU  0: ARM Cortex-A53 r0p4 affinity:  0
>                   Cache Type = <64 byte D-cacheline,64 byte I-cacheline,VIPT ICache,64 byte ERG,64 byte CWG>
> Instruction Set Attributes 0 = <CRC32,SHA2,SHA1,AES+PMULL>
> Instruction Set Attributes 1 = <>
>         Processor Features 0 = <AdvSIMD,FP,EL3 32,EL2 32,EL1 32,EL0 32>
>         Processor Features 1 = <>
>      Memory Model Features 0 = <TGran4,TGran64,SNSMem,BigEnd,16bit ASID,1TB PA>
>      Memory Model Features 1 = <8bit VMID>
>      Memory Model Features 2 = <32bit CCIDX,48bit VA>
>             Debug Features 0 = <DoubleLock,2 CTX BKPTs,4 Watchpoints,6 Breakpoints,PMUv3,Debugv8>
>             Debug Features 1 = <>
>         Auxiliary Features 0 = <>
>         Auxiliary Features 1 = <>
> AArch32 Instruction Set Attributes 5 = <CRC32,SHA2,SHA1,AES+VMULL,SEVL>
> AArch32 Media and VFP Features 0 = <FPRound,FPSqrt,FPDivide,DP VFPv3+v4,SP VFPv3+v4,AdvSIMD>
> AArch32 Media and VFP Features 1 = <SIMDFMAC,FPHP DP Conv,SIMDHP SP Conv,SIMDSP,SIMDInt,SIMDLS,FPDNaN,FPFtZ>
>
>
> For the OPi+2E examples:
>
> CPU: ARM Cortex-A7 r0p5 (ECO: 0x00000000)
> CPU Features:
>  Multiprocessing, Thumb2, Security, Virtualization, Generic Timer, VMSAv7,
>  PXN, LPAE, Coherent Walk
> Optional instructions:
>  SDIV/UDIV, UMULL, SMULL, SIMD(ext)
> LoUU:2 LoC:3 LoUIS:2
> Cache level 1:
> 32KB/64B 4-way data cache WB Read-Alloc Write-Alloc
> 32KB/32B 2-way instruction cache Read-Alloc
> Cache level 2:
> 512KB/64B 8-way unified cache WB Read-Alloc Write-Alloc
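
To put the quoted before/after tables side by side, here is a minimal sketch that computes the after/before throughput ratio at the 16384-byte block size. The figures are copied from the quoted openssl speed output. The dictionary layout and the function name speedup are illustrative assumptions, not part of any tool.

```python
# 16384-byte aes-256-gcm throughput (in 1000s of bytes/s) copied from the
# quoted tables, as (before the update, after the update).
RESULTS_16384 = {
    "RPi4B": (61482.75, 32034.53),
    "MACCHIATObin Double Shot": (61707.61, 1005671.31),
    "HoneyComb": (68793.80, 1106721.91),
    "Rock64": (25258.19, 463192.06),
    "OPi+2E": (12031.32, 15302.66),
}


def speedup(before, after):
    """Return after/before; a value below 1.0 means a slowdown."""
    return after / before


if __name__ == "__main__":
    for board, (before, after) in RESULTS_16384.items():
        print(f"{board:26s} {speedup(before, after):6.2f}x")
```

Only the RPi4B ratio comes out below 1.0, which matches the "slowed down" remark in the quoted text.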

Note: as the issue applies to stable/13 and main [so: 14]
(for example), I continue to use the freebsd-arm list
instead of a list that reports commits to stable/* but
not to main.

Relative to:

#define HWCAP_FP                0x00000001
#define HWCAP_ASIMD             0x00000002
#define HWCAP_EVTSTRM           0x00000004
#define HWCAP_AES               0x00000008
#define HWCAP_PMULL             0x00000010
#define HWCAP_SHA1              0x00000020
#define HWCAP_SHA2              0x00000040
#define HWCAP_CRC32             0x00000080

The single-bit enabled OPENSSL_armcap that gets the slow
result is:

# env OPENSSL_armcap=1 openssl speed -evp aes-256-gcm
. . .
type             16 bytes     64 bytes    256 bytes   1024 bytes   8192 bytes  16384 bytes
aes-256-gcm      28427.04k    30712.32k    31446.00k    31683.40k    31829.10k    31839.55k

The illegal instruction ones for aes-256-gcm were:

# env OPENSSL_armcap=4 openssl speed -evp aes-256-gcm
Doing aes-256-gcm for 3s on 16 size blocks: Illegal instruction (core dumped)

# env OPENSSL_armcap=32 openssl speed -evp aes-256-gcm
Doing aes-256-gcm for 3s on 16 size blocks: Illegal instruction (core dumped)

(For sha256, the values that produce an illegal instruction differ; see below.)

Ignoring the illegal-instruction producing bits, HWCAP_FP mixed
with any one of the other bits was also similarly slow.

Enabling all of the bits that do not produce an illegal instruction
was also similarly slow:

# env OPENSSL_armcap=219 openssl speed -evp aes-256-gcm
. . .
type             16 bytes     64 bytes    256 bytes   1024 bytes   8192 bytes  16384 bytes
aes-256-gcm      28922.63k    30711.51k    31522.15k    31722.15k    31788.97k    31845.03k

Disabling just HWCAP_FP from that got the fast category of
result:

# env OPENSSL_armcap=218 openssl speed -evp aes-256-gcm
. . .
type             16 bytes     64 bytes    256 bytes   1024 bytes   8192 bytes  16384 bytes
aes-256-gcm      49543.14k    58068.22k    60236.56k    60724.37k    61216.09k    61212.99k
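
For readers checking the arithmetic behind 219 and 218, here is a minimal sketch that decodes a mask against the HWCAP_* values listed earlier in this message. It follows this message's framing only: whether OPENSSL_armcap uses exactly these bit positions is not confirmed here, and the helper name decode is illustrative.

```python
# HWCAP_* values exactly as listed in this message.
HWCAP_BITS = {
    "HWCAP_FP": 0x00000001,
    "HWCAP_ASIMD": 0x00000002,
    "HWCAP_EVTSTRM": 0x00000004,
    "HWCAP_AES": 0x00000008,
    "HWCAP_PMULL": 0x00000010,
    "HWCAP_SHA1": 0x00000020,
    "HWCAP_SHA2": 0x00000040,
    "HWCAP_CRC32": 0x00000080,
}


def decode(mask):
    """Return the names of the listed bits that are set in mask."""
    return [name for name, bit in HWCAP_BITS.items() if mask & bit]


if __name__ == "__main__":
    for mask in (219, 218):
        print(mask, hex(mask), ", ".join(decode(mask)))
```

Under that mapping, 219 is every listed bit except 4 and 32 (the two values that gave an illegal instruction for aes-256-gcm), and 218 is the same set with the value-1 bit cleared.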


As for sha256 . . .

# env OPENSSL_armcap=0 openssl speed -evp sha256
. . .
type             16 bytes     64 bytes    256 bytes   1024 bytes   8192 bytes  16384 bytes
sha256           22434.19k    59895.91k   117258.16k   156264.31k   172624.81k   173848.52k

(I'll not list all the similar performing ones but
will list all illegal-instruction producing ones.)

# env OPENSSL_armcap=4 openssl speed -evp sha256
Doing sha256 for 3s on 16 size blocks: 4082055 sha256's in 2.99s
Doing sha256 for 3s on 64 size blocks: 2752520 sha256's in 3.02s
Doing sha256 for 3s on 256 size blocks: 1372584 sha256's in 3.03s
Doing sha256 for 3s on 1024 size blocks: 470215 sha256's in 3.11s
Doing sha256 for 3s on 8192 size blocks: 64700 sha256's in 3.07s
Doing sha256 for 3s on 16384 size blocks: 31847 sha256's in 3.00s
Illegal instruction (core dumped)
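
Since this run died before printing its summary table, the per-size lines above can be converted by hand. A minimal sketch, assuming the summary figures are in 1000s of bytes per second (as openssl speed states in its output), so throughput is count times block size divided by elapsed seconds, divided by 1000. The name throughput_k is illustrative.

```python
def throughput_k(count, block_size, seconds):
    """Throughput in 1000s of bytes per second."""
    return count * block_size / seconds / 1000.0


# (block size, operations completed, elapsed seconds), copied from the
# OPENSSL_armcap=4 sha256 run above.
RUNS = [
    (16, 4082055, 2.99),
    (64, 2752520, 3.02),
    (256, 1372584, 3.03),
    (1024, 470215, 3.11),
    (8192, 64700, 3.07),
    (16384, 31847, 3.00),
]

if __name__ == "__main__":
    for size, count, secs in RUNS:
        print(f"{size:6d} bytes  {throughput_k(count, size, secs):12.2f}k")
```

The resulting figures land close to the OPENSSL_armcap=0 sha256 table, so up to the point of the crash this run belongs with the similar-performing cases.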

# env OPENSSL_armcap=16 openssl speed -evp sha256
Doing sha256 for 3s on 16 size blocks: Illegal instruction (core dumped)

(16 worked for aes-256-gcm but 32 did not.)

So: no significantly slower examples of single enabled
bit cases.

No (non-illegal-instruction) 2-enabled-bits examples were
dissimilar for the speed.
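
The two-enabled-bits cases can be enumerated mechanically. A minimal sketch, assuming the eight single-bit values from the HWCAP_* list in this message and skipping the two values reported above as producing an illegal instruction for sha256 (4 and 16). The names are illustrative, and the script only prints command lines; it does not run them.

```python
from itertools import combinations

# Single-bit values from the HWCAP_* list in this message.
ALL_BITS = [1, 2, 4, 8, 16, 32, 64, 128]
# Values reported above as giving "Illegal instruction" for sha256.
ILLEGAL_FOR_SHA256 = {4, 16}


def two_bit_masks(bits, excluded):
    """Return every mask with exactly two of the non-excluded bits set."""
    usable = [b for b in bits if b not in excluded]
    return [a | b for a, b in combinations(usable, 2)]


if __name__ == "__main__":
    for mask in two_bit_masks(ALL_BITS, ILLEGAL_FOR_SHA256):
        print(f"env OPENSSL_armcap={mask} openssl speed -evp sha256")
```

With six usable bits this yields 15 masks. Passing {4, 32} as the excluded set instead gives the corresponding aes-256-gcm list.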

For reference (avoiding illegal-instructions):

# env OPENSSL_armcap=235 openssl speed -evp sha256
. . .
type             16 bytes     64 bytes    256 bytes   1024 bytes   8192 bytes  16384 bytes
sha256           23185.66k    62689.73k   125814.72k   167981.88k   187833.65k   188968.95k

So: also similar speed.

Need any other specific bit combinations?

===
Mark Millard
marklmi at yahoo.com
( dsl-only.net went
away in early 2018-Mar)



