Date: Wed, 24 Nov 2021 13:23:13 -0800
From: Mark Millard via arm <arm@freebsd.org>
To: allanjude@freebsd.org, "freebsd-arm@freebsd.org" <arm@freebsd.org>
Subject: Re: git: 32a2fed6e71f - stable/13 - openssl: Fix detection of ARMv7 and ARM64 CPU features
Message-ID: <AF9491B0-2F97-459E-9BD9-32354DAB86C9@yahoo.com>
In-Reply-To: <F68146E3-1FE6-4476-B72F-ACF3F317A038@yahoo.com>
References: <0CEA37B8-CE7F-4BAE-92B7-E71C5FD1BC22@yahoo.com> <F68146E3-1FE6-4476-B72F-ACF3F317A038@yahoo.com>

On 2021-Nov-24, at 13:19, Mark Millard <marklmi@yahoo.com> wrote:

> On 2021-Nov-24, at 01:51, Mark Millard <marklmi@yahoo.com> wrote:
>
>> [Actually, the main [so: 14] equivalent.]
>>
>> All Cortex-A72 based . . .
>>
>> First, older system versions (before that update)
>> then after the update:
>>
>>
>> RPi4B 8 GiByte (older FreeBSD first, otherwise new),
>> Cortex-A72's:
>>
>> # openssl speed -evp aes-256-gcm
>> . . .
>> type             16 bytes     64 bytes    256 bytes   1024 bytes   8192 bytes  16384 bytes
>> aes-256-gcm     51925.92k    58449.46k    60430.32k    61050.13k    61180.98k    61482.75k
>>
>> type             16 bytes     64 bytes    256 bytes   1024 bytes   8192 bytes  16384 bytes
>> aes-256-gcm     28880.07k    30837.33k    31630.29k    31855.62k    31921.54k    32034.53k
>>
>> So: slowed down, unlike the other examples below.
>>
>> # env OPENSSL_armcap=0 openssl speed -evp aes-256-gcm
>> . . .
>> type             16 bytes     64 bytes    256 bytes   1024 bytes   8192 bytes  16384 bytes
>> aes-256-gcm     51894.33k    58540.45k    60815.22k    61534.47k    61906.84k    62042.10k
>>
>> So: back to the prior speed.
>>
>> But all these are based on config.txt containing:
>>
>> over_voltage=6
>> arm_freq=2000
>> sdram_freq_min=3200
>> force_turbo=1
>>
>> (The RPi4B has a heat-sink and a fan.)
>>
>> Note: See later about the RPi4B CPU features.
>>
>>
>> MACCHIATObin Double Shot (older first), Cortex-A72's:
>>
>> # openssl speed -evp aes-256-gcm
>> . . .
>> type             16 bytes     64 bytes    256 bytes   1024 bytes   8192 bytes  16384 bytes
>> aes-256-gcm     50808.49k    58466.08k    60769.11k    61444.92k    61767.94k    61707.61k
>>
>> type             16 bytes     64 bytes    256 bytes   1024 bytes   8192 bytes  16384 bytes
>> aes-256-gcm    163579.14k   456319.27k   786544.01k   940234.41k  1003230.55k  1005671.31k
>>
>>
>> HoneyComb (older first), Cortex-A72's:
>>
>> # openssl speed -evp aes-256-gcm
>> . . .
>> type             16 bytes     64 bytes    256 bytes   1024 bytes   8192 bytes  16384 bytes
>> aes-256-gcm     57659.60k    64599.05k    67719.81k    68373.74k    68724.24k    68793.80k
>>
>> type             16 bytes     64 bytes    256 bytes   1024 bytes   8192 bytes  16384 bytes
>> aes-256-gcm    177925.57k   502311.65k   866287.95k  1036500.35k  1106598.06k  1106721.91k
>>
>> Rock64 (older first), Cortex-A53's:
>>
>> # openssl speed -evp aes-256-gcm
>> . . .
>> type             16 bytes     64 bytes    256 bytes   1024 bytes   8192 bytes  16384 bytes
>> aes-256-gcm     18378.23k    23401.45k    24834.99k    25206.10k    25337.86k    25258.19k
>>
>> type             16 bytes     64 bytes    256 bytes   1024 bytes   8192 bytes  16384 bytes
>> aes-256-gcm     52711.29k   163586.49k   318738.69k   420277.93k   461373.44k   463192.06k
>>
>>
>> OPi+2E (older first), Cortex-A7's (so armv7):
>>
>> # openssl speed -evp aes-256-gcm
>> . . .
>> type             16 bytes     64 bytes    256 bytes   1024 bytes   8192 bytes  16384 bytes
>> aes-256-gcm      9343.10k    11156.39k    11827.64k    11995.30k    12025.86k    12031.32k
>>
>> type             16 bytes     64 bytes    256 bytes   1024 bytes   8192 bytes  16384 bytes
>> aes-256-gcm     11013.41k    13598.44k    14034.26k    15045.97k    15262.90k    15302.66k
>>
>>
>>
>> For reference:
>>
>> For the RPi4B examples (2 notes added):
>>
>> CPU  0: ARM Cortex-A72 r0p3 affinity:  0
>>   Cache Type = <64 byte D-cacheline,64 byte I-cacheline,PIPT ICache,64 byte ERG,64 byte CWG>
>>   Instruction Set Attributes 0 = <CRC32>
>> *** NOTE the lack of ",SHA2,SHA1,AES+PMULL" above ***
>>   Instruction Set Attributes 1 = <>
>>   Processor Features 0 = <AdvSIMD,FP,EL3 32,EL2 32,EL1 32,EL0 32>
>>   Processor Features 1 = <>
>>   Memory Model Features 0 = <TGran4,TGran64,SNSMem,BigEnd,16bit ASID,16TB PA>
>>   Memory Model Features 1 = <8bit VMID>
>>   Memory Model Features 2 = <32bit CCIDX,48bit VA>
>>   Debug Features 0 = <DoubleLock,2 CTX BKPTs,4 Watchpoints,6 Breakpoints,PMUv3,Debugv8>
>>   Debug Features 1 = <>
>>   Auxiliary Features 0 = <>
>>   Auxiliary Features 1 = <>
>>   AArch32 Instruction Set Attributes 5 = <CRC32,SEVL>
>> *** NOTE the lack of ",SHA2,SHA1,AES+VMULL" above ***
>>   AArch32 Media and VFP Features 0 = <FPRound,FPSqrt,FPDivide,DP VFPv3+v4,SP VFPv3+v4,AdvSIMD>
>>   AArch32 Media and VFP Features 1 = <SIMDFMAC,FPHP DP Conv,SIMDHP SP Conv,SIMDSP,SIMDInt,SIMDLS,FPDNaN,FPFtZ>
>>
>> For the MACCHIATObin Double Shot examples:
>>
>> CPU  0: ARM Cortex-A72 r0p1 affinity:  0  0
>>   Cache Type = <64 byte D-cacheline,64 byte I-cacheline,PIPT ICache,64 byte ERG,64 byte CWG>
>>   Instruction Set Attributes 0 = <CRC32,SHA2,SHA1,AES+PMULL>
>>   Instruction Set Attributes 1 = <>
>>   Processor Features 0 = <AdvSIMD,FP,EL3 32,EL2 32,EL1 32,EL0 32>
>>   Processor Features 1 = <>
>>   Memory Model Features 0 = <TGran4,TGran64,SNSMem,BigEnd,16bit ASID,16TB PA>
>>   Memory Model Features 1 = <8bit VMID>
>>   Memory Model Features 2 = <32bit CCIDX,48bit VA>
>>   Debug Features 0 = <DoubleLock,2 CTX BKPTs,4 Watchpoints,6 Breakpoints,PMUv3,Debugv8>
>>   Debug Features 1 = <>
>>   Auxiliary Features 0 = <>
>>   Auxiliary Features 1 = <>
>>   AArch32 Instruction Set Attributes 5 = <CRC32,SHA2,SHA1,AES+VMULL,SEVL>
>>   AArch32 Media and VFP Features 0 = <FPRound,FPSqrt,FPDivide,DP VFPv3+v4,SP VFPv3+v4,AdvSIMD>
>>   AArch32 Media and VFP Features 1 = <SIMDFMAC,FPHP DP Conv,SIMDHP SP Conv,SIMDSP,SIMDInt,SIMDLS,FPDNaN,FPFtZ>
>>
>>
>> For the HoneyComb examples:
>>
>> CPU  0: ARM Cortex-A72 r0p3 affinity:  0  0
>>   Cache Type = <64 byte D-cacheline,64 byte I-cacheline,PIPT ICache,64 byte ERG,64 byte CWG>
>>   Instruction Set Attributes 0 = <CRC32,SHA2,SHA1,AES+PMULL>
>>   Instruction Set Attributes 1 = <>
>>   Processor Features 0 = <GIC,AdvSIMD,FP,EL3 32,EL2 32,EL1 32,EL0 32>
>>   Processor Features 1 = <>
>>   Memory Model Features 0 = <TGran4,TGran64,SNSMem,BigEnd,16bit ASID,16TB PA>
>>   Memory Model Features 1 = <8bit VMID>
>>   Memory Model Features 2 = <32bit CCIDX,48bit VA>
>>   Debug Features 0 = <DoubleLock,2 CTX BKPTs,4 Watchpoints,6 Breakpoints,PMUv3,Debugv8>
>>   Debug Features 1 = <>
>>   Auxiliary Features 0 = <>
>>   Auxiliary Features 1 = <>
>>   AArch32 Instruction Set Attributes 5 = <CRC32,SHA2,SHA1,AES+VMULL,SEVL>
>>   AArch32 Media and VFP Features 0 = <FPRound,FPSqrt,FPDivide,DP VFPv3+v4,SP VFPv3+v4,AdvSIMD>
>>   AArch32 Media and VFP Features 1 = <SIMDFMAC,FPHP DP Conv,SIMDHP SP Conv,SIMDSP,SIMDInt,SIMDLS,FPDNaN,FPFtZ>
>>
>>
>>
>>
>> For the Rock64 examples:
>>
>> CPU  0: ARM Cortex-A53 r0p4 affinity:  0
>>   Cache Type = <64 byte D-cacheline,64 byte I-cacheline,VIPT ICache,64 byte ERG,64 byte CWG>
>>   Instruction Set Attributes 0 = <CRC32,SHA2,SHA1,AES+PMULL>
>>   Instruction Set Attributes 1 = <>
>>   Processor Features 0 = <AdvSIMD,FP,EL3 32,EL2 32,EL1 32,EL0 32>
>>   Processor Features 1 = <>
>>   Memory Model Features 0 = <TGran4,TGran64,SNSMem,BigEnd,16bit ASID,1TB PA>
>>   Memory Model Features 1 = <8bit VMID>
>>   Memory Model Features 2 = <32bit CCIDX,48bit VA>
>>   Debug Features 0 = <DoubleLock,2 CTX BKPTs,4 Watchpoints,6 Breakpoints,PMUv3,Debugv8>
>>   Debug Features 1 = <>
>>   Auxiliary Features 0 = <>
>>   Auxiliary Features 1 = <>
>>   AArch32 Instruction Set Attributes 5 = <CRC32,SHA2,SHA1,AES+VMULL,SEVL>
>>   AArch32 Media and VFP Features 0 = <FPRound,FPSqrt,FPDivide,DP VFPv3+v4,SP VFPv3+v4,AdvSIMD>
>>   AArch32 Media and VFP Features 1 = <SIMDFMAC,FPHP DP Conv,SIMDHP SP Conv,SIMDSP,SIMDInt,SIMDLS,FPDNaN,FPFtZ>
>>
>>
>> For the OPi+2E examples:
>>
>> CPU: ARM Cortex-A7 r0p5 (ECO: 0x00000000)
>> CPU Features:
>>   Multiprocessing, Thumb2, Security, Virtualization, Generic Timer, VMSAv7,
>>   PXN, LPAE, Coherent Walk
>> Optional instructions:
>>   SDIV/UDIV, UMULL, SMULL, SIMD(ext)
>> LoUU:2 LoC:3 LoUIS:2
>> Cache level 1:
>>  32KB/64B 4-way data cache WB Read-Alloc Write-Alloc
>>  32KB/32B 2-way instruction cache Read-Alloc
>> Cache level 2:
>>  512KB/64B 8-way unified cache WB Read-Alloc Write-Alloc
>
> Note: as the issue applies to stable/13 and main [so: 14]
> (for example), I continue to use the freebsd-arm list
> instead of a list that reports commits to stable/* but
> not to main.
>
> Relative to:
>
> #define HWCAP_FP                0x00000001
> #define HWCAP_ASIMD             0x00000002
> #define HWCAP_EVTSTRM           0x00000004
> #define HWCAP_AES               0x00000008
> #define HWCAP_PMULL             0x00000010
> #define HWCAP_SHA1              0x00000020
> #define HWCAP_SHA2              0x00000040
> #define HWCAP_CRC32             0x00000080
>
> The single-bit enabled OPENSSL_armcap that gets the slow
> result is:
>
> # env OPENSSL_armcap=1 openssl speed -evp aes-256-gcm
> . . .
> type             16 bytes     64 bytes    256 bytes   1024 bytes   8192 bytes  16384 bytes
> aes-256-gcm     28427.04k    30712.32k    31446.00k    31683.40k    31829.10k    31839.55k
>
> The illegal instruction ones for aes-256-gcm were:
>
> # env OPENSSL_armcap=4 openssl speed -evp aes-256-gcm
> Doing aes-256-gcm for 3s on 16 size blocks: Illegal instruction (core dumped)
>
> # env OPENSSL_armcap=32 openssl speed -evp aes-256-gcm
> Doing aes-256-gcm for 3s on 16 size blocks: Illegal instruction (core dumped)
>
> (sha256 does not match for what is illegal.)
>
> Ignoring the illegal-instruction producing bits, HWCAP_FP mixed
> with any one of the other bits was also similarly slow.
>
> As for all the non-illegal-instruction producing bits: also similarly
> slow:
>
> # env OPENSSL_armcap=219 openssl speed -evp aes-256-gcm
> . . .
> type             16 bytes     64 bytes    256 bytes   1024 bytes   8192 bytes  16384 bytes
> aes-256-gcm     28922.63k    30711.51k    31522.15k    31722.15k    31788.97k    31845.03k
>
> Disabling just HWCAP_FP from that got the fast category of
> result:
>
> # env OPENSSL_armcap=218 openssl speed -evp aes-256-gcm
> . . .
> type             16 bytes     64 bytes    256 bytes   1024 bytes   8192 bytes  16384 bytes
> aes-256-gcm     49543.14k    58068.22k    60236.56k    60724.37k    61216.09k    61212.99k
>
>
> As for sha256 . . .
>
> # env OPENSSL_armcap=0 openssl speed -evp sha256
> . . .
> type             16 bytes     64 bytes    256 bytes   1024 bytes   8192 bytes  16384 bytes
> sha256          22434.19k    59895.91k   117258.16k   156264.31k   172624.81k   173848.52k
>
> (I'll not list all the similar performing ones but
> will list all illegal-instruction producing ones.)
>
> # env OPENSSL_armcap=4 openssl speed -evp sha256
> Doing sha256 for 3s on 16 size blocks: 4082055 sha256's in 2.99s
> Doing sha256 for 3s on 64 size blocks: 2752520 sha256's in 3.02s
> Doing sha256 for 3s on 256 size blocks: 1372584 sha256's in 3.03s
> Doing sha256 for 3s on 1024 size blocks: 470215 sha256's in 3.11s
> Doing sha256 for 3s on 8192 size blocks: 64700 sha256's in 3.07s
> Doing sha256 for 3s on 16384 size blocks: 31847 sha256's in 3.00s
> Illegal instruction (core dumped)
>
> # env OPENSSL_armcap=16 openssl speed -evp sha256
> Doing sha256 for 3s on 16 size blocks: Illegal instruction (core dumped)
>
> (16 worked for aes-256-gcm but 32 did not.)
>
> So: no significantly slower examples of single enabled
> bit cases.
>
> No (non-illegal-instruction) 2-enabled-bits examples were
> dissimilar for the speed.

Incorrect description of what I tested: I tested only
2-bit combinations involving HWCAP_FP being enabled.
(Same as for aes-256-gcm.)

> For reference (avoiding illegal-instructions):
>
> # env OPENSSL_armcap=235 openssl speed -evp sha256
> . . .
> type             16 bytes     64 bytes    256 bytes   1024 bytes   8192 bytes  16384 bytes
> sha256          23185.66k    62689.73k   125814.72k   167981.88k   187833.65k   188968.95k
>
> So: also similar speed.
>
> Need any other specific bit combinations?

===
Mark Millard
marklmi at yahoo.com
( dsl-only.net went away in
early 2018-Mar)
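
For anyone checking the arithmetic behind the OPENSSL_armcap values 219, 218 and 235 used in this message, here is a minimal sketch. It assumes the values are read against the HWCAP_* defines quoted in the message. Whether OpenSSL assigns the same meaning to each OPENSSL_armcap bit is not established here, so treat the names as this message's framing only. The helper names decode_armcap and missing_bits are hypothetical and are not part of OpenSSL or FreeBSD.

```python
# Bit values copied from the HWCAP_* defines quoted in the message.
# Assumption: OPENSSL_armcap values are interpreted against these defines,
# as the message does; OpenSSL's own bit assignments may differ.
HWCAP_BITS = {
    "HWCAP_FP": 0x00000001,
    "HWCAP_ASIMD": 0x00000002,
    "HWCAP_EVTSTRM": 0x00000004,
    "HWCAP_AES": 0x00000008,
    "HWCAP_PMULL": 0x00000010,
    "HWCAP_SHA1": 0x00000020,
    "HWCAP_SHA2": 0x00000040,
    "HWCAP_CRC32": 0x00000080,
}


def decode_armcap(value):
    """Return the HWCAP_* names whose bit is set in value (hypothetical helper)."""
    return [name for name, bit in HWCAP_BITS.items() if value & bit]


def missing_bits(value):
    """Return the HWCAP_* names whose bit is clear in value (hypothetical helper)."""
    return [name for name, bit in HWCAP_BITS.items() if not value & bit]


# The three multi-bit values tried in the message.
for armcap in (219, 218, 235):
    print(armcap, hex(armcap))
    print("  set:    ", ", ".join(decode_armcap(armcap)))
    print("  missing:", ", ".join(missing_bits(armcap)))
```

Under that reading, 219 leaves out HWCAP_EVTSTRM and HWCAP_SHA1 (the values 4 and 32 that produced illegal instructions for aes-256-gcm), 218 additionally leaves out HWCAP_FP, and 235 leaves out HWCAP_EVTSTRM and HWCAP_PMULL (the values 4 and 16 that produced illegal instructions for sha256).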