Date:      Sat, 12 Aug 2023 10:05:35 -0700
From:      Kevin Oberman <rkoberman@gmail.com>
To:        Michael Butler <imb@protected-networks.net>
Cc:        freebsd-current@freebsd.org
Subject:   Re: [Intel AlderLake] Read&Write files to FAT32 or UFS partition cause data corrupt due to P-Core&E-Core
Message-ID:  <CAN6yY1voyPqTNWWG6qECx3FZOK3X1W2tN8teCXvSSE90V28O%2BA@mail.gmail.com>
In-Reply-To: <4f0fbb44-eebe-aa8f-f958-dcd678936fe1@protected-networks.net>
References:  <YhE1rWoA%2BhMfebq/@kib.kiev.ua> <59cbcfe2-cd53-69d8-65d6-7a79e656f494@FreeBSD.org> <YhVnsB5ZwLYmpAFP@kib.kiev.ua> <1f968af1-1c57-9a09-7e01-145a5262e27f@FreeBSD.org> <YhVyFIFA5XnbGHej@kib.kiev.ua> <20230806181238.858f58e25dfd0f99269cfe53@dec.sakura.ne.jp> <ZM9t--jEqyc4_Z4t@kib.kiev.ua> <20230808063735.e8e1d3ede370a18f200a6f48@dec.sakura.ne.jp> <ZNI3VoFklDaSED59@kib.kiev.ua> <20230808224612.c3889d6e20b6fc980f5278cc@dec.sakura.ne.jp> <ZNJK-PPUhm00ndXs@kib.kiev.ua> <20230808235635.744e0e1c6a72face7fdf6a9b@dec.sakura.ne.jp> <4f0fbb44-eebe-aa8f-f958-dcd678936fe1@protected-networks.net>


On Tue, Aug 8, 2023 at 10:50 AM Michael Butler <imb@protected-networks.net>
wrote:

> On 8/8/23 10:56, Tomoaki AOKI wrote:
> > On Tue, 8 Aug 2023 17:02:32 +0300
> > Konstantin Belousov <kostikbel@gmail.com> wrote:
>
>   [ .. snip .. ]
>
> >> The workaround is switched on automatically, when kernel detects 'small cores'
> >> reported by CPUID.
> >
> > If I read the code correctly, vm.pmap.pcid_invlpg_workaround
> > (precicely, the corresponding variable) is set to non-zero when the
> > workaround is enabled. Not sure it was detected correctly at the
> > original reporter's environment, but forcibly setting the tunable to 1
> > didn't reported to help sufficiently.
> > Currently, only setting tunable vm.pmap.pcid_enabled to 0 could help.
>
> I'm seeing similar stability problems on an N95-based device. This too
> is an Alderlake-N device with only E-cores although I'm running it with
> a compilation with CPUTYPE=tremont .. from an older, verbose start-up ..
>
> PPIM 0: PA=0x4000000000, VA=0xffffffff82710000, size=0x1d5000, mode=0x1
> pmap: large map 8 PML4 slots (4096 GB)
> VT(efifb): resolution 800x600
> Preloaded elf kernel "/boot/kernel.new/kernel" at 0xffffffff8234e000.
> Preloaded boot_entropy_cache "/boot/entropy" at 0xffffffff82357d08.
> Preloaded cpu_microcode "/boot/firmware/intel-ucode.bin" at 0xffffffff82357d60.
> Preloaded hostuuid "/etc/hostid" at 0xffffffff82357dc0.
> Preloaded TSLOG data "TSLOG" at 0xffffffff82357e10.
> CPU: Intel(R) N95 (1689.60-MHz K8-class CPU)
>    Origin="GenuineIntel"  Id=0xb06e0  Family=0x6  Model=0xbe  Stepping=0
>    Features=0xbfebfbff<FPU,VME,DE,PSE,TSC,MSR,PAE,MCE,CX8,APIC,SEP,MTRR,PGE,MCA,CMOV,PAT,PSE36,CLFLUSH,DTS,ACPI,MMX,FXSR,SSE,SSE2,SS,HTT,TM,PBE>
>    Features2=0x7ffafbbf<SSE3,PCLMULQDQ,DTES64,MON,DS_CPL,VMX,EST,TM2,SSSE3,SDBG,FMA,CX16,xTPR,PDCM,PCID,SSE4.1,SSE4.2,x2APIC,MOVBE,POPCNT,TSCDLT,AESNI,XSAVE,OSXSAVE,AVX,F16C,RDRAND>
>    AMD Features=0x2c100800<SYSCALL,NX,Page1GB,RDTSCP,LM>
>    AMD Features2=0x121<LAHF,ABM,Prefetch>
>    Structured Extended Features=0x239ca7eb<FSGSBASE,TSCADJ,BMI1,AVX2,FDPEXC,SMEP,BMI2,ERMS,INVPCID,NFPUSG,PQE,RDSEED,ADX,SMAP,CLFLUSHOPT,CLWB,PROCTRACE,SHA>
>    Structured Extended Features2=0x98c007bc<UMIP,PKU,OSPKE,WAITPKG,GFNI,VAES,VPCLMULQDQ,RDPID,MOVDIRI,MOVDIR64B>
>    Structured Extended Features3=0xfc184410<FSRM,MD_CLEAR,IBT,IBPB,STIBP,L1DFL,ARCH_CAP,CORE_CAP,SSBD>
>    XSAVE Features=0xf<XSAVEOPT,XSAVEC,XINUSE,XSAVES>
>    IA32_ARCH_CAPS=0x180fd6b<RDCL_NO,IBRS_ALL,SKIP_L1DFL_VME,MDS_NO,TAA_NO>
>    VT-x: Basic Features=0x3da0500<SMM,INS/OUTS,TRUE>
>          Pin-Based Controls=0xff<ExtINT,NMI,VNMI,PreTmr,PostIntr>
>          Primary Processor Controls=0xfffbfffe<INTWIN,TSCOff,HLT,INVLPG,MWAIT,RDPMC,RDTSC,CR3-LD,CR3-ST,CR8-LD,CR8-ST,TPR,NMIWIN,MOV-DR,IO,IOmap,MTF,MSRmap,MONITOR,PAUSE>
>          Secondary Processor Controls=0x75d7fff<APIC,EPT,DT,RDTSCP,x2APIC,VPID,WBINVD,UG,APIC-reg,VID,PAUSE-loop,RDRAND,INVPCID,VMFUNC,VMCS,XSAVES>
>          Exit Controls=0x3da0500<PAT-LD,EFER-SV,PTMR-SV>
>          Entry Controls=0x3da0500
>          EPT Features=0x6f34141<XO,PW4,UC,WB,2M,1G,INVEPT,AD,single,all>
>          VPID Features=0xf01<INVVPID,individual,single,all,single-globals>
>    TSC: P-state invariant, performance statistics
> 64-Byte prefetching
> L2 cache: 2048 kbytes, 16-way associative, 64 bytes/line
> real memory  = 17179869184 (16384 MB)
> Physical memory chunk(s):
> 0x0000000000010000 - 0x000000000009dfff, 581632 bytes (142 pages)
> 0x000000000009f000 - 0x000000000009ffff, 4096 bytes (1 pages)
> 0x0000000000100000 - 0x000000005fffffff, 1609564160 bytes (392960 pages)
> 0x0000000062401000 - 0x000000007264dfff, 270848000 bytes (66125 pages)
> 0x0000000075fff000 - 0x0000000075ffffff, 4096 bytes (1 pages)
> 0x0000000100001000 - 0x0000000462497fff, 14533881856 bytes (3548311 pages)
> 0x000000047fa00000 - 0x000000047fb68fff, 1478656 bytes (361 pages)
> avail memory = 16363008000 (15604 MB)
> CPU microcode: updated from 0xc to 0x10
> MADT: Found CPU APIC ID 0 ACPI ID 0: enabled
> SMP: Added CPU 0 (AP)
> MADT: Found CPU APIC ID 2 ACPI ID 1: enabled
> SMP: Added CPU 2 (AP)
> MADT: Found CPU APIC ID 4 ACPI ID 2: enabled
> SMP: Added CPU 4 (AP)
> MADT: Found CPU APIC ID 6 ACPI ID 3: enabled
> SMP: Added CPU 6 (AP)
>
> On start-up, vm.pmap.pcid_invlpg_workaround=1 but seemingly random
> faults still occurred under load, for example, 'make buildworld'.
> Apparent misreads of source-files resulting in syntax errors were the
> most common symptom. Compilation reattempts (mostly) succeed.
>
> Initially, I put this down to an inadequate power-supply but setting
> vm.pmap.pcid_enabled=0 seems to have stabilised it.
>
> I guess there's another dragon in there .. :-(
>
>         Michae
>

Just to add another report (on the wrong mailing list, as this one is on a
system running 13.2): I have a very similar system from a different
manufacturer with the same Alder Lake processor. I will note that the SSD
interface is SATA, not NVMe. I was getting crashes and corrupt file
systems, especially when installing large ports and when using rsync to
back up the system. I see many almost identical systems on Amazon that use
the same form factor, CPU, SSD, RAM, etc., probably all built around the
same motherboard from a single manufacturer. There are going to be more
reports like this, as these boxes generally cost <$225 US. (Mine was a bit
more expensive, to get a VGA connector for my ancient monitor.)

I had not tried the tunable, but largely resolved the issue by installing
a 250 MB hard drive and putting the system there. In the couple of months
since I did this I have had two crashes, both while doing a full backup with
rsync. This leads me to think that some sort of race triggers the problem,
and that the slow disk speed of spinning rust makes it much less likely to
hit.
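
One rough way to look for the kind of misread Michael describes, without
waiting for rsync to crash the box, is to read the same file several times
and compare checksums. The following is only a sketch: reread_check is a
hypothetical helper name, it relies on nothing beyond POSIX sh and cksum,
and repeated reads are mostly served from memory, so a clean run proves
nothing about the disk path.

```sh
#!/bin/sh
# reread_check FILE [ROUNDS]
# Hypothetical helper: checksum FILE ROUNDS times (default 5) and report
# whether any pass disagrees with the first one.
# Returns 0 if all passes match, 1 on a mismatch, 2 if FILE cannot be read.
reread_check() {
    file=$1
    rounds=${2:-5}
    first=$(cksum < "$file") || return 2
    i=1
    while [ "$i" -lt "$rounds" ]; do
        next=$(cksum < "$file") || return 2
        if [ "$next" != "$first" ]; then
            echo "MISMATCH on pass $((i + 1)): $file"
            return 1
        fi
        i=$((i + 1))
    done
    echo "OK: $file ($rounds passes)"
}
```

Running it over a source tree while the machine is under load (for
example, reread_check on each file found by find) would be closer to the
conditions under which the faults were reported.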

I am considering moving the system back to the SSD with
vm.pmap.pcid_enabled=0. If I do, any failure should show up very quickly,
as I never could keep the system up on the SSD long enough to get it into
production.
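
For anyone wanting to try the same thing, a minimal sketch of the change,
assuming the tunable is set from /boot/loader.conf and picked up at the
next boot (the comments are mine, not from the thread):

```
# /boot/loader.conf
# Disable PCID use in the amd64 pmap, the setting reported in this
# thread to stabilise Alder Lake-N systems.
vm.pmap.pcid_enabled="0"
```

After rebooting, "sysctl vm.pmap.pcid_enabled vm.pmap.pcid_invlpg_workaround"
should show the values actually in effect.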
-- 
Kevin Oberman, Part time kid herder and retired Network Engineer
E-mail: rkoberman@gmail.com
PGP Fingerprint: D03FB98AFA78E3B78C1694B318AB39EF1B055683

