Date: Fri, 5 Feb 2021 04:40:03 +0200
From: Konstantin Belousov <kostikbel@gmail.com>
To: Alan Somers <asomers@freebsd.org>
Cc: Mark Johnston <markj@freebsd.org>, Matthew Macy <mmacy@freebsd.org>, FreeBSD Stable ML <stable@freebsd.org>
Subject: Re: Page fault in _mca_init during startup
Message-ID: <YBywA/5PHEqDJ4J4@kib.kiev.ua>
In-Reply-To: <CAOtMX2hY2WFvtuG2U_4PCqL8fPTqmVPKgHkmh-A88GBz85obNw@mail.gmail.com>
References: <CAOtMX2imwP3x-8LBKGFvMJ%2BjuD%2BsH_02yzs9XvMcCHY=jJs86A@mail.gmail.com>
 <CAPrugNofKuCZmdkb41j%2Bu%2BX0BPV-cK8WjgrBu7akuD=XezseMw@mail.gmail.com>
 <YBx8GmXvmLnwFYql@kib.kiev.ua>
 <YByC1ZXP5sNE6aHj@raichu>
 <CAOtMX2gzaSgL1SosoTYaVqWYVHALpnFSpDQQu1w%2BBEwkO_g=AQ@mail.gmail.com>
 <YByYZDEbGlSsgcwv@kib.kiev.ua>
 <CAOtMX2hY2WFvtuG2U_4PCqL8fPTqmVPKgHkmh-A88GBz85obNw@mail.gmail.com>

On Thu, Feb 04, 2021 at 07:01:30PM -0700, Alan Somers wrote:
> On Thu, Feb 4, 2021 at 5:59 PM Konstantin Belousov <kostikbel@gmail.com>
> wrote:
> > Do you have INVARIANTS enabled?  If not, I am curious if enabling them
> > would convert that rare page fault into rare "CPU %d has more MC banks"
> > assert.
> >
> > Also might be the output of the
> > # for x in $(jot $(sysctl -n hw.ncpu) 0) ; do cpucontrol -m 0x179 /dev/cpuctl$x; done
> > command will show the issue (0x179 is the MCG_CAP MSR).
> > You need to load cpuctl(4) if it is not loaded yet.
> >
>
> I don't have INVARIANTS enabled, and I can't enable it on the production
> servers.  However, I can turn those three KASSERTs into VERIFYs and see
> what happens.  Here is what your command shows on the server that panicked:
> $ for x in $(jot $(sysctl -n hw.ncpu) 0) ; do sudo cpucontrol -m 0x179 /dev/cpuctl$x; done | uniq -c
>   16 MSR 0x179: 0x00000000 0x0f000c14
>   16 MSR 0x179: 0x00000000 0x0f000814

It probably explains it, but it would be more telling if you left the
output as is, so that we can see which CPUs have the MCG_CMCI_P bit
(bit 10) set.

I suspect that your machine has two sockets, and the processor in one
socket has CPUs reporting MCG_CMCI_P, while the other processor does not.
Your SMP is not quite symmetric, perhaps the processors were from
different bins?

If the BSP is selected on the reporting socket, everything boots well.
If the other socket wins the BSP selection race, cmci is not initialized,
but when the per-cpu mca_init() sees the CMCI_P bit, it calls cmci_setup()
without allocated cmc state, because the BSP did not need it.

If I am right, then unconditionally allocating the memory is probably
the only choice there.

commit 2e2c925ac3b626edc6492a57a80f6b87895801c2
Author: Konstantin Belousov <kib@FreeBSD.org>
Date:   Fri Feb 5 04:32:05 2021 +0200

    x86 mca: unconditionally allocate memory for cmc state

diff --git a/sys/x86/x86/mca.c b/sys/x86/x86/mca.c
index 03100e77d455..dff3f7631f5c 100644
--- a/sys/x86/x86/mca.c
+++ b/sys/x86/x86/mca.c
@@ -1047,7 +1047,7 @@ mca_setup(uint64_t mcg_cap)
 	    "force_scan", CTLTYPE_INT | CTLFLAG_RW | CTLFLAG_MPSAFE, NULL,
 	    0, sysctl_mca_scan, "I", "Force an immediate scan for machine checks");
 #ifdef DEV_APIC
-	if (cmci_supported(mcg_cap))
+	if (cpu_vendor_id == CPU_VENDOR_INTEL)
 		cmci_setup();
 	else if (amd_thresholding_supported())
 		amd_thresholding_setup();