Date:      Thu, 4 Feb 2021 19:53:09 -0700
From:      Alan Somers <asomers@freebsd.org>
To:        Konstantin Belousov <kostikbel@gmail.com>
Cc:        Mark Johnston <markj@freebsd.org>, Matthew Macy <mmacy@freebsd.org>,  FreeBSD Stable ML <stable@freebsd.org>
Subject:   Re: Page fault in _mca_init during startup
Message-ID:  <CAOtMX2iXXgBuXWVBmS3oorZd7UxTgvYPPh9eTSfTNvTn8q_TSw@mail.gmail.com>
In-Reply-To: <YBywA/5PHEqDJ4J4@kib.kiev.ua>
References:  <CAOtMX2imwP3x-8LBKGFvMJ%2BjuD%2BsH_02yzs9XvMcCHY=jJs86A@mail.gmail.com> <CAPrugNofKuCZmdkb41j%2Bu%2BX0BPV-cK8WjgrBu7akuD=XezseMw@mail.gmail.com> <YBx8GmXvmLnwFYql@kib.kiev.ua> <YByC1ZXP5sNE6aHj@raichu> <CAOtMX2gzaSgL1SosoTYaVqWYVHALpnFSpDQQu1w%2BBEwkO_g=AQ@mail.gmail.com> <YByYZDEbGlSsgcwv@kib.kiev.ua> <CAOtMX2hY2WFvtuG2U_4PCqL8fPTqmVPKgHkmh-A88GBz85obNw@mail.gmail.com> <YBywA/5PHEqDJ4J4@kib.kiev.ua>

On Thu, Feb 4, 2021 at 7:40 PM Konstantin Belousov <kostikbel@gmail.com>
wrote:

> On Thu, Feb 04, 2021 at 07:01:30PM -0700, Alan Somers wrote:
> > On Thu, Feb 4, 2021 at 5:59 PM Konstantin Belousov <kostikbel@gmail.com>
> > wrote:
> > > Do you have INVARIANTS enabled?  If not, I am curious if enabling them
> > > would convert that rare page fault into rare "CPU %d has more MC banks"
> > > assert.
> > >
> > > Also might be the output of the
> > > # for x in $(jot $(sysctl -n hw.ncpu) 0) ; do cpucontrol -m 0x179 /dev/cpuctl$x; done
> > > command will show the issue (0x179 is the MCG_CAP MSR).
> > > You need to load cpuctl(4) if it is not loaded yet.
> > >
> >
> > I don't have INVARIANTS enabled, and I can't enable it on the production
> > servers.  However, I can turn those three KASSERTs into VERIFYs and see
> > what happens.  Here is what your command shows on the server that panicked:
> > $ for x in $(jot $(sysctl -n hw.ncpu) 0) ; do sudo cpucontrol -m 0x179 /dev/cpuctl$x; done | uniq -c
> >   16 MSR 0x179: 0x00000000 0x0f000c14
> >   16 MSR 0x179: 0x00000000 0x0f000814
>
> It probably explains it, but it would be more telling if you left the
> output as is, so that we can see which CPUs have MCG_CMCI_P (10) bit set.
>

I didn't sort them, so the first 16 have bit 10 set and the second 16
don't.
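
For anyone reading along, the two values can be decoded by hand. Below is a
minimal C sketch, assuming the IA32_MCG_CAP layout from the Intel SDM (bits
7:0 are the bank count, bit 10 is MCG_CMCI_P) and assuming cpucontrol prints
the high 32 bits before the low 32 bits. The helper names are hypothetical,
and the macro names are only meant to resemble the ones in specialreg.h.

```c
#include <stdbool.h>
#include <stdint.h>

/* Assumed field layout of IA32_MCG_CAP (MSR 0x179). */
#define MCG_CAP_COUNT	0x000000ffULL	/* bits 7:0: number of MC banks */
#define MCG_CAP_CMCI_P	0x00000400ULL	/* bit 10: CMCI supported */

/* Rebuild the 64-bit MSR value from the two 32-bit halves as printed. */
static uint64_t
mcg_cap_from_halves(uint32_t high, uint32_t low)
{
	return (((uint64_t)high << 32) | low);
}

/* Number of machine-check banks the CPU reports. */
static unsigned
mcg_cap_bank_count(uint64_t mcg_cap)
{
	return ((unsigned)(mcg_cap & MCG_CAP_COUNT));
}

/* True if the CPU advertises CMCI support. */
static bool
mcg_cap_has_cmci(uint64_t mcg_cap)
{
	return ((mcg_cap & MCG_CAP_CMCI_P) != 0);
}
```

With the output above, 0x0f000c14 decodes to 20 banks with CMCI_P set, and
0x0f000814 decodes to 20 banks with CMCI_P clear.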


>
> I suspect that your machine has two sockets, and processor in one socket
> has CPUs reporting MCG_CMCI_P, while other processor does not. Your SMP
> is not quite symmetric, perhaps processors were from different bins?
>

Could be.  Is there some MSR that reports a more specific version number?


>
> If BSP is selected on reporting socket, everything boots well. If
> other socket wins the BSP selection race, cmci is not initialized, but
> when per-cpu mca_init() sees CMCI_P bit, it calls cmci_setup() without
> allocated cmc state, because BSP did not need it.
>
> If I am right, then unconditionally allocating the memory is probably the
> only choice there.
>
> commit 2e2c925ac3b626edc6492a57a80f6b87895801c2
> Author: Konstantin Belousov <kib@FreeBSD.org>
> Date:   Fri Feb 5 04:32:05 2021 +0200
>
>     x86 mca: unconditionally allocate memory for cmc state
>
> diff --git a/sys/x86/x86/mca.c b/sys/x86/x86/mca.c
> index 03100e77d455..dff3f7631f5c 100644
> --- a/sys/x86/x86/mca.c
> +++ b/sys/x86/x86/mca.c
> @@ -1047,7 +1047,7 @@ mca_setup(uint64_t mcg_cap)
>             "force_scan", CTLTYPE_INT | CTLFLAG_RW | CTLFLAG_MPSAFE, NULL, 0,
>             sysctl_mca_scan, "I", "Force an immediate scan for machine checks");
>  #ifdef DEV_APIC
> -       if (cmci_supported(mcg_cap))
> +       if (cpu_vendor_id == CPU_VENDOR_INTEL)
>                 cmci_setup();
>         else if (amd_thresholding_supported())
>                 amd_thresholding_setup();
>
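
To make sure I follow the failure mode, here is a toy userland model of it in
C. It is not the kernel code: struct cmc_state's contents, the model_*
functions, and the "unconditional" flag (standing in for the patched
condition) are all hypothetical names invented for this sketch. It only
mirrors the logic described above, where the BSP decides whether the state is
allocated and each CPU later uses that state if its own CMCI_P bit is set.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdlib.h>

/* Hypothetical stand-in for the per-bank CMC state. */
struct cmc_state {
	int max_threshold;
};

/* Shared state pointer; NULL until the BSP-side setup allocates it. */
static struct cmc_state *cmc_state;

/*
 * BSP-side setup.  The old behavior allocates only when the BSP itself
 * reports CMCI_P; unconditional == true models the patched behavior.
 */
static void
model_bsp_setup(bool bsp_has_cmci, bool unconditional, int nbanks)
{
	free(cmc_state);
	cmc_state = NULL;
	if (unconditional || bsp_has_cmci)
		cmc_state = calloc((size_t)nbanks, sizeof(*cmc_state));
}

/*
 * Per-CPU init.  Returns false in the case where the real kernel would
 * touch the unallocated state and fault.
 */
static bool
model_cpu_init(bool cpu_has_cmci)
{
	if (!cpu_has_cmci)
		return (true);		/* CPU never looks at cmc_state */
	return (cmc_state != NULL);	/* CPU would use cmc_state */
}
```

In this model, a BSP without CMCI_P followed by an AP with CMCI_P fails under
the old condition and succeeds once the allocation is unconditional, which
matches the asymmetric boot behavior described above.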


