Date:      Fri, 5 Feb 2021 09:01:26 -0700
From:      Alan Somers <asomers@freebsd.org>
To:        Konstantin Belousov <kostikbel@gmail.com>
Cc:        Mark Johnston <markj@freebsd.org>, Matthew Macy <mmacy@freebsd.org>,  FreeBSD Stable ML <stable@freebsd.org>
Subject:   Re: Page fault in _mca_init during startup
Message-ID:  <CAOtMX2g1Nz8BzRUhbeygTAniVObCTT2F0_U3se2kKOhnKJbjAQ@mail.gmail.com>
In-Reply-To: <YB1ZDMGCOL%2BJ0SWE@kib.kiev.ua>
References:  <CAOtMX2imwP3x-8LBKGFvMJ%2BjuD%2BsH_02yzs9XvMcCHY=jJs86A@mail.gmail.com> <CAPrugNofKuCZmdkb41j%2Bu%2BX0BPV-cK8WjgrBu7akuD=XezseMw@mail.gmail.com> <YBx8GmXvmLnwFYql@kib.kiev.ua> <YByC1ZXP5sNE6aHj@raichu> <CAOtMX2gzaSgL1SosoTYaVqWYVHALpnFSpDQQu1w%2BBEwkO_g=AQ@mail.gmail.com> <YByYZDEbGlSsgcwv@kib.kiev.ua> <CAOtMX2hY2WFvtuG2U_4PCqL8fPTqmVPKgHkmh-A88GBz85obNw@mail.gmail.com> <YBywA/5PHEqDJ4J4@kib.kiev.ua> <CAOtMX2iXXgBuXWVBmS3oorZd7UxTgvYPPh9eTSfTNvTn8q_TSw@mail.gmail.com> <YB1ZDMGCOL%2BJ0SWE@kib.kiev.ua>

On Fri, Feb 5, 2021 at 7:41 AM Konstantin Belousov <kostikbel@gmail.com>
wrote:

> On Thu, Feb 04, 2021 at 07:53:09PM -0700, Alan Somers wrote:
> > On Thu, Feb 4, 2021 at 7:40 PM Konstantin Belousov <kostikbel@gmail.com>
> > wrote:
> >
> > > On Thu, Feb 04, 2021 at 07:01:30PM -0700, Alan Somers wrote:
> > > > On Thu, Feb 4, 2021 at 5:59 PM Konstantin Belousov <kostikbel@gmail.com>
> > > > wrote:
> > > > > Do you have INVARIANTS enabled?  If not, I am curious if enabling them
> > > > > would convert that rare page fault into rare "CPU %d has more MC banks"
> > > > > assert.
> > > > >
> > > > > Also might be the output of the
> > > > > # for x in $(jot $(sysctl -n hw.ncpu) 0) ; do cpucontrol -m 0x179 /dev/cpuctl$x; done
> > > > > command will show the issue (0x179 is the MCG_CAP MSR).
> > > > > You need to load cpuctl(4) if it is not loaded yet.
> > > > >
> > > >
> > > > I don't have INVARIANTS enabled, and I can't enable it on the production
> > > > servers.  However, I can turn those three KASSERTs into VERIFYs and see
> > > > what happens.  Here is what your command shows on the server that panicked:
> > > > $ for x in $(jot $(sysctl -n hw.ncpu) 0) ; do sudo cpucontrol -m 0x179 /dev/cpuctl$x; done | uniq -c
> > > >   16 MSR 0x179: 0x00000000 0x0f000c14
> > > >   16 MSR 0x179: 0x00000000 0x0f000814
> > >
> > > It probably explains it, but it would be more telling if you left the
> > > output as is, so that we can see which CPUs have MCG_CMCI_P (10) bit set.
> > >
> >
> > I didn't sort them, so the first 16 have bit 10 set and the second 16
> > don't.
> >
> >
> > >
> > > I suspect that your machine has two sockets, and processor in one socket
> > > has CPUs reporting MCG_CMCI_P, while other processor does not. Your SMP
> > > is not quite symmetric, perhaps processors were from different bins?
>

I found 2 other servers that exhibit the same problem: the first 16 cores
have bit 10 set and the second 16 don't.  All 3 have dual Xeon Gold 6142
CPUs and SuperMicro X11DPU motherboards with BIOS revision 5.12.  I have
other examples of X11DPU motherboards that don't exhibit the problem, but
they all have both different CPUs and different BIOS revisions.  So I can't
be sure whether the bug follows the CPU model or the BIOS version.
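
For anyone who wants to check their own hardware, here is a rough sketch of
how the cpucontrol output could be decoded per CPU instead of being
collapsed with uniq.  The function name decode_mcg_cap is made up for this
example, and it assumes one "MSR 0x179: <high> <low>" line per CPU, in CPU
order, as in the output quoted above.  It only looks at the bank count
(bits 7:0) and MCG_CMCI_P (bit 10) of the low word.

```shell
#!/bin/sh
# Decode MCG_CAP (MSR 0x179) lines as printed by "cpucontrol -m 0x179".
# Each input line looks like:  MSR 0x179: 0x00000000 0x0f000c14
# (high 32 bits first, then low 32 bits).
# Assumption: the CPU index is the input line number minus one.
decode_mcg_cap() {
    cpu=0
    while read -r _msr _addr _hi lo; do
        banks=$(($lo & 0xff))        # bits 7:0: number of MC banks
        cmci=$((($lo >> 10) & 1))    # bit 10: MCG_CMCI_P
        echo "cpu=$cpu banks=$banks cmci_p=$cmci"
        cpu=$((cpu + 1))
    done
}

# The two distinct values reported in this thread, one line each.
printf '%s\n' \
    'MSR 0x179: 0x00000000 0x0f000c14' \
    'MSR 0x179: 0x00000000 0x0f000814' | decode_mcg_cap
```

On a live system, the loop from earlier in the thread could be piped
straight into the function in place of the printf, which keeps the per-CPU
ordering visible.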


> > >
> >
> > Could be.  Is there some MSR that reports a more specific version number?
> There are CPUID %eax=1 values returned in %eax, but then it requires
> some interpretation.
>         # cpucontrol -i 1 /dev/cpuctl$x
> for $x iterating over the cpus.
>

Apart from the Local APIC ID field, that returns the same value for all
processors.
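
As a sketch of the interpretation mentioned above, the CPUID leaf 1 %eax
value can be split into family, model and stepping like this.  The function
name decode_cpuid1_eax is hypothetical, the %eax value is assumed to be
copied by hand from the cpucontrol output, and the sample value below is
only illustrative; it is not taken from the servers discussed here.

```shell
#!/bin/sh
# Split a CPUID leaf 1 %eax value into display family, model and stepping.
decode_cpuid1_eax() {
    eax=$(($1))
    stepping=$((eax & 0xf))              # bits 3:0
    model=$(((eax >> 4) & 0xf))          # bits 7:4
    family=$(((eax >> 8) & 0xf))         # bits 11:8
    ext_model=$(((eax >> 16) & 0xf))     # bits 19:16
    ext_family=$(((eax >> 20) & 0xff))   # bits 27:20
    # The extended model applies to families 6 and 15;
    # the extended family applies to family 15 only.
    if [ "$family" -eq 6 ] || [ "$family" -eq 15 ]; then
        model=$(((ext_model << 4) | model))
    fi
    if [ "$family" -eq 15 ]; then
        family=$((family + ext_family))
    fi
    echo "family=$family model=$model stepping=$stepping"
}

# Illustrative value only; not from the machines in this thread.
decode_cpuid1_eax 0x00050654
```

Family, model and stepping are all that this value carries, so two
processors that differ only in some finer-grained respect would still
decode identically.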

Your second patch doesn't cause any obvious problems on my dev system.


