Date:      Sun, 7 Feb 2021 14:33:11 -0700
From:      Alan Somers <asomers@freebsd.org>
To:        Konstantin Belousov <kostikbel@gmail.com>
Cc:        Mark Johnston <markj@freebsd.org>, Matthew Macy <mmacy@freebsd.org>,  FreeBSD Stable ML <stable@freebsd.org>
Subject:   Re: Page fault in _mca_init during startup
Message-ID:  <CAOtMX2iF7QCNvNfU2CSseH-mgNGudZ_TCVoXuoF%2BPE9sk_TB6Q@mail.gmail.com>
In-Reply-To: <YB1%2BiUxs1ZITHaR/@kib.kiev.ua>
References:  <CAPrugNofKuCZmdkb41j%2Bu%2BX0BPV-cK8WjgrBu7akuD=XezseMw@mail.gmail.com> <YBx8GmXvmLnwFYql@kib.kiev.ua> <YByC1ZXP5sNE6aHj@raichu> <CAOtMX2gzaSgL1SosoTYaVqWYVHALpnFSpDQQu1w%2BBEwkO_g=AQ@mail.gmail.com> <YByYZDEbGlSsgcwv@kib.kiev.ua> <CAOtMX2hY2WFvtuG2U_4PCqL8fPTqmVPKgHkmh-A88GBz85obNw@mail.gmail.com> <YBywA/5PHEqDJ4J4@kib.kiev.ua> <CAOtMX2iXXgBuXWVBmS3oorZd7UxTgvYPPh9eTSfTNvTn8q_TSw@mail.gmail.com> <YB1ZDMGCOL%2BJ0SWE@kib.kiev.ua> <CAOtMX2g1Nz8BzRUhbeygTAniVObCTT2F0_U3se2kKOhnKJbjAQ@mail.gmail.com> <YB1%2BiUxs1ZITHaR/@kib.kiev.ua>

On Fri, Feb 5, 2021 at 10:21 AM Konstantin Belousov <kostikbel@gmail.com>
wrote:

> On Fri, Feb 05, 2021 at 09:01:26AM -0700, Alan Somers wrote:
> > On Fri, Feb 5, 2021 at 7:41 AM Konstantin Belousov <kostikbel@gmail.com>
> > wrote:
> >
> > > On Thu, Feb 04, 2021 at 07:53:09PM -0700, Alan Somers wrote:
> > > > On Thu, Feb 4, 2021 at 7:40 PM Konstantin Belousov <kostikbel@gmail.com>
> > > > wrote:
> > > >
> > > > > On Thu, Feb 04, 2021 at 07:01:30PM -0700, Alan Somers wrote:
> > > > > > On Thu, Feb 4, 2021 at 5:59 PM Konstantin Belousov <kostikbel@gmail.com>
> > > > > > wrote:
> > > > > > > Do you have INVARIANTS enabled?  If not, I am curious if enabling them
> > > > > > > would convert that rare page fault into rare "CPU %d has more MC banks"
> > > > > > > assert.
> > > > > > >
> > > > > > > Also, the output of the
> > > > > > > # for x in $(jot $(sysctl -n hw.ncpu) 0) ; do cpucontrol -m 0x179 /dev/cpuctl$x; done
> > > > > > > command might show the issue (0x179 is the MCG_CAP MSR).
> > > > > > > You need to load cpuctl(4) if it is not loaded yet.
> > > > > > >
> > > > > >
> > > > > > I don't have INVARIANTS enabled, and I can't enable it on the production
> > > > > > servers.  However, I can turn those three KASSERTs into VERIFYs and see
> > > > > > what happens.  Here is what your command shows on the server that panicked:
> > > > > > $ for x in $(jot $(sysctl -n hw.ncpu) 0) ; do sudo cpucontrol -m 0x179 /dev/cpuctl$x; done | uniq -c
> > > > > >   16 MSR 0x179: 0x00000000 0x0f000c14
> > > > > >   16 MSR 0x179: 0x00000000 0x0f000814
> > > > >
> > > > > It probably explains it, but it would be more telling if you left the
> > > > > output as is, so that we can see which CPUs have the MCG_CMCI_P bit
> > > > > (bit 10) set.
> > > > >
> > > >
> > > > I didn't sort them, so the first 16 have bit 10 set and the second 16
> > > > don't.
> > > >
> > > >
> > > > >
> > > > > I suspect that your machine has two sockets, and the processor in one
> > > > > socket has CPUs reporting MCG_CMCI_P, while the other processor does not.
> > > > > Your SMP is not quite symmetric; perhaps the processors were from
> > > > > different bins?
> > >
> >
> > I found 2 other servers that exhibit the same problem: the first 16 cores
> > have bit 10 set and the second 16 don't.  All 3 have dual Xeon Gold 6142
> > CPUs and SuperMicro X11DPU motherboards with BIOS revision 5.12.  I have
> > other examples of X11DPU motherboards that don't exhibit the problem, but
> > they all have both different CPUs and different BIOS revisions.  So I can't
> > be sure whether the bug follows the CPU model or the BIOS version.
> I looked at the full spec update errata list for the first gen Skylake
> Xeons, but did not notice anything relevant. The EDS doc does not provide
> much useful info on the MSR 0x179 bit 10 either, except rewording the SDM
> definition.
>
> In fact, I am not sure, but this bit might be writable by software. Try
> to flip the bit with cpucontrol(8). It might be a BIOS bug after all.
>
> If you have a contact at Intel or Supermicro, try to engage them.  I do
> not have any further ideas, since the spec update does not mention the
> problem.
>
> >
> >
> > > > >
> > > >
> > > > Could be.  Is there some MSR that reports a more specific version number?
> > > There are CPUID %eax=1 values returned in %eax, but then it requires
> > > some interpretation.
> > >         # cpucontrol -i 1 /dev/cpuctl$x
> > > for $x iterating over the cpus.
> > >
> >
> > Apart from the Local APIC ID field, that returns the same value for all
> > processors.
> >
> > Your second patch doesn't cause any obvious problems on my dev system.
> I hope that you would confirm that the issue is solved by it, after some
> time.
>

Upgrading the BIOS fixed the problem, by clearing the MCG_CMCI_P bit on all
processors.  I don't have strong opinions about whether we should commit
kib's patch too.  Kib, what do you think?
-Alan
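
For readers following the thread, the per-CPU view requested earlier (which CPUs report bit 10, MCG_CMCI_P) can be sketched as a small sh filter over the cpucontrol(8) output. This is only an illustration, not something posted in the thread: the function name decode_mcg_cap is hypothetical, and it assumes that each input line has the "MSR 0x179: <high> <low>" form quoted above and that line N corresponds to /dev/cpuctlN. The low byte of MCG_CAP is the bank count; bit 10 is the flag discussed here.

```sh
#!/bin/sh
# decode_mcg_cap (hypothetical helper): read "cpucontrol -m 0x179" lines
# from stdin and print, per CPU, the MC bank count and the MCG_CMCI_P bit.
# Assumption: input lines look like "MSR 0x179: 0x00000000 0x0f000c14"
# and arrive in CPU order, starting at CPU 0.
decode_mcg_cap() {
    cpu=0
    while read -r _tag _msr _hi lo; do
        banks=$(( $lo & 0xff ))        # bits 7:0: number of MC banks
        cmci=$(( ($lo >> 10) & 1 ))    # bit 10: MCG_CMCI_P
        echo "cpu=$cpu banks=$banks cmci_p=$cmci"
        cpu=$(( cpu + 1 ))
    done
}
```

On a live system the loop from the thread could be piped into it, without uniq -c, for example "for x in $(jot $(sysctl -n hw.ncpu) 0) ; do cpucontrol -m 0x179 /dev/cpuctl$x; done | decode_mcg_cap". CPUs whose cmci_p value differs from the rest are the ones that would produce the asymmetry described above.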


