Date: Sun, 7 Feb 2021 14:33:11 -0700
From: Alan Somers <asomers@freebsd.org>
To: Konstantin Belousov <kostikbel@gmail.com>
Cc: Mark Johnston <markj@freebsd.org>, Matthew Macy <mmacy@freebsd.org>, FreeBSD Stable ML <stable@freebsd.org>
Subject: Re: Page fault in _mca_init during startup
Message-ID: <CAOtMX2iF7QCNvNfU2CSseH-mgNGudZ_TCVoXuoF%2BPE9sk_TB6Q@mail.gmail.com>
In-Reply-To: <YB1%2BiUxs1ZITHaR/@kib.kiev.ua>
References: <CAPrugNofKuCZmdkb41j%2Bu%2BX0BPV-cK8WjgrBu7akuD=XezseMw@mail.gmail.com>
    <YBx8GmXvmLnwFYql@kib.kiev.ua>
    <YByC1ZXP5sNE6aHj@raichu>
    <CAOtMX2gzaSgL1SosoTYaVqWYVHALpnFSpDQQu1w%2BBEwkO_g=AQ@mail.gmail.com>
    <YByYZDEbGlSsgcwv@kib.kiev.ua>
    <CAOtMX2hY2WFvtuG2U_4PCqL8fPTqmVPKgHkmh-A88GBz85obNw@mail.gmail.com>
    <YBywA/5PHEqDJ4J4@kib.kiev.ua>
    <CAOtMX2iXXgBuXWVBmS3oorZd7UxTgvYPPh9eTSfTNvTn8q_TSw@mail.gmail.com>
    <YB1ZDMGCOL%2BJ0SWE@kib.kiev.ua>
    <CAOtMX2g1Nz8BzRUhbeygTAniVObCTT2F0_U3se2kKOhnKJbjAQ@mail.gmail.com>
    <YB1%2BiUxs1ZITHaR/@kib.kiev.ua>
On Fri, Feb 5, 2021 at 10:21 AM Konstantin Belousov <kostikbel@gmail.com> wrote:

> On Fri, Feb 05, 2021 at 09:01:26AM -0700, Alan Somers wrote:
> > On Fri, Feb 5, 2021 at 7:41 AM Konstantin Belousov <kostikbel@gmail.com> wrote:
> >
> > > On Thu, Feb 04, 2021 at 07:53:09PM -0700, Alan Somers wrote:
> > > > On Thu, Feb 4, 2021 at 7:40 PM Konstantin Belousov <kostikbel@gmail.com> wrote:
> > > >
> > > > > On Thu, Feb 04, 2021 at 07:01:30PM -0700, Alan Somers wrote:
> > > > > > On Thu, Feb 4, 2021 at 5:59 PM Konstantin Belousov <kostikbel@gmail.com> wrote:
> > > > > > > Do you have INVARIANTS enabled?  If not, I am curious if enabling them
> > > > > > > would convert that rare page fault into rare "CPU %d has more MC banks"
> > > > > > > assert.
> > > > > > >
> > > > > > > Also might be the output of the
> > > > > > > # for x in $(jot $(sysctl -n hw.ncpu) 0) ; do cpucontrol -m 0x179 /dev/cpuctl$x; done
> > > > > > > command will show the issue (0x179 is the MCG_CAP MSR).
> > > > > > > You need to load cpuctl(4) if it is not loaded yet.
> > > > > > >
> > > > > >
> > > > > > I don't have INVARIANTS enabled, and I can't enable it on the production
> > > > > > servers.  However, I can turn those three KASSERTs into VERIFYs and see
> > > > > > what happens.  Here is what your command shows on the server that panicked:
> > > > > > $ for x in $(jot $(sysctl -n hw.ncpu) 0) ; do sudo cpucontrol -m 0x179 /dev/cpuctl$x; done | uniq -c
> > > > > >   16 MSR 0x179: 0x00000000 0x0f000c14
> > > > > >   16 MSR 0x179: 0x00000000 0x0f000814
> > > > >
> > > > > It probably explains it, but it would be more telling if you left the
> > > > > output as is, so that we can see which CPUs have MCG_CMCI_P (10) bit set.
> > > > >
> > > >
> > > > I didn't sort them, so the first 16 have bit 10 set and the second 16
> > > > don't.
> > > >
> > > > >
> > > > > I suspect that your machine has two sockets, and processor in one socket
> > > > > has CPUs reporting MCG_CMCI_P, while other processor does not.  Your SMP
> > > > > is not quite symmetric, perhaps processors were from different bins?
> > > > >
> >
> > I found 2 other servers that exhibit the same problem: the first 16 cores
> > have bit 10 set and the second 16 don't.  All 3 have dual Xeon Gold 6142
> > CPUs and SuperMicro X11DPU motherboards with BIOS revision 5.12.  I have
> > other examples of X11DPU motherboards that don't exhibit the problem, but
> > they all have both different CPUs and different BIOS revisions.  So I can't
> > be sure whether the bug follows the CPU model or the BIOS version.
>
> I looked at the full spec update errata list for the first gen Skylake
> Xeons, but did not noticed anything relevant.  EDS doc does not provide
> much useful info on the MSR 0x179 bit 10 either, except rewording SDM
> definition.
>
> In fact I am not sure but this bit might be writeable by software.  Try
> to flip the bit with cpucontrol(8).  Might be it is a BIOS bug after all.
>
> If you have Intel representative contact, or Supermicro contact, try to
> engage them.  I do not have any further ideas, since spec update does not
> mention the problem.
>
> > > >
> > > > Could be.  Is there some MSR that reports a more specific version number?
> > >
> > > There are CPUID %eax=1 values returned in %eax, but then it requires
> > > some interpretation.
> > > # cpucontrol -i 1 /dev/cpuctl$x
> > > for $x iterating over the cpus.
> > >
> >
> > Apart from the Local APIC ID field, that returns the same value for all
> > processors.
> >
> > Your second patch doesn't cause any obvious problems on my dev system.
>
> I hope that you would confirm that the issue is solved by it, after some
> time.
>

Upgrading the BIOS fixed the problem, by clearing the MCG_CMCI_P bit on all
processors.  I don't have strong opinions about whether we should commit
kib's patch too.  Kib, what do you think?

-Alan
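
Note for readers of the archive: the two MSR 0x179 values quoted above
differ only in bit 10 (MCG_CMCI_P).  A minimal sketch of checking that with
plain shell arithmetic follows.  The helper name decode_mcg_cap is
hypothetical and not part of cpucontrol(8); the field positions (bank count
in bits 7:0, MCG_CMCI_P at bit 10) are assumed from the Intel SDM layout of
IA32_MCG_CAP; the inputs are the low words copied from the cpucontrol output
in this thread.

```sh
#!/bin/sh
# Decode two fields of IA32_MCG_CAP (MSR 0x179).
# decode_mcg_cap is a hypothetical helper, not part of cpucontrol(8).
# Argument: the low 32-bit word as printed by "cpucontrol -m 0x179".
decode_mcg_cap() {
    banks=$(( $1 & 0xff ))          # bits 7:0: number of MC banks
    cmci_p=$(( ($1 >> 10) & 1 ))    # bit 10: MCG_CMCI_P
    printf '%s banks=%d cmci_p=%d\n' "$1" "$banks" "$cmci_p"
}

# Values reported in this thread for the first and second group of 16 CPUs.
decode_mcg_cap 0x0f000c14
decode_mcg_cap 0x0f000814
```

Both values decode to the same bank count (20), with cmci_p=1 for the first
and cmci_p=0 for the second, which matches the asymmetry described above.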