Date: Fri, 5 Feb 2021 16:41:16 +0200
From: Konstantin Belousov <kostikbel@gmail.com>
To: Alan Somers <asomers@freebsd.org>
Cc: Mark Johnston <markj@freebsd.org>, Matthew Macy <mmacy@freebsd.org>, FreeBSD Stable ML <stable@freebsd.org>
Subject: Re: Page fault in _mca_init during startup
Message-ID: <YB1ZDMGCOL%2BJ0SWE@kib.kiev.ua>
In-Reply-To: <CAOtMX2iXXgBuXWVBmS3oorZd7UxTgvYPPh9eTSfTNvTn8q_TSw@mail.gmail.com>
References: <CAOtMX2imwP3x-8LBKGFvMJ%2BjuD%2BsH_02yzs9XvMcCHY=jJs86A@mail.gmail.com>
 <CAPrugNofKuCZmdkb41j%2Bu%2BX0BPV-cK8WjgrBu7akuD=XezseMw@mail.gmail.com>
 <YBx8GmXvmLnwFYql@kib.kiev.ua>
 <YByC1ZXP5sNE6aHj@raichu>
 <CAOtMX2gzaSgL1SosoTYaVqWYVHALpnFSpDQQu1w%2BBEwkO_g=AQ@mail.gmail.com>
 <YByYZDEbGlSsgcwv@kib.kiev.ua>
 <CAOtMX2hY2WFvtuG2U_4PCqL8fPTqmVPKgHkmh-A88GBz85obNw@mail.gmail.com>
 <YBywA/5PHEqDJ4J4@kib.kiev.ua>
 <CAOtMX2iXXgBuXWVBmS3oorZd7UxTgvYPPh9eTSfTNvTn8q_TSw@mail.gmail.com>
On Thu, Feb 04, 2021 at 07:53:09PM -0700, Alan Somers wrote:
> On Thu, Feb 4, 2021 at 7:40 PM Konstantin Belousov <kostikbel@gmail.com>
> wrote:
>
> > On Thu, Feb 04, 2021 at 07:01:30PM -0700, Alan Somers wrote:
> > > On Thu, Feb 4, 2021 at 5:59 PM Konstantin Belousov <kostikbel@gmail.com>
> > > wrote:
> > > > Do you have INVARIANTS enabled?  If not, I am curious if enabling them
> > > > would convert that rare page fault into rare "CPU %d has more MC banks"
> > > > assert.
> > > >
> > > > Also might be the output of the
> > > > # for x in $(jot $(sysctl -n hw.ncpu) 0) ; do cpucontrol -m 0x179 /dev/cpuctl$x; done
> > > > command will show the issue (0x179 is the MCG_CAP MSR).
> > > > You need to load cpuctl(4) if it is not loaded yet.
> > > >
> > >
> > > I don't have INVARIANTS enabled, and I can't enable it on the production
> > > servers.  However, I can turn those three KASSERTs into VERIFYs and see
> > > what happens.  Here is what your command shows on the server that
> > > panicked:
> > > $ for x in $(jot $(sysctl -n hw.ncpu) 0) ; do sudo cpucontrol -m 0x179 /dev/cpuctl$x; done | uniq -c
> > >   16 MSR 0x179: 0x00000000 0x0f000c14
> > >   16 MSR 0x179: 0x00000000 0x0f000814
> >
> > It probably explains it, but it would be more telling if you left the
> > output as is, so that we can see which CPUs have MCG_CMCI_P (10) bit set.
> >
>
> I didn't sort them, so the first 16 have bit 10 set and the second 16
> don't.
>
> >
> > I suspect that your machine has two sockets, and processor in one socket
> > has CPUs reporting MCG_CMCI_P, while other processor does not.  Your SMP
> > is not quite symmetric, perhaps processors were from different bins?
> >
>
> Could be.  Is there some MSR that reports a more specific version number?

There are CPUID %eax=1 values returned in %eax, but then it requires some
interpretation.
# cpucontrol -i 1 /dev/cpuctl$x
for $x iterating over the cpus.
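
For reference, a minimal sh sketch of the interpretation discussed above. It
assumes the second word printed by "cpucontrol -m 0x179" is the low 32 bits
of MCG_CAP, and uses the usual x86 layout of the CPUID leaf 1 %eax signature
(stepping, model, family, extended model, extended family). The function
names decode_mcg_cap and decode_cpuid_eax are made up for illustration and
are not part of cpucontrol(8); you would pass them the values by hand.

```sh
#!/bin/sh
# Hypothetical helpers, not part of any FreeBSD tool.

# Decode the low 32 bits of MCG_CAP (MSR 0x179).
decode_mcg_cap() {
    lo=$(( $1 ))
    banks=$(( lo & 0xff ))          # bits 7:0: number of MC banks
    cmci_p=$(( (lo >> 10) & 1 ))    # bit 10: MCG_CMCI_P
    echo "banks=${banks} cmci_p=${cmci_p}"
}

# Decode the CPUID leaf 1 %eax signature into display family/model/stepping.
decode_cpuid_eax() {
    eax=$(( $1 ))
    stepping=$(( eax & 0xf ))
    model=$(( (eax >> 4) & 0xf ))
    family=$(( (eax >> 8) & 0xf ))
    ext_model=$(( (eax >> 16) & 0xf ))
    ext_family=$(( (eax >> 20) & 0xff ))
    # The extended model applies only when the base family is 6 or 15.
    if [ "$family" -eq 6 ] || [ "$family" -eq 15 ]; then
        model=$(( (ext_model << 4) + model ))
    fi
    # The extended family is added only when the base family is 15.
    if [ "$family" -eq 15 ]; then
        family=$(( family + ext_family ))
    fi
    echo "family=${family} model=${model} stepping=${stepping}"
}

# The two MCG_CAP values reported in this thread.
decode_mcg_cap 0x0f000c14
decode_mcg_cap 0x0f000814
```

Both values decode to the same bank count and differ only in bit 10, which
matches the observation quoted above. For the CPUID side, feed the %eax word
from "cpucontrol -i 1" for each CPU into decode_cpuid_eax and compare the
results across the two sockets.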