Date: Fri, 5 Feb 2021 19:21:13 +0200
From: Konstantin Belousov
To: Alan Somers
Cc: Mark Johnston, Matthew Macy, FreeBSD Stable ML
Subject: Re: Page fault in _mca_init during startup
List-Id: Production branch of FreeBSD source code

On Fri, Feb 05, 2021 at 09:01:26AM -0700, Alan Somers wrote:
> On Fri, Feb 5, 2021 at 7:41 AM Konstantin Belousov wrote:
>
> > On Thu, Feb 04, 2021 at 07:53:09PM -0700, Alan Somers wrote:
> > > On Thu, Feb 4, 2021 at 7:40 PM Konstantin Belousov wrote:
> > >
> > > > On Thu, Feb 04, 2021 at 07:01:30PM -0700, Alan Somers wrote:
> > > > > On Thu, Feb 4, 2021 at 5:59 PM Konstantin Belousov
> > > > > <kostikbel@gmail.com> wrote:
> > > > > > Do you have INVARIANTS enabled? If not, I am curious if enabling
> > > > > > them would convert that rare page fault into rare "CPU %d has
> > > > > > more MC banks" assert.
> > > > > >
> > > > > > Also might be the output of the
> > > > > > # for x in $(jot $(sysctl -n hw.ncpu) 0) ; do cpucontrol -m 0x179 /dev/cpuctl$x; done
> > > > > > command will show the issue (0x179 is the MCG_CAP MSR).
> > > > > > You need to load cpuctl(4) if it is not loaded yet.
> > > > >
> > > > > I don't have INVARIANTS enabled, and I can't enable it on the
> > > > > production servers.
> > > > > However, I can turn those three KASSERTs into VERIFYs and see
> > > > > what happens. Here is what your command shows on the server that
> > > > > panicked:
> > > > > $ for x in $(jot $(sysctl -n hw.ncpu) 0) ; do sudo cpucontrol -m 0x179 /dev/cpuctl$x; done | uniq -c
> > > > > 16 MSR 0x179: 0x00000000 0x0f000c14
> > > > > 16 MSR 0x179: 0x00000000 0x0f000814
> > > >
> > > > It probably explains it, but it would be more telling if you left the
> > > > output as is, so that we can see which CPUs have MCG_CMCI_P (10) bit
> > > > set.
> > >
> > > I didn't sort them, so the first 16 have bit 10 set and the second 16
> > > don't.
> > >
> > > > I suspect that your machine has two sockets, and processor in one
> > > > socket has CPUs reporting MCG_CMCI_P, while other processor does not.
> > > > Your SMP is not quite symmetric, perhaps processors were from
> > > > different bins?
> >
>
> I found 2 other servers that exhibit the same problem: the first 16 cores
> have bit 10 set and the second 16 don't. All 3 have dual Xeon Gold 6142
> CPUs and SuperMicro X11DPU motherboards with BIOS revision 5.12. I have
> other examples of X11DPU motherboards that don't exhibit the problem, but
> they all have both different CPUs and different BIOS revisions. So I can't
> be sure whether the bug follows the CPU model or the BIOS version.

I looked at the full spec update errata list for the first gen Skylake
Xeons, but did not notice anything relevant. The EDS doc does not provide
much useful info on MSR 0x179 bit 10 either, except rewording the SDM
definition.

In fact I am not sure, but this bit might be writeable by software. Try to
flip the bit with cpucontrol(8).

It might be a BIOS bug after all. If you have an Intel representative
contact, or a Supermicro contact, try to engage them. I do not have any
further ideas, since the spec update does not mention the problem.

> > >
> > > Could be. Is there some MSR that reports a more specific version number?
> > There are CPUID %eax=1 values returned in %eax, but then it requires
> > some interpretation.
> > # cpucontrol -i 1 /dev/cpuctl$x
> > for $x iterating over the cpus.
> >
>
> Apart from the Local APIC ID field, that returns the same value for all
> processors.
>
> Your second patch doesn't cause any obvious problems on my dev system.

I hope that you will confirm that the issue is solved by it, after some time.
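
For reference, both diagnostics discussed in this thread come down to
extracting bit fields from a hex value, which sh(1) arithmetic can do. The
following is only a minimal sketch under some assumptions: the helper names
mcg_cap_decode and cpuid_sig_decode are hypothetical (they are not part of
cpucontrol(8)), the second hex word printed by "cpucontrol -m" is taken to be
the low 32 bits of the MSR, and the CPUID signature 0x00050654 is an
illustrative value, not one reported in the thread.

```shell
#!/bin/sh
# Hypothetical helpers for illustration; not part of cpucontrol(8).

# Decode the low 32 bits of IA32_MCG_CAP (MSR 0x179).
mcg_cap_decode() {
    val=$(( $1 ))
    banks=$(( val & 0xff ))           # bits 7:0: number of MC banks
    cmci_p=$(( (val >> 10) & 1 ))     # bit 10: MCG_CMCI_P
    echo "banks=$banks cmci_p=$cmci_p"
}

# Decode the CPUID leaf 1 %eax signature into family/model/stepping.
cpuid_sig_decode() {
    eax=$(( $1 ))
    stepping=$(( eax & 0xf ))         # bits 3:0
    model=$(( (eax >> 4) & 0xf ))     # bits 7:4
    family=$(( (eax >> 8) & 0xf ))    # bits 11:8
    # The extended model field (bits 19:16) applies to families 6 and 15.
    if [ "$family" -eq 6 ] || [ "$family" -eq 15 ]; then
        model=$(( model | (((eax >> 16) & 0xf) << 4) ))
    fi
    # The extended family field (bits 27:20) applies to family 15 only.
    if [ "$family" -eq 15 ]; then
        family=$(( family + ((eax >> 20) & 0xff) ))
    fi
    echo "family=$family model=$model stepping=$stepping"
}

# The two distinct MCG_CAP values reported in this thread.
mcg_cap_decode 0x0f000c14
mcg_cap_decode 0x0f000814

# Assumed example signature, not taken from the thread.
cpuid_sig_decode 0x00050654
```

With the two values from the thread, the first call reports cmci_p=1 and the
second cmci_p=0, both with 20 banks, which matches the observation that only
the first 16 CPUs advertise MCG_CMCI_P.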