Date: Fri, 5 Feb 2021 04:40:03 +0200
From: Konstantin Belousov
To: Alan Somers
Cc: Mark Johnston, Matthew Macy, FreeBSD Stable ML
Subject: Re: Page fault in _mca_init during startup

On Thu, Feb 04, 2021 at 07:01:30PM -0700, Alan Somers wrote:
> On Thu, Feb 4, 2021 at 5:59 PM Konstantin Belousov wrote:
> > Do you have INVARIANTS enabled?  If not, I am curious if enabling them
> > would convert that rare page fault into rare "CPU %d has more MC banks"
> > assert.
> >
> > Also might be the output of the
> > # for x in $(jot $(sysctl -n hw.ncpu) 0) ; do cpucontrol -m 0x179 /dev/cpuctl$x; done
> > command will show the issue (0x179 is the MCG_CAP MSR).
> > You need to load cpuctl(4) if it is not loaded yet.
>
> I don't have INVARIANTS enabled, and I can't enable it on the production
> servers.  However, I can turn those three KASSERTs into VERIFYs and see
> what happens.  Here is what your command shows on the server that panicked:
> $ for x in $(jot $(sysctl -n hw.ncpu) 0) ; do sudo cpucontrol -m 0x179 /dev/cpuctl$x; done | uniq -c
>   16 MSR 0x179: 0x00000000 0x0f000c14
>   16 MSR 0x179: 0x00000000 0x0f000814

It probably explains it, but it would be more telling if you left the
output as is, so that we can see which CPUs have MCG_CMCI_P (10) bit set.

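To make the difference between the two reported values concrete, here is a
minimal C sketch that decodes them.  It assumes cpucontrol prints the high
32-bit word first and the low word second, takes bit 10 as MCG_CMCI_P as
stated in this thread, and takes bits 7:0 as the bank count.  The function
names are hypothetical and are not taken from sys/x86/x86/mca.c.

```c
#include <stdbool.h>
#include <stdint.h>

/* Bit 10 of MCG_CAP is MCG_CMCI_P, as stated in the thread. */
#define	MCG_CAP_CMCI_P		(1ULL << 10)
/* Assumption: bits 7:0 of MCG_CAP hold the number of MC banks. */
#define	MCG_CAP_COUNT_MASK	0xffULL

/* Combine the two 32-bit words printed by cpucontrol (high word first). */
uint64_t
msr_value(uint32_t high, uint32_t low)
{
	return (((uint64_t)high << 32) | low);
}

/* True if this MCG_CAP value advertises CMCI support. */
bool
mcg_cap_has_cmci(uint64_t mcg_cap)
{
	return ((mcg_cap & MCG_CAP_CMCI_P) != 0);
}

/* Number of machine check banks reported by this MCG_CAP value. */
unsigned
mcg_cap_bank_count(uint64_t mcg_cap)
{
	return ((unsigned)(mcg_cap & MCG_CAP_COUNT_MASK));
}
```

Applied to the output above, 0x0f000c14 has bit 10 set and 0x0f000814 does
not, while both values report the same bank count (0x14).
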
I suspect that your machine has two sockets, and the processor in one
socket has CPUs reporting MCG_CMCI_P, while the other processor does not.
Your SMP is not quite symmetric; perhaps the processors were from
different bins?

If the BSP is selected on the reporting socket, everything boots well.
If the other socket wins the BSP selection race, cmci is not initialized,
but when per-CPU mca_init() sees the CMCI_P bit, it calls cmci_setup()
without allocated cmc state, because the BSP did not need it.

If I am right, then unconditionally allocating the memory is probably
the only choice there.

commit 2e2c925ac3b626edc6492a57a80f6b87895801c2
Author: Konstantin Belousov
Date:   Fri Feb 5 04:32:05 2021 +0200

    x86 mca: unconditionally allocate memory for cmc state

diff --git a/sys/x86/x86/mca.c b/sys/x86/x86/mca.c
index 03100e77d455..dff3f7631f5c 100644
--- a/sys/x86/x86/mca.c
+++ b/sys/x86/x86/mca.c
@@ -1047,7 +1047,7 @@ mca_setup(uint64_t mcg_cap)
 	    "force_scan", CTLTYPE_INT | CTLFLAG_RW | CTLFLAG_MPSAFE, NULL, 0,
 	    sysctl_mca_scan, "I", "Force an immediate scan for machine checks");
 #ifdef DEV_APIC
-	if (cmci_supported(mcg_cap))
+	if (cpu_vendor_id == CPU_VENDOR_INTEL)
 		cmci_setup();
 	else if (amd_thresholding_supported())
 		amd_thresholding_setup();
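
The suspected failure mode can be modelled outside the kernel.  The sketch
below is a toy, not the mca.c code: every name in it is hypothetical, and
the always_allocate flag is a simplification of the patch, which keys the
allocation on the CPU vendor instead of on the BSP's MCG_CAP value.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

#define	MCG_CAP_CMCI_P	(1ULL << 10)

/* Hypothetical stand-in for the shared CMC state allocated at boot. */
struct toy_mca {
	int *cmc_state;		/* NULL until the BSP allocates it */
};

static int toy_storage[1];

/*
 * BSP-time setup.  With always_allocate false this mimics the suspected
 * old behaviour: allocate only if the BSP itself reports CMCI_P.
 */
void
toy_bsp_setup(struct toy_mca *m, uint64_t bsp_mcg_cap, bool always_allocate)
{
	m->cmc_state = NULL;
	if (always_allocate || (bsp_mcg_cap & MCG_CAP_CMCI_P) != 0)
		m->cmc_state = toy_storage;
}

/*
 * Per-CPU init.  Returns false in the case where a CPU reporting CMCI_P
 * would use state that was never allocated.
 */
bool
toy_cpu_init_ok(const struct toy_mca *m, uint64_t cpu_mcg_cap)
{
	if ((cpu_mcg_cap & MCG_CAP_CMCI_P) == 0)
		return (true);	/* this CPU never touches the CMC state */
	return (m->cmc_state != NULL);
}
```

With a BSP value of 0x0f000814 and always_allocate false, a CPU reporting
0x0f000c14 fails the check; with always_allocate true it passes regardless
of which socket supplied the BSP.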