Date: Fri, 5 Feb 2021 16:41:16 +0200
From: Konstantin Belousov
To: Alan Somers
Cc: Mark Johnston, Matthew Macy, FreeBSD Stable ML
Subject: Re: Page fault in _mca_init during startup

On Thu, Feb 04, 2021 at 07:53:09PM -0700, Alan Somers wrote:
> On Thu, Feb 4, 2021 at 7:40 PM Konstantin Belousov wrote:
>
> > On Thu, Feb 04, 2021 at 07:01:30PM -0700, Alan Somers wrote:
> > > On Thu, Feb 4, 2021 at 5:59 PM Konstantin Belousov wrote:
> > > > Do you have INVARIANTS enabled?  If not, I am curious if enabling them
> > > > would convert that rare page fault into rare "CPU %d has more MC banks"
> > > > assert.
> > > >
> > > > Also might be the output of the
> > > > # for x in $(jot $(sysctl -n hw.ncpu) 0) ; do cpucontrol -m 0x179 /dev/cpuctl$x; done
> > > > command will show the issue (0x179 is the MCG_CAP MSR).
> > > > You need to load cpuctl(4) if it is not loaded yet.
> > > >
> > >
> > > I don't have INVARIANTS enabled, and I can't enable it on the production
> > > servers.  However, I can turn those three KASSERTs into VERIFYs and see
> > > what happens.
> > > Here is what your command shows on the server that panicked:
> > > $ for x in $(jot $(sysctl -n hw.ncpu) 0) ; do sudo cpucontrol -m 0x179 /dev/cpuctl$x; done | uniq -c
> > >      16 MSR 0x179: 0x00000000 0x0f000c14
> > >      16 MSR 0x179: 0x00000000 0x0f000814
> >
> > It probably explains it, but it would be more telling if you left the
> > output as is, so that we can see which CPUs have MCG_CMCI_P (10) bit set.
> >
>
> I didn't sort them, so the first 16 have bit 10 set and the second 16
> don't.
>
> >
> > I suspect that your machine has two sockets, and processor in one socket
> > has CPUs reporting MCG_CMCI_P, while other processor does not.  Your SMP
> > is not quite symmetric, perhaps processors were from different bins?
> >
>
> Could be.  Is there some MSR that reports a more specific version number?

CPUID with %eax=1 returns the version values in %eax, but they require
some interpretation.  Run
# cpucontrol -i 1 /dev/cpuctl$x
with $x iterating over the CPUs.
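
To make the two readings above concrete, here is a minimal sh sketch that decodes the low 32 bits of MCG_CAP and the CPUID leaf 1 %eax value. The function names decode_mcg_cap and decode_cpuid_sig are made up for illustration, and the CPUID value 0x000306e4 is a hypothetical example, not a value taken from the server in this thread. The bit layout assumed is the usual one: bank count in bits 7:0 and MCG_CMCI_P in bit 10 of MCG_CAP; stepping, model, family, extended model and extended family fields in the leaf 1 %eax value.

```sh
#!/bin/sh
# Decode the low word of MCG_CAP (MSR 0x179) as printed by cpucontrol -m.
# Bits 7:0 are the number of MC banks, bit 10 is MCG_CMCI_P.
decode_mcg_cap() {
    val=$(($1))
    banks=$((val & 0xff))
    cmci_p=$(((val >> 10) & 1))
    echo "banks=$banks cmci_p=$cmci_p"
}

# Decode the CPUID leaf 1 %eax value into family/model/stepping.
# The extended family is added only when the base family is 15; the
# extended model is prepended only when the base family is 6 or 15.
decode_cpuid_sig() {
    eax=$(($1))
    stepping=$((eax & 0xf))
    base_model=$(((eax >> 4) & 0xf))
    base_family=$(((eax >> 8) & 0xf))
    ext_model=$(((eax >> 16) & 0xf))
    ext_family=$(((eax >> 20) & 0xff))

    family=$base_family
    model=$base_model
    if [ "$base_family" -eq 15 ]; then
        family=$((base_family + ext_family))
    fi
    if [ "$base_family" -eq 6 ] || [ "$base_family" -eq 15 ]; then
        model=$(((ext_model << 4) | base_model))
    fi
    printf 'family=%d model=0x%x stepping=%d\n' "$family" "$model" "$stepping"
}

# The two MCG_CAP values reported in the thread:
decode_mcg_cap 0x0f000c14    # prints banks=20 cmci_p=1
decode_mcg_cap 0x0f000814    # prints banks=20 cmci_p=0

# Hypothetical CPUID leaf 1 %eax value, for illustration only:
decode_cpuid_sig 0x000306e4  # prints family=6 model=0x3e stepping=4
```

Applied to the values quoted above, both groups of CPUs report the same bank count and differ only in bit 10. Running the second function on the %eax value from each CPU should show whether the two sockets differ in model or stepping.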