From owner-freebsd-stable@freebsd.org Sun Feb 7 21:33:25 2021
MIME-Version: 1.0
From: Alan Somers
Date: Sun, 7 Feb 2021 14:33:11 -0700
Subject: Re: Page fault in _mca_init during startup
To: Konstantin Belousov
Cc: Mark Johnston, Matthew Macy, FreeBSD Stable ML
Content-Type:
 text/plain; charset="UTF-8"
X-Content-Filtered-By: Mailman/MimeDel 2.1.34
X-BeenThere: freebsd-stable@freebsd.org
X-Mailman-Version: 2.1.34
Precedence: list
List-Id: Production branch of FreeBSD source code
X-List-Received-Date: Sun, 07 Feb 2021 21:33:25 -0000

On Fri, Feb 5, 2021 at 10:21 AM Konstantin Belousov wrote:

> On Fri, Feb 05, 2021 at 09:01:26AM -0700, Alan Somers wrote:
> > On Fri, Feb 5, 2021 at 7:41 AM Konstantin Belousov wrote:
> >
> > > On Thu, Feb 04, 2021 at 07:53:09PM -0700, Alan Somers wrote:
> > > > On Thu, Feb 4, 2021 at 7:40 PM Konstantin Belousov <kostikbel@gmail.com>
> > > > wrote:
> > > >
> > > > > On Thu, Feb 04, 2021 at 07:01:30PM -0700, Alan Somers wrote:
> > > > > > On Thu, Feb 4, 2021 at 5:59 PM Konstantin Belousov <kostikbel@gmail.com>
> > > > > > wrote:
> > > > > > > Do you have INVARIANTS enabled?  If not, I am curious if enabling them
> > > > > > > would convert that rare page fault into rare "CPU %d has more MC banks"
> > > > > > > assert.
> > > > > > >
> > > > > > > Also might be the output of the
> > > > > > > # for x in $(jot $(sysctl -n hw.ncpu) 0) ; do cpucontrol -m 0x179 /dev/cpuctl$x; done
> > > > > > > command will show the issue (0x179 is the MCG_CAP MSR).
> > > > > > > You need to load cpuctl(4) if it is not loaded yet.
> > > > > > >
> > > > > > I don't have INVARIANTS enabled, and I can't enable it on the production
> > > > > > servers.  However, I can turn those three KASSERTs into VERIFYs and see
> > > > > > what happens.
> > > > > > Here is what your command shows on the server that panicked:
> > > > > > $ for x in $(jot $(sysctl -n hw.ncpu) 0) ; do sudo cpucontrol -m 0x179 /dev/cpuctl$x; done | uniq -c
> > > > > >   16 MSR 0x179: 0x00000000 0x0f000c14
> > > > > >   16 MSR 0x179: 0x00000000 0x0f000814
> > > > >
> > > > > It probably explains it, but it would be more telling if you left the
> > > > > output as is, so that we can see which CPUs have MCG_CMCI_P (10) bit set.
> > > > >
> > > > I didn't sort them, so the first 16 have bit 10 set and the second 16
> > > > don't.
> > > >
> > > > > I suspect that your machine has two sockets, and processor in one socket
> > > > > has CPUs reporting MCG_CMCI_P, while other processor does not.  Your SMP
> > > > > is not quite symmetric, perhaps processors were from different bins?
> > > > >
> >
> > I found 2 other servers that exhibit the same problem: the first 16 cores
> > have bit 10 set and the second 16 don't.  All 3 have dual Xeon Gold 6142
> > CPUs and SuperMicro X11DPU motherboards with BIOS revision 5.12.  I have
> > other examples of X11DPU motherboards that don't exhibit the problem, but
> > they all have both different CPUs and different BIOS revisions.  So I can't
> > be sure whether the bug follows the CPU model or the BIOS version.
> I looked at the full spec update errata list for the first gen Skylake
> Xeons, but did not notice anything relevant.  EDS doc does not provide
> much useful info on the MSR 0x179 bit 10 either, except rewording SDM
> definition.
>
> In fact I am not sure but this bit might be writeable by software.  Try
> to flip the bit with cpucontrol(8).  Might be it is a BIOS bug after all.
>
> If you have Intel representative contact, or Supermicro contact, try to
> engage them.  I do not have any further ideas, since spec update does not
> mention the problem.
>
> > > > Could be.
> > > > Is there some MSR that reports a more specific version number?
> > > There are CPUID %eax=1 values returned in %eax, but then it requires
> > > some interpretation.
> > > # cpucontrol -i 1 /dev/cpuctl$x
> > > for $x iterating over the cpus.
> > >
> > Apart from the Local APIC ID field, that returns the same value for all
> > processors.
> >
> > Your second patch doesn't cause any obvious problems on my dev system.
> I hope that you would confirm that the issue is solved by it, after some
> time.
>

Upgrading the BIOS fixed the problem, by clearing the MCG_CMCI_P bit on all
processors.  I don't have strong opinions about whether we should commit
kib's patch too.  Kib, what do you think?
-Alan
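
For anyone comparing their own machines, here is a minimal sketch of how
the two MCG_CAP values quoted in this thread decode.  It is an illustration,
not part of kib's patch.  It assumes that cpucontrol -m prints the high
32 bits first and the low 32 bits second, and it uses the Intel SDM layout
of MSR 0x179 (bits 7:0 are the bank count, bit 10 is MCG_CMCI_P).  The
function names parse_msr_line and decode_mcg_cap are made up for this example.

```python
import re

# Field layout of IA32_MCG_CAP (MSR 0x179) as described in the Intel SDM.
MCG_CAP_COUNT_MASK = 0xFF   # bits 7:0, number of machine-check banks
MCG_CAP_CMCI_P = 1 << 10    # bit 10, CMCI reported as supported

# Assumed cpucontrol -m output shape: "MSR 0x179: 0x<high32> 0x<low32>".
# search() is used so a leading "uniq -c" count does not get in the way.
LINE_RE = re.compile(
    r"MSR 0x([0-9a-fA-F]+): 0x([0-9a-fA-F]{8}) 0x([0-9a-fA-F]{8})"
)


def parse_msr_line(line):
    """Return (msr_number, 64-bit value) parsed from one output line."""
    match = LINE_RE.search(line)
    if match is None:
        raise ValueError("not a cpucontrol MSR line: %r" % line)
    msr, high, low = (int(group, 16) for group in match.groups())
    return msr, (high << 32) | low


def decode_mcg_cap(value):
    """Extract the bank count and the MCG_CMCI_P flag from MCG_CAP."""
    return {
        "banks": value & MCG_CAP_COUNT_MASK,
        "cmci_p": bool(value & MCG_CAP_CMCI_P),
    }


if __name__ == "__main__":
    # The two distinct lines reported for the server that panicked.
    for line in (
        "16 MSR 0x179: 0x00000000 0x0f000c14",
        "16 MSR 0x179: 0x00000000 0x0f000814",
    ):
        msr, value = parse_msr_line(line)
        print(hex(msr), decode_mcg_cap(value))
```

Both values carry the same bank count and differ only in bit 10, which
matches the observation that the first 16 CPUs report MCG_CMCI_P and the
second 16 do not.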