From: Alan Somers
Date: Fri, 5 Feb 2021 09:01:26 -0700
Subject: Re: Page fault in _mca_init during startup
To: Konstantin Belousov
Cc: Mark Johnston, Matthew Macy, FreeBSD Stable ML

On Fri, Feb 5, 2021 at 7:41 AM Konstantin Belousov wrote:

> On Thu, Feb 04, 2021 at 07:53:09PM -0700, Alan Somers wrote:
> > On Thu, Feb 4, 2021 at 7:40 PM Konstantin Belousov
> > wrote:
> >
> > > On Thu, Feb 04, 2021 at 07:01:30PM -0700, Alan Somers wrote:
> > > > On Thu, Feb 4, 2021 at 5:59 PM Konstantin Belousov <kostikbel@gmail.com>
> > > > wrote:
> > > > > Do you have INVARIANTS enabled? If not, I am curious if enabling them
> > > > > would convert that rare page fault into rare "CPU %d has more MC banks"
> > > > > assert.
> > > > >
> > > > > Also might be the output of the
> > > > > # for x in $(jot $(sysctl -n hw.ncpu) 0) ; do cpucontrol -m 0x179 /dev/cpuctl$x; done
> > > > > command will show the issue (0x179 is the MCG_CAP MSR).
> > > > > You need to load cpuctl(4) if it is not loaded yet.
> > > > >
> > > >
> > > > I don't have INVARIANTS enabled, and I can't enable it on the production
> > > > servers. However, I can turn those three KASSERTs into VERIFYs and see
> > > > what happens. Here is what your command shows on the server that panicked:
> > > > $ for x in $(jot $(sysctl -n hw.ncpu) 0) ; do sudo cpucontrol -m 0x179 /dev/cpuctl$x; done | uniq -c
> > > > 16 MSR 0x179: 0x00000000 0x0f000c14
> > > > 16 MSR 0x179: 0x00000000 0x0f000814
> > >
> > > It probably explains it, but it would be more telling if you left the
> > > output as is, so that we can see which CPUs have MCG_CMCI_P (10) bit set.
> > >
> >
> > I didn't sort them, so the first 16 have bit 10 set and the second 16
> > don't.
> >
> > >
> > > I suspect that your machine has two sockets, and processor in one socket
> > > has CPUs reporting MCG_CMCI_P, while other processor does not. Your SMP
> > > is not quite symmetric, perhaps processors were from different bins?
>

I found 2 other servers that exhibit the same problem: the first 16 cores
have bit 10 set and the second 16 don't. All 3 have dual Xeon Gold 6142
CPUs and SuperMicro X11DPU motherboards with BIOS revision 5.12. I have
other examples of X11DPU motherboards that don't exhibit the problem, but
they all have both different CPUs and different BIOS revisions. So I can't
be sure whether the bug follows the CPU model or the BIOS version.

> > >
> >
> > Could be. Is there some MSR that reports a more specific version number?
> There are CPUID %eax=1 values returned in %eax, but then it requires
> some interpretation.
> # cpucontrol -i 1 /dev/cpuctl$x
> for $x iterating over the cpus.
>

Apart from the Local APIC ID field, that returns the same value for all
processors. Your second patch doesn't cause any obvious problems on my dev
system.
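
For anyone who wants to repeat the comparison by hand, here is a minimal sh sketch of the bit arithmetic involved. It assumes the field layout from the Intel SDM: in MCG_CAP, bits 7:0 are the bank count, bit 10 is MCG_CMCI_P, bit 11 is MCG_TES_P and bit 24 is MCG_SER_P; in CPUID leaf 1 %eax, the stepping, model, family, extended model and extended family sit at bits 3:0, 7:4, 11:8, 19:16 and 27:20. The function names are hypothetical, and 0x00050654 is only an illustrative signature, not output taken from these servers.

```sh
#!/bin/sh
# Hypothetical helpers; field positions are assumed from the Intel SDM.

# Decode the low 32 bits of MCG_CAP (MSR 0x179), i.e. the second value
# printed by "cpucontrol -m 0x179".
decode_mcg_cap() {
    val=$(( $1 ))
    count=$(( val & 0xff ))          # bits 7:0, number of MC banks
    cmci=$(( (val >> 10) & 1 ))      # bit 10, MCG_CMCI_P
    tes=$(( (val >> 11) & 1 ))       # bit 11, MCG_TES_P
    ser=$(( (val >> 24) & 1 ))       # bit 24, MCG_SER_P
    echo "banks=$count cmci_p=$cmci tes_p=$tes ser_p=$ser"
}

# Decode the CPUID leaf 1 %eax signature into family/model/stepping.
decode_cpuid_sig() {
    eax=$(( $1 ))
    stepping=$(( eax & 0xf ))
    model=$(( (eax >> 4) & 0xf ))
    family=$(( (eax >> 8) & 0xf ))
    ext_model=$(( (eax >> 16) & 0xf ))
    ext_family=$(( (eax >> 20) & 0xff ))
    # The extended model only applies when the base family is 6 or 15.
    if [ "$family" -eq 6 ] || [ "$family" -eq 15 ]; then
        model=$(( (ext_model << 4) | model ))
    fi
    # The extended family only applies when the base family is 15.
    if [ "$family" -eq 15 ]; then
        family=$(( family + ext_family ))
    fi
    printf 'family=0x%x model=0x%x stepping=%d\n' "$family" "$model" "$stepping"
}

# The two MCG_CAP values reported in this thread.
decode_mcg_cap 0x0f000c14    # banks=20 cmci_p=1 tes_p=1 ser_p=1
decode_mcg_cap 0x0f000814    # banks=20 cmci_p=0 tes_p=1 ser_p=1

# Illustrative signature only (assumption, not from these machines).
decode_cpuid_sig 0x00050654  # family=0x6 model=0x55 stepping=4
```

The two MCG_CAP values differ by exactly 0x400, which is bit 10; every other decoded field, including the bank count of 20, is identical.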