From: Alan Somers
Date: Thu, 4 Feb 2021 19:01:30 -0700
Subject: Re: Page fault in _mca_init during startup
To: Konstantin Belousov
Cc: Mark Johnston, Matthew Macy, FreeBSD Stable ML
List-Id: Production branch of FreeBSD source code

On Thu, Feb 4, 2021 at 5:59 PM Konstantin Belousov wrote:

> On Thu, Feb 04, 2021 at 05:19:43PM -0700, Alan Somers wrote:
> > On Thu, Feb 4, 2021 at 4:27 PM Mark Johnston wrote:
> >
> > > On Fri, Feb 05, 2021 at 12:58:34AM +0200, Konstantin Belousov wrote:
> > > > On Thu, Feb 04, 2021 at 01:34:13PM -0800, Matthew Macy wrote:
> > > > > On Thu, Feb 4, 2021 at 1:31 PM Alan Somers wrote:
> > > > > >
> > > > > > After upgrading a machine to FreeBSD, 12.2, it hit the following panic on
> > > > > > its first reboot. I suspect that a few other servers have hit this too,
> > > > > > but since it happens before swap is mounted there are no core dumps, and
> > > > > > they usually reboot immediately. The code in question hasn't changed since
> > > > > > 2018.
> > > > > > The panic happened in cmci_monitor at line 930. Does anybody have
> > > > > > any suggestions for how I could debug further? I can't readily reproduce
> > > > > > it, and I can't dump core, but I'd like to investigate it any way I can.
> > > > > > The server in question has dual Xeon Gold 6142 CPUs.
> > > > > >
> > > > Try this.
> > > >
> > > > I think that there is no other dependencies in the startup order, but
> > > > cannot know it for sure.
> > > >
> > > > commit 19584e3d3e9606d591fa30999b370ed758960e8c
> > > > Author: Konstantin Belousov
> > > > Date:   Fri Feb 5 00:56:09 2021 +0200
> > > >
> > > >     x86: init mca before APs are started
> > >
> > > APs only call mca_init() after they have been released by the BSP
> > > though, and that happens later in SI_SUB_SMP.
> > >
> > > > diff --git a/sys/x86/x86/mca.c b/sys/x86/x86/mca.c
> > > > index 03100e77d455..e2bf2673cf69 100644
> > > > --- a/sys/x86/x86/mca.c
> > > > +++ b/sys/x86/x86/mca.c
> > > > @@ -1371,7 +1371,7 @@ mca_init_bsp(void *arg __unused)
> > > >
> > > >         mca_init();
> > > >  }
> > > > -SYSINIT(mca_init_bsp, SI_SUB_CPU, SI_ORDER_ANY, mca_init_bsp, NULL);
> > > > +SYSINIT(mca_init_bsp, SI_SUB_CPU, SI_ORDER_SECOND, mca_init_bsp, NULL);
> > > >
> > > >  /* Called when a machine check exception fires. */
> > > >  void
> > >
> >
> > kib's patch causes a different problem, and this one is reproducible:
> >
> > Fatal trap 12: page fault while in kernel mode
> > cpuid = 0; apic id = 00
> > fault virtual address   = 0x18
> > fault code              = supervisor read data, page not present
> > instruction pointer     = 0x20:0xffffffff8125762c
> > stack pointer           = 0x28:0xffffffff828dad90
> > frame pointer           = 0x28:0xffffffff828dad90
> > code segment            = base 0x0, limit 0xfffff, type 0x1b
> >                         = DPL 0, pres 1, long 1, def32 0, gran 1
> > processor eflags        = resume, IOPL = 0
> > current process         = 0 ()
> > trap number             = 12
> > panic: page fault
> > cpuid = 0
> > time = 1
> > KDB: stack backtrace:
> > db_trace_self_wrapper() at db_trace_self_wrapper+0x2b/frame 0xffffffff828daa50
> > vpanic() at vpanic+0x17b/frame 0xffffffff828daaa0
> > panic() at panic+0x43/frame 0xffffffff828dab00
> > trap_fatal() at trap_fatal+0x391/frame 0xffffffff828dab60
> > trap_pfault() at trap_pfault+0x4f/frame 0xffffffff828dabb0
> > trap() at trap+0x286/frame 0xffffffff828dacc0
> > calltrap() at calltrap+0x8/frame 0xffffffff828dacc0
> > --- trap 0xc, rip = 0xffffffff8125762c, rsp = 0xffffffff828dad90, rbp = 0xffffffff828dad90 ---
> > native_lapic_enable_cmc() at native_lapic_enable_cmc+0x1c/frame 0xffffffff828dad90
> > _mca_init() at _mca_init+0x94c/frame 0xffffffff828dadd0
> > mi_startup() at mi_startup+0xdf/frame 0xffffffff828dadf0
> > btext() at btext+0x2c
> > KDB: enter: panic
> > [ thread pid 0 tid 0 ]
> > Stopped at kdb_enter+0x37: movq $0,0x12bc396(%rip)
> >
> > If you're wondering, the panic happens at this point in
> > native_lapic_enable_cmc:
> >
> >         apic_id = PCPU_GET(apic_id);
> >         KASSERT(lapics[apic_id].la_present,
> >             ("%s: missing APIC %u", __func__, apic_id));
> >         lapics[apic_id].la_lvts[APIC_LVT_CMCI].lvt_masked = 0; <- panic here
> >         lapics[apic_id].la_lvts[APIC_LVT_CMCI].lvt_active = 1;
> >         if (bootverbose)
> >                 printf("lapic%u: CMCI unmasked\n", apic_id);
> > }
>
> Scratch this patch.
>
> Do you have INVARIANTS enabled? If not, I am curious if enabling them
> would convert that rare page fault into rare "CPU %d has more MC banks"
> assert.
>
> Also might be the output of the
> # for x in $(jot $(sysctl -n hw.ncpu) 0) ; do cpucontrol -m 0x179 /dev/cpuctl$x; done
> command will show the issue (0x179 is the MCG_CAP MSR).
> You need to load cpuctl(4) if it is not loaded yet.
>

I don't have INVARIANTS enabled, and I can't enable it on the production
servers. However, I can turn those three KASSERTs into VERIFYs and see
what happens. Here is what your command shows on the server that panicked:

$ for x in $(jot $(sysctl -n hw.ncpu) 0) ; do sudo cpucontrol -m 0x179 /dev/cpuctl$x; done | uniq -c
  16 MSR 0x179: 0x00000000 0x0f000c14
  16 MSR 0x179: 0x00000000 0x0f000814

-Alan
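
A note on reading the two MCG_CAP values above: in the Intel SDM layout
of MSR 0x179, bits 7:0 of the low word hold the machine-check bank count
and bit 10 is the CMCI-present flag. A minimal sh sketch of that
decoding follows. The function names mcg_bank_count and mcg_has_cmci are
made up for illustration and are not part of cpucontrol(8) or any other
FreeBSD tool.

```shell
#!/bin/sh
# Decode the low 32 bits of IA32_MCG_CAP (MSR 0x179) as printed by
# cpucontrol(8). Bit positions are taken from the Intel SDM layout;
# the helper names below are hypothetical.

mcg_bank_count() {
    # Bits 7:0: number of machine-check banks reported by this CPU.
    echo $(( $1 & 0xff ))
}

mcg_has_cmci() {
    # Bit 10 (MCG_CMCI_P): set when the CPU advertises CMCI support.
    if [ $(( ($1 >> 10) & 1 )) -eq 1 ]; then
        echo yes
    else
        echo no
    fi
}

# The two distinct values from the uniq -c output in this message.
for val in 0x0f000c14 0x0f000814; do
    printf '%s: banks=%d cmci=%s\n' \
        "$val" "$(mcg_bank_count "$val")" "$(mcg_has_cmci "$val")"
done
```

Run as written, this reports 20 banks for both values, with the CMCI
flag set in 0x0f000c14 and clear in 0x0f000814. Whether that difference
between the two groups of 16 CPUs explains the panic is not established
by the decoding alone.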