Date: Fri, 5 Feb 2021 01:16:35 +0200
From: Konstantin Belousov <kostikbel@gmail.com>
To: Alan Somers <asomers@freebsd.org>
Cc: Matthew Macy <mmacy@freebsd.org>, FreeBSD Stable ML <stable@freebsd.org>
Subject: Re: Page fault in _mca_init during startup
Message-ID: <YByAU8Fl998lKc2d@kib.kiev.ua>
In-Reply-To: <CAOtMX2jZH6DV%2B91uDpVnMzaunUA5e-ZtCg6CGDeV0mCwntd2rA@mail.gmail.com>
References: <CAOtMX2imwP3x-8LBKGFvMJ%2BjuD%2BsH_02yzs9XvMcCHY=jJs86A@mail.gmail.com> <CAPrugNofKuCZmdkb41j%2Bu%2BX0BPV-cK8WjgrBu7akuD=XezseMw@mail.gmail.com> <YBx8GmXvmLnwFYql@kib.kiev.ua> <CAOtMX2jZH6DV%2B91uDpVnMzaunUA5e-ZtCg6CGDeV0mCwntd2rA@mail.gmail.com>

On Thu, Feb 04, 2021 at 04:05:42PM -0700, Alan Somers wrote:
> On Thu, Feb 4, 2021 at 3:58 PM Konstantin Belousov <kostikbel@gmail.com> wrote:
>
> > On Thu, Feb 04, 2021 at 01:34:13PM -0800, Matthew Macy wrote:
> > > On Thu, Feb 4, 2021 at 1:31 PM Alan Somers <asomers@freebsd.org> wrote:
> > > >
> > > > After upgrading a machine to FreeBSD, 12.2, it hit the following panic on
> > > > its first reboot. I suspect that a few other servers have hit this too,
> > > > but since it happens before swap is mounted there are no core dumps, and
> > > > they usually reboot immediately. The code in question hasn't changed since
> > > > 2018. The panic happened in cmci_monitor at line 930. Does anybody have
> > > > any suggestions for how I could debug further? I can't readily reproduce
> > > > it, and I can't dump core, but I'd like to investigate it any way I can.
> > > > The server in question has dual Xeon Gold 6142 CPUs.
> > > >
> > >
> > > I can't actually help :( but I can add a +1 with similar hardware or
> > > equivalent specs. It's not frequent, but it's often enough to be
> > > annoying.
> > > -M
> > >
> > > > if (!(ctl & MC_CTL2_CMCI_EN))
> > > >         /* This bank does not support CMCI. */
> > > >         return;
> > > >
> > > > cc = &cmc_state[PCPU_GET(cpuid)][i]; // <- panic here
> > > >
> > > > /* Determine maximum threshold. */
> > > >
> > > >
> > > > Fatal trap 12: page fault while in kernel mode
> > > > cpuid = 26; apic id = 34
> > > > fault virtual address = 0xd0
> > > > fault code = supervisor read data, page not present
> > > > instruction pointer = 0x20:0xffffffff8125a009
> > > > stack pointer = 0x28:0xfffffe0000b65f20
> > > > frame pointer = 0x28:0xfffffe0000b65f50
> > > > code segment = base 0x0, limit 0xfffff, type 0x1b
> > > > = DPL 0, pres 1, long 1, def32 0, gran 1
> > > > processor eflags = resume, IOPL = 0
> > > > current process = 11 (idle: cpu26)
> > > > trap number = 12
> > > > panic: page fault
> > > > cpuid = 26
> > > > time = 1
> > > > KDB: stack backtrace:
> > > > db_trace_self_wrapper() at db_trace_self_wrapper+0x2b/frame 0xfffffe0000b65be0
> > > > vpanic() at vpanic+0x17b/frame 0xfffffe0000b65c30
> > > > panic() at panic+0x43/frame 0xfffffe0000b65c90
> > > > trap_fatal() at trap_fatal+0x391/frame 0xfffffe0000b65cf0
> > > > trap_pfault() at trap_pfault+0x4f/frame 0xfffffe0000b65d40
> > > > trap() at trap+0x286/frame 0xfffffe0000b65e50
> > > > calltrap() at calltrap+0x8/frame 0xfffffe0000b65e50
> > > > --- trap 0xc, rip = 0xffffffff8125a009, rsp = 0xfffffe0000b65f20, rbp = 0xfffffe0000b65f50 ---
> > > > _mca_init() at _mca_init+0x5d9/frame 0xfffffe0000b65f50
> > > > init_secondary_tail() at init_secondary_tail+0xfd/frame 0xfffffe0000b65f80
> > > > init_secondary() at init_secondary+0x2d1/frame 0xfffffe0000b65ff0
> > > > KDB: enter: panic
> > > > [ thread pid 11 tid 100029 ]
> > > > Stopped at kdb_enter+0x37: movq $0,0x12bc1f6(%rip)
> >
> > Try this.
> >
> > I think that there is no other dependencies in the startup order, but
> > cannot know it for sure.
> >
> > commit 19584e3d3e9606d591fa30999b370ed758960e8c
> > Author: Konstantin Belousov <kib@FreeBSD.org>
> > Date: Fri Feb 5 00:56:09 2021 +0200
> >
> >     x86: init mca before APs are started
> >
> > diff --git a/sys/x86/x86/mca.c b/sys/x86/x86/mca.c
> > index 03100e77d455..e2bf2673cf69 100644
> > --- a/sys/x86/x86/mca.c
> > +++ b/sys/x86/x86/mca.c
> > @@ -1371,7 +1371,7 @@ mca_init_bsp(void *arg __unused)
> >
> >         mca_init();
> >  }
> > -SYSINIT(mca_init_bsp, SI_SUB_CPU, SI_ORDER_ANY, mca_init_bsp, NULL);
> > +SYSINIT(mca_init_bsp, SI_SUB_CPU, SI_ORDER_SECOND, mca_init_bsp, NULL);
> >
> >  /* Called when a machine check exception fires. */
> >  void
> >
>
> I can test this patch on development servers, but so far I've only seen the
> crash on production servers. Do you have any suggestions for how to force
> the crash, or how to test this patch besides simply making sure that my dev
> servers can boot?

The race, as I see it, is that we call mca_init() on the BSP too late, so
the malloc() that provides the storage for the cmc_state array could happen
too late, after one of the APs had already been IPIed for startup. The patch
ensures that the mca_init_bsp() SYSINIT has finished before we go on to
start the APs.

I do not think there is any reliable way to trigger the panic while keeping
the patch usable, other than to observe enough successful boots.
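
The ordering argument can be illustrated with a small userland sketch. It is
not kernel code: the ORDER_* constants, the "start_aps" entry and the helper
bsp_mca_runs_before_ap_start() are hypothetical names invented for this
example, and the assumption that the APs are started by an entry ordered
third within the same subsystem is mine, not something stated in this
thread. The sketch only shows that sorting entries by their order value puts
mca_init_bsp ahead of the AP start when it uses the "second" slot, and
behind it when it uses the "any" (last) slot.

```c
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

/* Illustrative stand-ins for SI_ORDER_* (assumed values, not kernel headers). */
enum { ORDER_SECOND = 0x1, ORDER_THIRD = 0x2, ORDER_ANY = 0xfffffff };

/* One startup entry within a single subsystem, as a simplified model. */
struct sysinit_entry {
	const char *name;
	unsigned order;
};

/* Sort entries by ascending order value: lower values run first. */
static int
cmp_order(const void *a, const void *b)
{
	const struct sysinit_entry *x = a;
	const struct sysinit_entry *y = b;

	return ((x->order > y->order) - (x->order < y->order));
}

/*
 * Return 1 if the modelled mca_init_bsp entry runs before the modelled
 * AP start entry for the given order value, 0 otherwise.
 */
static int
bsp_mca_runs_before_ap_start(unsigned mca_order)
{
	struct sysinit_entry tbl[] = {
		{ "start_aps", ORDER_THIRD },	/* assumed position of AP startup */
		{ "mca_init_bsp", mca_order },
	};

	qsort(tbl, sizeof(tbl) / sizeof(tbl[0]), sizeof(tbl[0]), cmp_order);
	return (strcmp(tbl[0].name, "mca_init_bsp") == 0);
}

#ifdef SKETCH_STANDALONE
/* Optional demo: build with -DSKETCH_STANDALONE to print both cases. */
int
main(void)
{
	printf("ORDER_ANY:    mca first? %d\n",
	    bsp_mca_runs_before_ap_start(ORDER_ANY));
	printf("ORDER_SECOND: mca first? %d\n",
	    bsp_mca_runs_before_ap_start(ORDER_SECOND));
	return (0);
}
#endif
```

In this model the pre-patch ordering leaves a window in which an AP can reach
its own MCA setup before the BSP has allocated cmc_state, which would be
consistent with the near-NULL fault address in the trace above; the patched
ordering closes that window by construction rather than by timing.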