Date:      Fri, 5 Feb 2021 02:59:16 +0200
From:      Konstantin Belousov <kostikbel@gmail.com>
To:        Alan Somers <asomers@freebsd.org>
Cc:        Mark Johnston <markj@freebsd.org>, Matthew Macy <mmacy@freebsd.org>, FreeBSD Stable ML <stable@freebsd.org>
Subject:   Re: Page fault in _mca_init during startup
Message-ID:  <YByYZDEbGlSsgcwv@kib.kiev.ua>
In-Reply-To: <CAOtMX2gzaSgL1SosoTYaVqWYVHALpnFSpDQQu1w%2BBEwkO_g=AQ@mail.gmail.com>
References:  <CAOtMX2imwP3x-8LBKGFvMJ%2BjuD%2BsH_02yzs9XvMcCHY=jJs86A@mail.gmail.com> <CAPrugNofKuCZmdkb41j%2Bu%2BX0BPV-cK8WjgrBu7akuD=XezseMw@mail.gmail.com> <YBx8GmXvmLnwFYql@kib.kiev.ua> <YByC1ZXP5sNE6aHj@raichu> <CAOtMX2gzaSgL1SosoTYaVqWYVHALpnFSpDQQu1w%2BBEwkO_g=AQ@mail.gmail.com>

On Thu, Feb 04, 2021 at 05:19:43PM -0700, Alan Somers wrote:
> On Thu, Feb 4, 2021 at 4:27 PM Mark Johnston <markj@freebsd.org> wrote:
> 
> > On Fri, Feb 05, 2021 at 12:58:34AM +0200, Konstantin Belousov wrote:
> > > On Thu, Feb 04, 2021 at 01:34:13PM -0800, Matthew Macy wrote:
> > > > > On Thu, Feb 4, 2021 at 1:31 PM Alan Somers <asomers@freebsd.org> wrote:
> > > > > >
> > > > > > After upgrading a machine to FreeBSD 12.2, it hit the following panic on
> > > > > > its first reboot.  I suspect that a few other servers have hit this too,
> > > > > > but since it happens before swap is mounted there are no core dumps, and
> > > > > > they usually reboot immediately.  The code in question hasn't changed since
> > > > > > 2018.  The panic happened in cmci_monitor at line 930.  Does anybody have
> > > > > > any suggestions for how I could debug further?  I can't readily reproduce
> > > > > > it, and I can't dump core, but I'd like to investigate it any way I can.
> > > > > The server in question has dual Xeon Gold 6142 CPUs.
> > > > >
> > > Try this.
> > >
> > > > I think that there are no other dependencies in the startup order, but
> > > > I cannot know it for sure.
> > >
> > > commit 19584e3d3e9606d591fa30999b370ed758960e8c
> > > Author: Konstantin Belousov <kib@FreeBSD.org>
> > > Date:   Fri Feb 5 00:56:09 2021 +0200
> > >
> > >     x86: init mca before APs are started
> >
> > APs only call mca_init() after they have been released by the BSP
> > though, and that happens later in SI_SUB_SMP.
> >
> > > diff --git a/sys/x86/x86/mca.c b/sys/x86/x86/mca.c
> > > index 03100e77d455..e2bf2673cf69 100644
> > > --- a/sys/x86/x86/mca.c
> > > +++ b/sys/x86/x86/mca.c
> > > @@ -1371,7 +1371,7 @@ mca_init_bsp(void *arg __unused)
> > >
> > >       mca_init();
> > >  }
> > > -SYSINIT(mca_init_bsp, SI_SUB_CPU, SI_ORDER_ANY, mca_init_bsp, NULL);
> > > +SYSINIT(mca_init_bsp, SI_SUB_CPU, SI_ORDER_SECOND, mca_init_bsp, NULL);
> > >
> > >  /* Called when a machine check exception fires. */
> > >  void
> >
> 
> kib's patch causes a different problem, and this one is reproducible:
> 
>  Fatal trap 12: page fault while in kernel mode
> cpuid = 0; apic id = 00
> fault virtual address = 0x18
> fault code = supervisor read data, page not present
> instruction pointer = 0x20:0xffffffff8125762c
> stack pointer        = 0x28:0xffffffff828dad90
> frame pointer        = 0x28:0xffffffff828dad90
> code segment = base 0x0, limit 0xfffff, type 0x1b
> = DPL 0, pres 1, long 1, def32 0, gran 1
> processor eflags = resume, IOPL = 0
> current process = 0 ()
> trap number = 12
> panic: page fault
> cpuid = 0
> time = 1
> KDB: stack backtrace:
> db_trace_self_wrapper() at db_trace_self_wrapper+0x2b/frame 0xffffffff828daa50
> vpanic() at vpanic+0x17b/frame 0xffffffff828daaa0
> panic() at panic+0x43/frame 0xffffffff828dab00
> trap_fatal() at trap_fatal+0x391/frame 0xffffffff828dab60
> trap_pfault() at trap_pfault+0x4f/frame 0xffffffff828dabb0
> trap() at trap+0x286/frame 0xffffffff828dacc0
> calltrap() at calltrap+0x8/frame 0xffffffff828dacc0
> --- trap 0xc, rip = 0xffffffff8125762c, rsp = 0xffffffff828dad90, rbp = 0xffffffff828dad90 ---
> native_lapic_enable_cmc() at native_lapic_enable_cmc+0x1c/frame 0xffffffff828dad90
> _mca_init() at _mca_init+0x94c/frame 0xffffffff828dadd0
> mi_startup() at mi_startup+0xdf/frame 0xffffffff828dadf0
> btext() at btext+0x2c
> KDB: enter: panic
> [ thread pid 0 tid 0 ]
> Stopped at      kdb_enter+0x37: movq    $0,0x12bc396(%rip)
> 
> If you're wondering, the panic happens at this point in
> native_lapic_enable_cmc:
> 
> apic_id = PCPU_GET(apic_id);
> KASSERT(lapics[apic_id].la_present,
>    ("%s: missing APIC %u", __func__, apic_id));
> lapics[apic_id].la_lvts[APIC_LVT_CMCI].lvt_masked = 0;    <- panic here
> lapics[apic_id].la_lvts[APIC_LVT_CMCI].lvt_active = 1;
> if (bootverbose)
>    printf("lapic%u: CMCI unmasked\n", apic_id);
> }

Scratch this patch.

Do you have INVARIANTS enabled?  If not, I am curious whether enabling it
would turn that rare page fault into a rare "CPU %d has more MC banks"
assertion failure.

Also, the output of the following command might show the issue
(0x179 is the MCG_CAP MSR):
# for x in $(jot $(sysctl -n hw.ncpu) 0) ; do cpucontrol -m 0x179 /dev/cpuctl$x; done
You need to load cpuctl(4) first if it is not loaded yet.
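
To read the result: in MCG_CAP, bits 7:0 give the number of MC banks the
CPU reports and bit 10 says whether CMCI is supported, so a CPU whose
bank count differs from the others should stand out.  A minimal sh sketch
for decoding one value follows.  mcg_cap_decode is a made-up helper name,
not an existing tool, and it expects the low 32-bit word of the MSR as its
argument.

```shell
# Hypothetical helper (not part of cpucontrol): decode the low 32 bits
# of MCG_CAP (MSR 0x179), given as a hex or decimal number.
mcg_cap_decode() {
    lo=$(( $1 ))
    # Bits 7:0 hold the MC bank count; bit 10 is the CMCI-present flag.
    echo "banks=$(( lo & 0xff )) cmci=$(( (lo >> 10) & 1 ))"
}
```

For example, "mcg_cap_decode 0x0f000c14" prints "banks=20 cmci=1" (that
value is invented for illustration).  I believe cpucontrol -m prints the
high and low 32-bit words after the MSR number, in which case the last
field of each output line is the one to pass in, but check that against
the actual output.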


