From owner-freebsd-current@freebsd.org Tue Aug 14 22:57:50 2018
Date: Wed, 15 Aug 2018 00:51:06 +0200
From: Michael Gmelin
To: Konstantin Belousov
Cc: Michael Gmelin, "freebsd-current@freebsd.org", Matthias Apitz, jhb@freebsd.org
Subject: Re: Fatal trap 12: page fault on Acer Chromebook 720 (peppy)
Message-ID: <20180815005106.69402d23@bsd64.grem.de>
In-Reply-To: <20180606010625.62632920@bsd64.grem.de>
References: <20180603144840.44bfea41@bsd64.grem.de>
 <20180603132110.GP3789@kib.kiev.ua>
 <20180603165500.361ec894@bsd64.grem.de>
 <20180603150423.GQ3789@kib.kiev.ua>
 <20180603215020.452a81d8@bsd64.grem.de>
 <20180603205340.GS3789@kib.kiev.ua>
 <20180604004632.56ca6afa@bsd64.grem.de>
 <20180604110654.GA2450@kib.kiev.ua>
 <20180604231756.2ed2adb9@bsd64.grem.de>
 <20180605131135.GH2450@kib.kiev.ua>
 <20180606010625.62632920@bsd64.grem.de>
MIME-Version: 1.0
Content-Type: text/plain; charset=US-ASCII
Content-Transfer-Encoding: 7bit
List-Id: Discussions about the use of FreeBSD-current

On Wed, 6 Jun 2018 01:06:25 +0200
Michael Gmelin wrote:

> On Tue, 5 Jun 2018 16:11:35 +0300
> Konstantin Belousov wrote:
>
> > On Mon, Jun 04, 2018 at 11:17:56PM +0200, Michael Gmelin wrote:
> > >
> > >
> > > On Mon, 4 Jun 2018 14:06:55 +0300
> > > Konstantin Belousov wrote:
> > >
> > > > On Mon, Jun 04, 2018 at 12:46:32AM +0200, Michael Gmelin
> > > > wrote:
> > > > > > [...]
> > > > > > > > > This machine comes with it by default (my model was
> > > > > > > > > delivered with SeaBIOS 20131018_145217-build121-m2).
> > > > > > > > > So I didn't flash anything (didn't feel like bricking
> > > > > > > > > it).
> > > > > > > > > >
> > > > > > > > > > > kernel trap 12 with interrupts disabled
> > > > > > > > > > >
> > > > > > > > > > > Fatal trap 12: page fault while in kernel mode
> > > > > > > > > > > cpuid = 0; apic id = 00
> > > > > > > > > > > fault virtual address = 0xfffff80001000000
> > > > > > > > > > > fault code            = supervisor write data, protection violation
> > > > > > > > > > > instruction pointer   = 0x20:0xffffffff8102955f
> > > > > > > > > > > stack pointer         = 0x28:0xffffffff82a79be0
> > > > > > > > > > > frame pointer         = 0x28:0xffffffff82a79c10
> > > > > > > > > > > code segment          = base 0x0, limit 0xfffff, type 0x1b
> > > > > > > > > > >                       = DPL 0, pres 1, long 1, def32 0, gran 1
> > > > > > > > > > > processor eflags      = resume, IOPL = 0
> > > > > > > > > > > current process       = 0 ()
> > > > > > > > > > > [ thread pid 0 tid 0 ]
> > > > > > > > > > > Stopped at native_start_all_aps+0x08f: movq %rax,(%rsi)
> > > > > > > > > > Look up the source line number for this address.
> > > > > > > > > >
> > > > > > > > > >
> > > > > > > > > I guess that's sys/amd64/amd64/support.S line 854 (in
> > > > > > > > > rdmsr), called by native_start_all_aps. Any additional
> > > > > > > > > hints how I can track it down?
> > > > > > > > Why did you decide that this is rdmsr_safe()? First,
> > > > > > > > native_start_all_aps() does not call rdmsr, second the
> > > > > > > > ddb report clearly indicates that the fault occurred
> > > > > > > > accessing DMAP in native_start_all_aps().
> > > > > > > >
> > > > > > > > Just look up the source line by the address
> > > > > > > > native_start_all_aps+0x08f.
> > > > > > >
> > > > > > > Okay, according to kgdb this should be here:
> > > > > > >
> > > > > > > https://svnweb.freebsd.org/base/head/sys/amd64/amd64/mp_machdep.c?revision=333368&view=markup#l369
> > > > > > >
> > > > > > > 364
> > > > > > > 365     /* Create the initial 1GB replicated page tables */
> > > > > > > 366     for (i = 0; i < 512; i++) {
> > > > > > > 367         /* Each slot of the level 4 pages points to the same level 3 page */
> > > > > > > 368         pt4[i] = (u_int64_t)(uintptr_t)(mptramp_pagetables + PAGE_SIZE);
> > > > > > > 369         pt4[i] |= PG_V | PG_RW | PG_U;
> > > > > > > 370
> > > > > > > 371         /* Each slot of the level 3 pages points to the same level 2 page */
> > > > > > > 372         pt3[i] = (u_int64_t)(uintptr_t)(mptramp_pagetables + (2 * PAGE_SIZE));
> > > > > > > 373         pt3[i] |= PG_V | PG_RW | PG_U;
> > > > > > > 374
> > > > > > > 375         /* The level 2 page slots are mapped with 2MB pages for 1GB. */
> > > > > > > 376         pt2[i] = i * (2 * 1024 * 1024);
> > > > > > > 377         pt2[i] |= PG_V | PG_RW | PG_PS | PG_U;
> > > > > > > 378     }
> > > > > > >
> > > > > > > -m
> > > > > > You have a fault on write due to the read-only mapping of the
> > > > > > portion of the direct map which maps the kernel text. It
> > > > > > is consistent with the faulting address. It is not clear
> > > > > > if it is something new on your machine, or if the kernel
> > > > > > text was silently corrupted before, since ro protection is
> > > > > > somewhat recent.
> > > > > >
> > > > > > It seems that mp_bootaddress() selected a bad place for
> > > > > > the bootstrap page tables. Even more, we do not include the
> > > > > > kernel text in the physmap[] array, so it is not clear
> > > > > > how it happened. This code was also changed recently.
> > > > > >
> > > > > > Can you add a print of the physmap[] array somewhere
> > > > > > before the panic, to see what the kernel's idea of the
> > > > > > available memory is?
> > > > > > It should already be done if you have a
> > > > > > serial console and set the debug.late_console tunable to
> > > > > > 0.
> > > > >
> > > > > This is a sad little machine without any kind of serial
> > > > > console.
> > > > >
> > > > > Physmap looks like this after calling getmemsize():
> > > > >
> > > > > [0]: 0x10000
> > > > > [1]: 0x30000
> > > > > [2]: 0x40000
> > > > > [3]: 0x9e000
> > > > > [4]: 0x100000
> > > > > [5]: 0xf00000
> > > > > [6]: 0x1003000
> > > > > [7]: 0x7bf7a000
> > > > >
> > > > > Physical memory chunks logged in cpu_startup are:
> > > > >
> > > > > 0x0000000000010000 - 0x000000000002ffff, 131072 bytes (32 pages)
> > > > > 0x0000000000040000 - 0x000000000009dfff, 385024 bytes (94 pages)
> > > > These two chunk reports are consistent with physmap[0-1,
> > > > 2-3].
> > > > > 0x0000000000100000 - 0x00000000001fffff, 1048576 bytes (256 pages)
> > > > > 0x0000000002c00000 - 0x0000000075467fff, 1921417216 bytes (469096 pages)
> > > > > 0x0000000100000000 - 0x00000001005e7fff, 6193152 bytes (1512 pages)
> > > > But these three look completely unrelated to the rest of the
> > > > physmap, perhaps except physmap[4]. We allocate boot pages
> > > > from the top of the last physmap chunk, but I am certain that we
> > > > do not consume that much memory for boot to make physmap[7] from
> > > > the last reported address.
> > > >
> > > > Are you sure that there are no typos in the values above?
> > >
> > > Double checked the numbers. I changed it a bit more,
> > > so that debug output appears all on one page. Please see here for
> > > the results:
> > >
> > > https://gist.github.com/grembo/cebb9f7e2a98c37a51bee1e508f7d890
> > Ok, I have a guess what is going on. Does the result of the quirks
> > end up as the hw.physmem tunable passed to the kernel? It seems that
> > there is a physmap[] element pointing outside the DMAP-mapped region.
> >
> > Perhaps print the dmap limit too, to see whether I am on the right
> > track.
>
> I didn't print the dmap limit yet, but I tested your patch:
>
> >
> > Try the following change. It lacks i386 bits.
> >
> > diff --git a/sys/amd64/amd64/machdep.c b/sys/amd64/amd64/machdep.c
> > index e5c69ed91fa..bd6bbf04006 100644
> > --- a/sys/amd64/amd64/machdep.c
> > +++ b/sys/amd64/amd64/machdep.c
> > @@ -1254,7 +1254,7 @@ getmemsize(caddr_t kmdp, u_int64_t first)
> >           * in real mode mode (e.g. SMP bare metal).
> >           */
> >          if (init_ops.mp_bootaddress)
> > -                init_ops.mp_bootaddress(physmap, &physmap_idx);
> > +                init_ops.mp_bootaddress(physmap, &physmap_idx, first);
> >          /*
> >           * Maxmem isn't the "maximum memory", it's one larger than the
> > diff --git a/sys/amd64/amd64/mp_machdep.c b/sys/amd64/amd64/mp_machdep.c
> > index 30146142087..292a6cefa91 100644
> > --- a/sys/amd64/amd64/mp_machdep.c
> > +++ b/sys/amd64/amd64/mp_machdep.c
> > @@ -103,7 +103,8 @@ static int start_ap(int apic_id);
> >   * Calculate usable address in base memory for AP trampoline code.
> >   */
> >  void
> > -mp_bootaddress(vm_paddr_t *physmap, unsigned int *physmap_idx)
> > +mp_bootaddress(vm_paddr_t *physmap, unsigned int *physmap_idx,
> > +    vm_paddr_t dmap_limit)
> >  {
> >          unsigned int i;
> >          bool allocated;
> > @@ -117,8 +118,9 @@ mp_bootaddress(vm_paddr_t *physmap, unsigned int *physmap_idx)
> >                   * store the initial page tables. Note that it needs to be
> >                   * aligned to a page boundary.
> >                   */
> > -                if (physmap[i] >= GiB(4) ||
> > -                    (physmap[i + 1] - round_page(physmap[i])) < (PAGE_SIZE * 3))
> > +                if (physmap[i] >= GiB(4) || physmap[i + 1] -
> > +                    round_page(physmap[i]) < PAGE_SIZE * 3 ||
> > +                    physmap[i + 1] - PAGE_SIZE * 3 > dmap_limit)
> >                          continue;
> >
> >                  allocated = true;
> > diff --git a/sys/amd64/include/smp.h b/sys/amd64/include/smp.h
> > index 2ecfe62cf9f..24f0580fe51 100644
> > --- a/sys/amd64/include/smp.h
> > +++ b/sys/amd64/include/smp.h
> > @@ -58,7 +58,7 @@ void invlpg_pcid_handler(void);
> >  void    invlrng_invpcid_handler(void);
> >  void    invlrng_pcid_handler(void);
> >  int     native_start_all_aps(void);
> > -void    mp_bootaddress(vm_paddr_t *, unsigned int *);
> > +void    mp_bootaddress(vm_paddr_t *, unsigned int *, vm_paddr_t);
> >  #endif /* !LOCORE */
> >  #endif /* SMP */
> > diff --git a/sys/x86/include/init.h b/sys/x86/include/init.h
> > index 880cabaa949..58bbe0a5fd6 100644
> > --- a/sys/x86/include/init.h
> > +++ b/sys/x86/include/init.h
> > @@ -41,7 +41,7 @@ struct init_ops {
> >          void    (*early_clock_source_init)(void);
> >          void    (*early_delay)(int);
> >          void    (*parse_memmap)(caddr_t, vm_paddr_t *, int *);
> > -        void    (*mp_bootaddress)(vm_paddr_t *, unsigned int *);
> > +        void    (*mp_bootaddress)(vm_paddr_t *, unsigned int *, vm_paddr_t);
> >          int     (*start_all_aps)(void);
> >          void    (*msi_init)(void);
> >  };
>
> With the patch I could boot without problems and the machine appears
> to be stable (ran some high load & memory intensive tests - by the
> way, the machine only has 2gb of ram [even though 4g are reported on
> boot - usable memory appears to be reported ok]).
>
> Thanks,
> Michael
>

Hi,

Reviving this old thread, since I just updated to r337818 and a similar
problem is happening again. (mp_)machdep.c have been touched since the
fix in r334799 (review https://reviews.freebsd.org/D15675), so maybe
this is related
(https://svnweb.freebsd.org/base?view=revision&revision=334799).
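For anyone skimming the thread, the chunk-selection check from the patch quoted above can be sketched in userland C roughly as follows. This is only an illustration under stated assumptions, not the committed kernel code: pick_bootstrap_pages(), round_up_page(), BOOT_PAGE_SIZE and GIB() are names made up for the example, the `end < start` guard is an addition of the sketch, and the dmap_limit values are arbitrary.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical stand-ins for the kernel's PAGE_SIZE and GiB() macros. */
#define BOOT_PAGE_SIZE 4096ULL
#define GIB(x) ((uint64_t)(x) << 30)

/* Round an address up to the next page boundary. */
static uint64_t
round_up_page(uint64_t addr)
{
	return ((addr + BOOT_PAGE_SIZE - 1) & ~(BOOT_PAGE_SIZE - 1));
}

/*
 * Hypothetical helper mirroring the check in the quoted patch.
 * physmap[] holds (start, end) pairs; last_idx is the index of the
 * start of the highest pair.  Walk the chunks from the top down and
 * carve three pages from the end of the first chunk that starts below
 * 4GB, has room for three pages, and whose carved pages do not lie
 * above dmap_limit.  On success the chunk is shrunk, the address of
 * the carved pages is stored in *pagetables, and true is returned.
 */
static bool
pick_bootstrap_pages(uint64_t *physmap, unsigned int last_idx,
    uint64_t dmap_limit, uint64_t *pagetables)
{
	unsigned int i;

	assert(last_idx % 2 == 0);
	/* i is unsigned, so stepping below zero wraps and ends the loop. */
	for (i = last_idx; i <= last_idx; i -= 2) {
		uint64_t start = round_up_page(physmap[i]);
		uint64_t end = physmap[i + 1];

		if (physmap[i] >= GIB(4) ||
		    end < start ||	/* guard added by this sketch */
		    end - start < BOOT_PAGE_SIZE * 3 ||
		    end - BOOT_PAGE_SIZE * 3 > dmap_limit)
			continue;

		*pagetables = end - BOOT_PAGE_SIZE * 3;
		physmap[i + 1] = *pagetables;
		return (true);
	}
	return (false);
}
```

Fed the physmap[] values from earlier in this thread and a made-up limit of 16MB, the sketch skips the top chunk (0x1003000 - 0x7bf7a000) and takes the three pages from the chunk ending at 0xf00000, i.e. at 0xefd000. With a limit of 4GB it takes them from the top chunk, at 0x7bf77000.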
Please see the screenshot of the panic below:

https://gist.github.com/grembo/78d0f2a100dd4f16775b85a118769658

This is me not digging any deeper, hoping that this is something
obvious. Please let me know if you need more input.

Thanks,
Michael

-- 
Michael Gmelin