Date: Wed, 10 Jul 2019 16:26:36 +0000
From: James Snow <snow@teardrop.org>
To: freebsd-stable@freebsd.org
Subject: Random panics in 11.0 and 12.0 on J1900
Message-ID: <20190710162636.GM5965@teardrop.org>
I have a set of J1900 hosts running 11.0-RELEASE-p1 that experience
seemingly random panics. The panics are all basically the same:

Fatal trap 12: page fault while in kernel mode
fault code = supervisor read data, page not present

Adding workloads to the hosts seems to increase panic frequency, but the
panics have also occurred on completely idle hosts. Similarly, uptime
when panicking has been as low as minutes, and as high as ~620 days.

For reasons, it has not been possible to extract a coredump from these
hosts, nor practical to run memtest on them or upgrade them to a newer
release. About 1% of our hosts are affected each day, so we've just been
living with the problem.

However, while testing 12.0 on the same hardware, I encountered the same
panic and was able to capture the core dump. (See below.)

All of my Google-fu on this panic has turned up threads suggesting the
problem is hardware, but there are two problems with that idea...

One, memtest has turned up no errors on the 12.0 host I witnessed the
panic on.

Two, a small number of systems on the same hardware are running
10.3-RELEASE, and have experienced no panics in their history. Panics
have only happened on 11s, and now 12.

kgdb output from the panic follows. (This particular host was in the
middle of rebooting when it panicked.)

Hoping someone here has some insight. My uninformed wild-ass guess is
something relating to spectre/meltdown fixes.

Thanks,

-Snow

root@j1900_12:~ # kgdb /boot/kernel/kernel /var/crash/vmcore.0
GNU gdb (GDB) 8.3 [GDB v8.3 for FreeBSD]
Copyright (C) 2019 Free Software Foundation, Inc.
License GPLv3+: GNU GPL version 3 or later <http://gnu.org/licenses/gpl.html>
This is free software: you are free to change and redistribute it.
There is NO WARRANTY, to the extent permitted by law.
Type "show copying" and "show warranty" for details.
This GDB was configured as "x86_64-portbld-freebsd12.0".
Type "show configuration" for configuration details.
For bug reporting instructions, please see:
<http://www.gnu.org/software/gdb/bugs/>.
Find the GDB manual and other documentation resources online at:
<http://www.gnu.org/software/gdb/documentation/>.

For help, type "help".
Type "apropos word" to search for commands related to "word"...
Reading symbols from /boot/kernel/kernel...
Reading symbols from /usr/lib/debug//boot/kernel/kernel.debug...

Unread portion of the kernel message buffer:
<118>.
<118>Terminated
<118>Jul 10 07:03:50 j1900_12 syslogd: last message repeated 9 times
<118>Jul 10 07:04:08 j1900_12 syslogd: exiting on signal 15
Waiting (max 60 seconds) for system process `vnlru' to stop... done
Waiting (max 60 seconds) for system process `syncer' to stop...
Syncing disks, vnodes remaining... 0 0 0 0 done
Waiting (max 60 seconds) for system thread `bufdaemon' to stop... done
Waiting (max 60 seconds) for system thread `bufspacedaemon-0' to stop... done
Waiting (max 60 seconds) for system thread `bufspacedaemon-1' to stop... done
Waiting (max 60 seconds) for system thread `bufspacedaemon-2' to stop... done
Waiting (max 60 seconds) for system thread `bufspacedaemon-3' to stop... done
All buffers synced.
Uptime: 23h22m43s
umass0: detached
ukbd0: detached
uhid0: detached
uhub3: detached
uhub2: detached

Fatal trap 12: page fault while in kernel mode
cpuid = 0; apic id = 00
fault virtual address   = 0x3201c450
fault code              = supervisor read data, page not present
instruction pointer     = 0x20:0xffffffff80b7ad1d
stack pointer           = 0x28:0xfffffe003f231820
frame pointer           = 0x28:0xfffffe003f231890
code segment            = base 0x0, limit 0xfffff, type 0x1b
                        = DPL 0, pres 1, long 1, def32 0, gran 1
processor eflags        = interrupt enabled, resume, IOPL = 0
current process         = 1 (init)
trap number             = 12
panic: page fault
cpuid = 0
time = 1562742255
KDB: stack backtrace:
#0 0xffffffff80be7977 at kdb_backtrace+0x67
#1 0xffffffff80b9b563 at vpanic+0x1a3
#2 0xffffffff80b9b3b3 at panic+0x43
#3 0xffffffff8107496f at trap_fatal+0x35f
#4 0xffffffff810749c9 at trap_pfault+0x49
#5 0xffffffff81073fee at trap+0x29e
#6 0xffffffff8104f1d5 at calltrap+0x8
#7 0xffffffff808a6029 at re_shutdown+0x99
#8 0xffffffff80bd878a at bus_generic_shutdown+0x5a
#9 0xffffffff80bd878a at bus_generic_shutdown+0x5a
#10 0xffffffff80bd878a at bus_generic_shutdown+0x5a
#11 0xffffffff80bd878a at bus_generic_shutdown+0x5a
#12 0xffffffff80bd878a at bus_generic_shutdown+0x5a
#13 0xffffffff80452a8d at acpi_shutdown+0xd
#14 0xffffffff80bd878a at bus_generic_shutdown+0x5a
#15 0xffffffff80bd878a at bus_generic_shutdown+0x5a
#16 0xffffffff80bdbb6e at root_bus_module_handler+0x11e
#17 0xffffffff80b7a86f at module_shutdown+0x6f
Uptime: 23h22m44s
Dumping 494 out of 7976 MB:..4%..13%..23%..33%..43%..52%..62%..72%..81%..91%

__curthread () at ./machine/pcpu.h:230
230     ./machine/pcpu.h: No such file or directory.
(kgdb) bt
#0  __curthread () at ./machine/pcpu.h:230
#1  doadump (textdump=<optimized out>) at /usr/src/sys/kern/kern_shutdown.c:366
#2  0xffffffff80b9b14b in kern_reboot (howto=260) at /usr/src/sys/kern/kern_shutdown.c:446
#3  0xffffffff80b9b5c3 in vpanic (fmt=<optimized out>, ap=0xfffffe003f231570) at /usr/src/sys/kern/kern_shutdown.c:872
#4  0xffffffff80b9b3b3 in panic (fmt=<unavailable>) at /usr/src/sys/kern/kern_shutdown.c:799
#5  0xffffffff8107496f in trap_fatal (frame=0xfffffe003f231760, eva=838976592) at /usr/src/sys/amd64/amd64/trap.c:929
#6  0xffffffff810749c9 in trap_pfault (frame=0xfffffe003f231760, usermode=0) at /usr/src/sys/amd64/amd64/trap.c:765
#7  0xffffffff81073fee in trap (frame=0xfffffe003f231760) at /usr/src/sys/amd64/amd64/trap.c:441
#8  <signal handler called>
#9  __mtx_lock_sleep (c=0xfffffe00493fa230, v=<optimized out>) at /usr/src/sys/kern/kern_mutex.c:565
#10 0xffffffff808a6029 in re_shutdown (dev=<optimized out>) at /usr/src/sys/dev/re/if_re.c:3772
#11 0xffffffff80bd878a in DEVICE_SHUTDOWN (dev=<optimized out>) at ./device_if.h:262
#12 device_shutdown (dev=0xfffff800037d9100) at /usr/src/sys/kern/subr_bus.c:3065
#13 bus_generic_shutdown (dev=<optimized out>) at /usr/src/sys/kern/subr_bus.c:3760
#14 0xffffffff80bd878a in DEVICE_SHUTDOWN (dev=<optimized out>) at ./device_if.h:262
#15 device_shutdown (dev=0xfffff800037d9200) at /usr/src/sys/kern/subr_bus.c:3065
#16 bus_generic_shutdown (dev=<optimized out>) at /usr/src/sys/kern/subr_bus.c:3760
#17 0xffffffff80bd878a in DEVICE_SHUTDOWN (dev=<optimized out>) at ./device_if.h:262
#18 device_shutdown (dev=0xfffff80003626900) at /usr/src/sys/kern/subr_bus.c:3065
#19 bus_generic_shutdown (dev=<optimized out>) at /usr/src/sys/kern/subr_bus.c:3760
#20 0xffffffff80bd878a in DEVICE_SHUTDOWN (dev=<optimized out>) at ./device_if.h:262
#21 device_shutdown (dev=0xfffff80003627400) at /usr/src/sys/kern/subr_bus.c:3065
#22 bus_generic_shutdown (dev=<optimized out>) at /usr/src/sys/kern/subr_bus.c:3760
#23 0xffffffff80bd878a in DEVICE_SHUTDOWN (dev=<optimized out>) at ./device_if.h:262
#24 device_shutdown (dev=0xfffff8000355f300) at /usr/src/sys/kern/subr_bus.c:3065
#25 bus_generic_shutdown (dev=<optimized out>) at /usr/src/sys/kern/subr_bus.c:3760
#26 0xffffffff80452a8d in acpi_shutdown (dev=0xfffffe00493fa230) at /usr/src/sys/dev/acpica/acpi.c:758
#27 0xffffffff80bd878a in DEVICE_SHUTDOWN (dev=<optimized out>) at ./device_if.h:262
#28 device_shutdown (dev=0xfffff80003560400) at /usr/src/sys/kern/subr_bus.c:3065
#29 bus_generic_shutdown (dev=<optimized out>) at /usr/src/sys/kern/subr_bus.c:3760
#30 0xffffffff80bd878a in DEVICE_SHUTDOWN (dev=<optimized out>) at ./device_if.h:262
#31 device_shutdown (dev=0xfffff8000334ea00) at /usr/src/sys/kern/subr_bus.c:3065
#32 bus_generic_shutdown (dev=<optimized out>) at /usr/src/sys/kern/subr_bus.c:3760
#33 0xffffffff80bdbb6e in DEVICE_SHUTDOWN (dev=0xfffff8000337dd00) at ./device_if.h:262
#34 device_shutdown (dev=0xfffff8000337dd00) at /usr/src/sys/kern/subr_bus.c:3065
#35 root_bus_module_handler (mod=<optimized out>, what=<optimized out>, arg=<optimized out>) at /usr/src/sys/kern/subr_bus.c:4951
#36 0xffffffff80b7a86f in module_shutdown (arg1=<optimized out>, arg2=<optimized out>) at /usr/src/sys/kern/kern_module.c:104
#37 0xffffffff80b9b1da in kern_reboot (howto=0) at /usr/src/sys/kern/kern_shutdown.c:449
#38 0xffffffff80b9acb1 in sys_reboot (td=0xfffff80003320580, uap=0xfffff80003320940) at /usr/src/sys/kern/kern_shutdown.c:280
#39 0xffffffff81075449 in syscallenter (td=<optimized out>) at /usr/src/sys/amd64/amd64/../../kern/subr_syscall.c:135
#40 amd64_syscall (td=0xfffff80003320580, traced=0) at /usr/src/sys/amd64/amd64/trap.c:1076
#41 <signal handler called>
#42 0x0000000000244e4a in ?? ()
Backtrace stopped: Cannot access memory at address 0x7fffffffe6e8
(kgdb) list *0xffffffff80b7ad1d
0xffffffff80b7ad1d is in __mtx_lock_sleep (/usr/src/sys/kern/kern_mutex.c:565).
560             /*
561              * If the owner is running on another CPU, spin until the
562              * owner stops running or the state of the lock changes.
563              */
564             owner = lv_mtx_owner(v);
565             if (TD_IS_RUNNING(owner)) {
566                     if (LOCK_LOG_TEST(&m->lock_object, 0))
567                             CTR3(KTR_LOCK,
568                                 "%s: spinning on %p held by %p",
569                                 __func__, m, owner);
(kgdb)
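
A few of the numbers in the output above can be cross-checked mechanically. The following is a minimal Python sketch, not part of the original report. The helper names are hypothetical, and the 48-bit canonical-address boundary used for the "upper half" test is an assumption about this amd64 machine. It checks that the decimal `eva` passed to `trap_fatal` is the same value as the hex "fault virtual address", converts the panic's `time =` field to UTC, and compares the faulting address with the kernel addresses seen in the backtrace.

```python
from datetime import datetime, timezone

# Values copied verbatim from the panic message and kgdb output above.
FAULT_VA_HEX = "0x3201c450"               # "fault virtual address"
EVA_DECIMAL = 838976592                   # eva= in trap_fatal (kgdb frame #5)
PANIC_EPOCH = 1562742255                  # "time =" in the panic message
INSTRUCTION_POINTER = 0xFFFFFFFF80B7AD1D  # "instruction pointer"

# Assumption: 48-bit canonical addressing, so the upper (kernel) half of
# the address space starts at this value.
UPPER_HALF_START = 0xFFFF800000000000


def same_address(hex_text: str, decimal_value: int) -> bool:
    """True if a hex string and a decimal integer name the same address."""
    return int(hex_text, 16) == decimal_value


def panic_time_utc(epoch_seconds: int) -> str:
    """Format a Unix timestamp as a UTC wall-clock string."""
    moment = datetime.fromtimestamp(epoch_seconds, tz=timezone.utc)
    return moment.strftime("%Y-%m-%d %H:%M:%S")


def in_upper_half(address: int) -> bool:
    """True if the address lies in the upper canonical half (assumed above)."""
    return address >= UPPER_HALF_START


if __name__ == "__main__":
    print("eva matches fault address:", same_address(FAULT_VA_HEX, EVA_DECIMAL))
    print("panic time (UTC):", panic_time_utc(PANIC_EPOCH))
    print("instruction pointer in upper half:", in_upper_half(INSTRUCTION_POINTER))
    print("fault address in upper half:", in_upper_half(int(FAULT_VA_HEX, 16)))
```

The timestamp comes out as 2019-07-10 07:04:15 UTC, a few seconds after the last syslogd line in the message buffer, if that log is in UTC. The faulting address is far below every kernel pointer in the backtrace. One possible reading is that `owner` at kern_mutex.c:565 was not a valid thread pointer when `TD_IS_RUNNING(owner)` dereferenced it. That is an interpretation of the output, not a diagnosis.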