Date: Wed, 10 Jul 2019 16:26:36 +0000
From: James Snow
To: freebsd-stable@freebsd.org
Subject: Random panics in 11.0 and 12.0 on J1900
Message-ID: <20190710162636.GM5965@teardrop.org>

I have a set of J1900 hosts running 11.0-RELEASE-p1 that experience
seemingly random panics. The panics are all basically the same:

    Fatal trap 12: page fault while in kernel mode
    fault code = supervisor read data, page not present

Adding workloads to the hosts seems to increase panic frequency, but the
panics have also occurred on completely idle hosts. Similarly, uptime
when panicking has been as low as minutes, and as high as ~620 days.

For reasons, it has not been possible to extract a coredump from these
hosts, nor practical to run memtest on them or upgrade them to a newer
release. About 1% of our hosts are affected each day, so we've just been
living with the problem.

However, while testing 12.0 on the same hardware, I encountered the same
panic and was able to capture the core dump. (See below.)

All of my Google-fu on this panic has turned up threads suggesting the
problem is hardware, but there are two problems with that idea...

One, memtest has turned up no errors on the 12.0 host I witnessed the
panic on.

Two, a small number of systems on the same hardware are running
10.3-RELEASE, and have experienced no panics in their history. Panics
have only happened on 11s, and now 12.

kgdb output from the panic follows. (This particular host was in the
middle of rebooting when it panicked.)

Hoping someone here has some insight. My uninformed wild-ass guess is
something relating to Spectre/Meltdown fixes.

Thanks,

-Snow

root@j1900_12:~ # kgdb /boot/kernel/kernel /var/crash/vmcore.0
GNU gdb (GDB) 8.3 [GDB v8.3 for FreeBSD]
Copyright (C) 2019 Free Software Foundation, Inc.
License GPLv3+: GNU GPL version 3 or later
This is free software: you are free to change and redistribute it.
There is NO WARRANTY, to the extent permitted by law.
Type "show copying" and "show warranty" for details.
This GDB was configured as "x86_64-portbld-freebsd12.0".
Type "show configuration" for configuration details.
For bug reporting instructions, please see: .
Find the GDB manual and other documentation resources online at: .

For help, type "help".
Type "apropos word" to search for commands related to "word"...
Reading symbols from /boot/kernel/kernel...
Reading symbols from /usr/lib/debug//boot/kernel/kernel.debug...

Unread portion of the kernel message buffer:
<118>.
<118>Terminated
<118>Jul 10 07:03:50 j1900_12 syslogd: last message repeated 9 times
<118>Jul 10 07:04:08 j1900_12 syslogd: exiting on signal 15
Waiting (max 60 seconds) for system process `vnlru' to stop... done
Waiting (max 60 seconds) for system process `syncer' to stop...
Syncing disks, vnodes remaining... 0 0 0 0 done
Waiting (max 60 seconds) for system thread `bufdaemon' to stop... done
Waiting (max 60 seconds) for system thread `bufspacedaemon-0' to stop... done
Waiting (max 60 seconds) for system thread `bufspacedaemon-1' to stop... done
Waiting (max 60 seconds) for system thread `bufspacedaemon-2' to stop... done
Waiting (max 60 seconds) for system thread `bufspacedaemon-3' to stop... done
All buffers synced.
Uptime: 23h22m43s
umass0: detached
ukbd0: detached
uhid0: detached
uhub3: detached
uhub2: detached

Fatal trap 12: page fault while in kernel mode
cpuid = 0; apic id = 00
fault virtual address   = 0x3201c450
fault code              = supervisor read data, page not present
instruction pointer     = 0x20:0xffffffff80b7ad1d
stack pointer           = 0x28:0xfffffe003f231820
frame pointer           = 0x28:0xfffffe003f231890
code segment            = base 0x0, limit 0xfffff, type 0x1b
                        = DPL 0, pres 1, long 1, def32 0, gran 1
processor eflags        = interrupt enabled, resume, IOPL = 0
current process         = 1 (init)
trap number             = 12
panic: page fault
cpuid = 0
time = 1562742255
KDB: stack backtrace:
#0 0xffffffff80be7977 at kdb_backtrace+0x67
#1 0xffffffff80b9b563 at vpanic+0x1a3
#2 0xffffffff80b9b3b3 at panic+0x43
#3 0xffffffff8107496f at trap_fatal+0x35f
#4 0xffffffff810749c9 at trap_pfault+0x49
#5 0xffffffff81073fee at trap+0x29e
#6 0xffffffff8104f1d5 at calltrap+0x8
#7 0xffffffff808a6029 at re_shutdown+0x99
#8 0xffffffff80bd878a at bus_generic_shutdown+0x5a
#9 0xffffffff80bd878a at bus_generic_shutdown+0x5a
#10 0xffffffff80bd878a at bus_generic_shutdown+0x5a
#11 0xffffffff80bd878a at bus_generic_shutdown+0x5a
#12 0xffffffff80bd878a at bus_generic_shutdown+0x5a
#13 0xffffffff80452a8d at acpi_shutdown+0xd
#14 0xffffffff80bd878a at bus_generic_shutdown+0x5a
#15 0xffffffff80bd878a at bus_generic_shutdown+0x5a
#16 0xffffffff80bdbb6e at root_bus_module_handler+0x11e
#17 0xffffffff80b7a86f at module_shutdown+0x6f
Uptime: 23h22m44s
Dumping 494 out of 7976 MB:..4%..13%..23%..33%..43%..52%..62%..72%..81%..91%

__curthread () at ./machine/pcpu.h:230
230 ./machine/pcpu.h: No such file or directory.
(kgdb) bt
#0 __curthread () at ./machine/pcpu.h:230
#1 doadump (textdump=) at /usr/src/sys/kern/kern_shutdown.c:366
#2 0xffffffff80b9b14b in kern_reboot (howto=260) at /usr/src/sys/kern/kern_shutdown.c:446
#3 0xffffffff80b9b5c3 in vpanic (fmt=, ap=0xfffffe003f231570) at /usr/src/sys/kern/kern_shutdown.c:872
#4 0xffffffff80b9b3b3 in panic (fmt=) at /usr/src/sys/kern/kern_shutdown.c:799
#5 0xffffffff8107496f in trap_fatal (frame=0xfffffe003f231760, eva=838976592) at /usr/src/sys/amd64/amd64/trap.c:929
#6 0xffffffff810749c9 in trap_pfault (frame=0xfffffe003f231760, usermode=0) at /usr/src/sys/amd64/amd64/trap.c:765
#7 0xffffffff81073fee in trap (frame=0xfffffe003f231760) at /usr/src/sys/amd64/amd64/trap.c:441
#8
#9 __mtx_lock_sleep (c=0xfffffe00493fa230, v=) at /usr/src/sys/kern/kern_mutex.c:565
#10 0xffffffff808a6029 in re_shutdown (dev=) at /usr/src/sys/dev/re/if_re.c:3772
#11 0xffffffff80bd878a in DEVICE_SHUTDOWN (dev=) at ./device_if.h:262
#12 device_shutdown (dev=0xfffff800037d9100) at /usr/src/sys/kern/subr_bus.c:3065
#13 bus_generic_shutdown (dev=) at /usr/src/sys/kern/subr_bus.c:3760
#14 0xffffffff80bd878a in DEVICE_SHUTDOWN (dev=) at ./device_if.h:262
#15 device_shutdown (dev=0xfffff800037d9200) at /usr/src/sys/kern/subr_bus.c:3065
#16 bus_generic_shutdown (dev=) at /usr/src/sys/kern/subr_bus.c:3760
#17 0xffffffff80bd878a in DEVICE_SHUTDOWN (dev=) at ./device_if.h:262
#18 device_shutdown (dev=0xfffff80003626900) at /usr/src/sys/kern/subr_bus.c:3065
#19 bus_generic_shutdown (dev=) at /usr/src/sys/kern/subr_bus.c:3760
#20 0xffffffff80bd878a in DEVICE_SHUTDOWN (dev=) at ./device_if.h:262
#21 device_shutdown (dev=0xfffff80003627400) at /usr/src/sys/kern/subr_bus.c:3065
#22 bus_generic_shutdown (dev=) at /usr/src/sys/kern/subr_bus.c:3760
#23 0xffffffff80bd878a in DEVICE_SHUTDOWN (dev=) at ./device_if.h:262
#24 device_shutdown (dev=0xfffff8000355f300) at /usr/src/sys/kern/subr_bus.c:3065
#25 bus_generic_shutdown (dev=) at /usr/src/sys/kern/subr_bus.c:3760
#26 0xffffffff80452a8d in acpi_shutdown (dev=0xfffffe00493fa230) at /usr/src/sys/dev/acpica/acpi.c:758
#27 0xffffffff80bd878a in DEVICE_SHUTDOWN (dev=) at ./device_if.h:262
#28 device_shutdown (dev=0xfffff80003560400) at /usr/src/sys/kern/subr_bus.c:3065
#29 bus_generic_shutdown (dev=) at /usr/src/sys/kern/subr_bus.c:3760
#30 0xffffffff80bd878a in DEVICE_SHUTDOWN (dev=) at ./device_if.h:262
#31 device_shutdown (dev=0xfffff8000334ea00) at /usr/src/sys/kern/subr_bus.c:3065
#32 bus_generic_shutdown (dev=) at /usr/src/sys/kern/subr_bus.c:3760
#33 0xffffffff80bdbb6e in DEVICE_SHUTDOWN (dev=0xfffff8000337dd00) at ./device_if.h:262
#34 device_shutdown (dev=0xfffff8000337dd00) at /usr/src/sys/kern/subr_bus.c:3065
#35 root_bus_module_handler (mod=, what=, arg=) at /usr/src/sys/kern/subr_bus.c:4951
#36 0xffffffff80b7a86f in module_shutdown (arg1=, arg2=) at /usr/src/sys/kern/kern_module.c:104
#37 0xffffffff80b9b1da in kern_reboot (howto=0) at /usr/src/sys/kern/kern_shutdown.c:449
#38 0xffffffff80b9acb1 in sys_reboot (td=0xfffff80003320580, uap=0xfffff80003320940) at /usr/src/sys/kern/kern_shutdown.c:280
#39 0xffffffff81075449 in syscallenter (td=) at /usr/src/sys/amd64/amd64/../../kern/subr_syscall.c:135
#40 amd64_syscall (td=0xfffff80003320580, traced=0) at /usr/src/sys/amd64/amd64/trap.c:1076
#41
#42 0x0000000000244e4a in ?? ()
Backtrace stopped: Cannot access memory at address 0x7fffffffe6e8
(kgdb) list *0xffffffff80b7ad1d
0xffffffff80b7ad1d is in __mtx_lock_sleep (/usr/src/sys/kern/kern_mutex.c:565).
560             /*
561              * If the owner is running on another CPU, spin until the
562              * owner stops running or the state of the lock changes.
563              */
564             owner = lv_mtx_owner(v);
565             if (TD_IS_RUNNING(owner)) {
566                     if (LOCK_LOG_TEST(&m->lock_object, 0))
567                             CTR3(KTR_LOCK,
568                                 "%s: spinning on %p held by %p",
569                                 __func__, m, owner);
(kgdb)
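
A quick sanity check on the numbers in the output above, as a minimal
Python sketch. The helper names are hypothetical and not part of kgdb,
and the kernel-half boundary is an assumption about the usual amd64
layout rather than something taken from this dump. It checks that the
decimal eva passed to trap_fatal in frame #5 is the same value as the
"fault virtual address" in the trap message, and that this address does
not fall in the kernel half of the address space, unlike the pointer
passed to __mtx_lock_sleep in frame #9.

```python
# Values copied from the trap output and the kgdb backtrace above.
FAULT_VA_HEX = "0x3201c450"      # "fault virtual address" in the trap message
EVA_DECIMAL = 838976592          # eva argument of trap_fatal (frame #5)
MTX_ARG_C = 0xFFFFFE00493FA230   # c argument of __mtx_lock_sleep (frame #9)

# Assumption: on amd64 the kernel occupies the upper canonical half of the
# address space, so kernel pointers are at or above this boundary.
KERNEL_HALF_START = 0xFFFF800000000000


def parse_hex(text: str) -> int:
    """Convert a hex string such as '0x3201c450' to an integer."""
    return int(text, 16)


def looks_like_kernel_pointer(addr: int) -> bool:
    """Return True if addr lies in the upper (kernel) half of the address space."""
    return addr >= KERNEL_HALF_START


if __name__ == "__main__":
    fault_va = parse_hex(FAULT_VA_HEX)
    print("fault va matches eva:", fault_va == EVA_DECIMAL)
    print("fault va in kernel half:", looks_like_kernel_pointer(fault_va))
    print("mutex arg in kernel half:", looks_like_kernel_pointer(MTX_ARG_C))
```

If the assumption about the address layout holds, the faulting read at
kern_mutex.c:565 went through an owner value that is not a plausible
kernel thread pointer. That would be consistent with the lock word
holding stale or corrupted data by the time re_shutdown ran, though the
dump alone does not prove it.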