Date: Sat, 20 Jul 2019 18:56:19 +0200 (CEST)
From: Marco Steinbach <coco@executive-computing.de>
To: James Snow <snow@teardrop.org>
Cc: freebsd-stable@freebsd.org
Subject: Re: Random panics in 11.0 and 12.0 on J1900
Message-ID: <alpine.BSF.2.21.9999.1907201855470.91670@probsd.c0c0.intra>
In-Reply-To: <20190710162636.GM5965@teardrop.org>
References: <20190710162636.GM5965@teardrop.org>
> I have a set of J1900 hosts running 11.0-RELEASE-p1 that experience
> seemingly random panics. The panics are all basically the same:
>
> Fatal trap 12: page fault while in kernel mode
> fault code = supervisor read data, page not present
>
> Adding workloads to the hosts seems to increase panic frequency, but the
> panics have also occurred on completely idle hosts. Similarly, uptime
> when panicking has been as low as minutes, and as high as ~620 days.
>
> For reasons, it has not been possible to extract a coredump from these
> hosts, nor practical to run memtest on them or upgrade them to a newer
> release. About 1% of our hosts are affected each day, so we've just been
> living with the problem.
>
> However, while testing 12.0 on the same hardware, I encountered the same
> panic and was able to capture the core dump. (See below.)
>
> All of my Google-fu on this panic has turned up threads suggesting the
> problem is hardware, but there are two problems with that idea...
>
> One, memtest has turned up no errors on 12.0 host I witnessed the panic
> on.
>
> Two, a small number of systems on the same hardware are running
> 10.3-RELEASE, and have experienced no panics in their history. Panics
> have only happened on 11s, and now 12.
>
> kgdb output from the panic follows. (This particular host was in the
> middle of rebooting when it panicked.)
>
> Hoping someone here has some insight. My uninformed wild-ass guess is
> something relating to spectre/meltdown fixes.
>
> Thanks,
>
>
> -Snow

I've been running 10.x, 11.x and 12.0 for a while on several J1900s,
namely ASRock Q1900M and Q1900M Pro3 boards.

All of them take a good beating on occasion, for example running
poudriere on top of GELI and ZFS software RAIDs attached to the onboard
2-port AHCI SATA controller and Marvell-based PCIe 4-port SATA
controllers. I've outfitted all of them with 4-port Intel PRO/1000 PCIe
NICs driven by igb(4), and am not using the onboard re(4) NICs.
I can't recall ever seeing a panic like the one you described. Could you
share a full dmesg and tell us which mainboard(s) you are using?

Kind regards,
CoCo