Date:      Sat, 20 Jul 2019 18:56:19 +0200 (CEST)
From:      Marco Steinbach <coco@executive-computing.de>
To:        James Snow <snow@teardrop.org>
Cc:        freebsd-stable@freebsd.org
Subject:   Re: Random panics in 11.0 and 12.0 on J1900
Message-ID:  <alpine.BSF.2.21.9999.1907201855470.91670@probsd.c0c0.intra>
In-Reply-To: <20190710162636.GM5965@teardrop.org>
References:  <20190710162636.GM5965@teardrop.org>

> I have a set of J1900 hosts running 11.0-RELEASE-p1 that experience
> seemingly random panics. The panics are all basically the same:
>
> Fatal trap 12: page fault while in kernel mode
> fault code = supervisor read data, page not present
>
> Adding workloads to the hosts seems to increase panic frequency, but the
> panics have also occurred on completely idle hosts. Similarly, uptime
> when panicking has been as low as minutes, and as high as ~620 days.
>
> For reasons, it has not been possible to extract a coredump from these
> hosts, nor practical to run memtest on them or upgrade them to a newer
> release. About 1% of our hosts are affected each day, so we've just been
> living with the problem.
>
> However, while testing 12.0 on the same hardware, I encountered the same
> panic and was able to capture the core dump. (See below.)
>
> All of my Google-fu on this panic has turned up threads suggesting the
> problem is hardware, but there are two problems with that idea...
>
> One, memtest has turned up no errors on the 12.0 host I witnessed the
> panic on.
>
> Two, a small number of systems on the same hardware are running
> 10.3-RELEASE, and have experienced no panics in their history. Panics
> have only happened on 11s, and now 12.
>
> kgdb output from the panic follows. (This particular host was in the
> middle of rebooting when it panicked.)
>
> Hoping someone here has some insight. My uninformed wild-ass guess is
> something relating to spectre/meltdown fixes.
>
> Thanks,
>
>
> -Snow

I've been running 10.x, 11.x and 12.0 for a while on several J1900s, namely ASRock Q1900M and Q1900M Pro3 boards.

All of them get a good beating on occasion, for example running poudriere on top of GELI and ZFS software RAIDs attached to the onboard 2-port AHCI SATA controller and Marvell-based PCIe 4-port SATA controllers.

I've outfitted all of them with 4-port Intel PRO/1000 PCIe NICs driven by igb(4), and am not using the onboard re(4) NICs.

I can't recall ever seeing a panic like the one you described. Could you share a full dmesg and tell us which mainboard(s) you are using?

Kind regards, CoCo
