Date:      Thu, 25 Jul 2019 16:32:39 +0000
From:      James Snow <snow@teardrop.org>
To:        Marco Steinbach <coco@executive-computing.de>, Adam <amvandemore@gmail.com>
Cc:        freebsd-stable@freebsd.org
Subject:   Re: Random panics in 11.0 and 12.0 on J1900
Message-ID:  <20190725163239.GS5965@teardrop.org>
In-Reply-To: <CA+tpaK1DhR64iyKqZWBujaz_VhkoH1nv9Ek+Y8PcKhK27=kz2w@mail.gmail.com> <alpine.BSF.2.21.9999.1907201855470.91670@probsd.c0c0.intra>

Hi Marco and Adam,

Thanks for the responses. Answers to your questions are inline.

On Sat, Jul 20, 2019 at 06:56:19PM +0200, Marco Steinbach wrote:

> I've outfitted all of them with 4-port Intel PRO/1000 PCIe driven by
> igb(4), and am not using the onboard re(4) NICs.

We use the onboard re(4) NICs, and they have been a problem in their own
right. It's possible they are implicated here.

> I can't recall ever seeing a panic like you described. Could you share
> a full dmesg and what mainboard(s) you are using ?

/var/run/dmesg.boot from the 12.0 host that panicked is included below.
The board is a "Q1900G2-M V2.0".


On Wed, Jul 24, 2019 at 07:53:54PM -0500, Adam wrote:

> What is the size of this J1900 set?

Large enough that 1% panicking daily means I'm seeing multiple panics
per day.
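
To put a rough lower bound on that, here is a minimal sketch of the
arithmetic. It assumes "multiple" means at least two panics per day and
takes the 1% figure at face value; the function name is made up for
illustration, and the actual fleet size is not stated here.

```python
import math


def min_fleet_size(daily_panic_percent: float, panics_per_day: int) -> int:
    """Smallest fleet for which the expected daily panic count
    reaches panics_per_day, given the percentage of hosts that
    panic on an average day."""
    return math.ceil(panics_per_day * 100 / daily_panic_percent)


# Assumption: "multiple panics per day" is read as two or more.
print(min_fleet_size(1, 2))
```

Under those assumptions the fleet would be on the order of a couple
hundred hosts or more.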

> Do you also have J1900 which do not exhibit the problem?

I do have a small set which have not exhibited the problem. They are
about 2.5-3% of the fleet. What makes them unique is they are running 10.3.

There are also some 11.0s which have not panicked, but given that we've
seen hosts go ~620 days before a panic, it's possible they just haven't
panicked yet; they are also a minority of the 11s. (It's also possible
the 10.3s just haven't panicked yet, but as they have been deployed the
longest, that seems less probable with each passing day.)

Personally, I believe this is a hardware problem, but these 10.3s that
don't panic are a big hole in that theory.

> memtest cannot conclusively confirm dimm is good, it is only conclusive on
> bad ones.  You can find more info about others learning this lesson
> here(see extended comments):
> 
> https://superuser.com/questions/547822/how-many-passes-are-enough-with-memtest
> 
> 
> > Two, a small number of systems on the same hardware are running
> > 10.3-RELEASE, and have experienced no panics in their history. Panics
> > have only happened on 11s, and now 12.
> >
> 
> Once upon a time in a hypothetical universe, I had a stick of ram which
> would run on Win98 for very long periods without issue.  It wouldn't even
> boot with Win NT.  After the manufacturer sent the same one back twice, I
> tased it and RMA'd again.  This time, I got a new stick and all was good.
> 
> The point is memory issues can be very subtle and replacing with known good
> modules is the easiest way to be sure.

Duly noted, and I don't disagree, but given your comments about memtest
and confirming memory to be good, how do you get to "known good"?

Thanks for the input. dmesg output follows below.


-Snow


---<<BOOT>>---
Copyright (c) 1992-2018 The FreeBSD Project.
Copyright (c) 1979, 1980, 1983, 1986, 1988, 1989, 1991, 1992, 1993, 1994
	The Regents of the University of California. All rights reserved.
FreeBSD is a registered trademark of The FreeBSD Foundation.
FreeBSD 12.0-RELEASE r341666 GENERIC amd64
FreeBSD clang version 6.0.1 (tags/RELEASE_601/final 335540) (based on LLVM 6.0.1)
CPU: Intel(R) Celeron(R) CPU  J1900  @ 1.99GHz (2000.06-MHz K8-class CPU)
  Origin="GenuineIntel"  Id=0x30678  Family=0x6  Model=0x37  Stepping=8
  Features=0xbfebfbff<FPU,VME,DE,PSE,TSC,MSR,PAE,MCE,CX8,APIC,SEP,MTRR,PGE,MCA,CMOV,PAT,PSE36,CLFLUSH,DTS,ACPI,MMX,FXSR,SSE,SSE2,SS,HTT,TM,PBE>
  Features2=0x41d8e3bf<SSE3,PCLMULQDQ,DTES64,MON,DS_CPL,VMX,EST,TM2,SSSE3,CX16,xTPR,PDCM,SSE4.1,SSE4.2,MOVBE,POPCNT,TSCDLT,RDRAND>
  AMD Features=0x28100800<SYSCALL,NX,RDTSCP,LM>
  AMD Features2=0x101<LAHF,Prefetch>
  Structured Extended Features=0x2282<TSCADJ,SMEP,ERMS,NFPUSG>
  Structured Extended Features3=0xc000000<IBPB,STIBP>
  VT-x: PAT,HLT,MTF,PAUSE,EPT,UG,VPID
  TSC: P-state invariant, performance statistics
real memory  = 8589934592 (8192 MB)
avail memory = 8089657344 (7714 MB)
Event timer "LAPIC" quality 600
ACPI APIC Table: <ALASKA A M I >
WARNING: L1 data cache covers fewer APIC IDs than a core (0 < 1)
FreeBSD/SMP: Multiprocessor System Detected: 4 CPUs
FreeBSD/SMP: 1 package(s) x 4 core(s)
random: unblocking device.
Firmware Warning (ACPI): 32/64X length mismatch in FADT/Gpe0Block: 128/32 (20181003/tbfadt-748)
ioapic0 <Version 2.0> irqs 0-86 on motherboard
Launching APs: 3 2 1
Timecounter "TSC" frequency 2000056560 Hz quality 1000
random: entropy device external interface
kbd1 at kbdmux0
netmap: loaded module
[ath_hal] loaded
random: registering fast source Intel Secure Key RNG
random: fast provider: "Intel Secure Key RNG"
nexus0
cryptosoft0: <software crypto> on motherboard
acpi0: <ALASKA A M I > on motherboard
acpi0: Power Button (fixed)
unknown: I/O range not supported
cpu0: <ACPI CPU> on acpi0
atrtc0: <AT realtime clock> port 0x70-0x77 on acpi0
atrtc0: Warning: Couldn't map I/O.
atrtc0: registered as a time-of-day clock, resolution 1.000000s
Event timer "RTC" frequency 32768 Hz quality 0
hpet0: <High Precision Event Timer> iomem 0xfed00000-0xfed003ff irq 8 on acpi0
Timecounter "HPET" frequency 14318180 Hz quality 950
Event timer "HPET" frequency 14318180 Hz quality 450
Event timer "HPET1" frequency 14318180 Hz quality 440
Event timer "HPET2" frequency 14318180 Hz quality 440
attimer0: <AT timer> port 0x40-0x43,0x50-0x53 irq 0 on acpi0
Timecounter "i8254" frequency 1193182 Hz quality 0
Event timer "i8254" frequency 1193182 Hz quality 100
Timecounter "ACPI-safe" frequency 3579545 Hz quality 850
acpi_timer0: <24-bit timer at 3.579545MHz> port 0x408-0x40b on acpi0
pcib0: <ACPI Host-PCI bridge> port 0xcf8-0xcff on acpi0
pci0: <ACPI PCI bus> on pcib0
vgapci0: <VGA-compatible display> port 0xf080-0xf087 mem 0xd0000000-0xd03fffff,0xc0000000-0xcfffffff irq 16 at device 2.0 on pci0
vgapci0: Boot video device
ahci0: <AHCI SATA controller> port 0xf070-0xf077,0xf060-0xf063,0xf050-0xf057,0xf040-0xf043,0xf020-0xf03f mem 0xd0816000-0xd08167ff irq 19 at device 19.0 on pci0
ahci0: AHCI v1.30 with 2 3Gbps ports, Port Multiplier not supported
ahcich1: <AHCI channel> at channel 1 on ahci0
xhci0: <Intel BayTrail USB 3.0 controller> mem 0xd0800000-0xd080ffff irq 20 at device 20.0 on pci0
xhci0: 32 bytes context size, 64-bit DMA
xhci0: Port routing mask set to 0xffffffff
usbus0 on xhci0
usbus0: 5.0Gbps Super Speed USB v3.0
pci0: <encrypt/decrypt> at device 26.0 (no driver attached)
hdac0: <Intel BayTrail HDA Controller> mem 0xd0810000-0xd0813fff irq 22 at device 27.0 on pci0
pcib1: <ACPI PCI-PCI bridge> irq 16 at device 28.0 on pci0
pci1: <ACPI PCI bus> on pcib1
re0: <RealTek 8168/8111 B/C/CP/D/DP/E/F/G PCIe Gigabit Ethernet> port 0xe000-0xe0ff mem 0xd0704000-0xd0704fff,0xd0700000-0xd0703fff irq 16 at device 0.0 on pci1
re0: Using 1 MSI message
re0: Chip rev. 0x2c800000
re0: MAC rev. 0x00100000
miibus0: <MII bus> on re0
rgephy0: <RTL8169S/8110S/8211 1000BASE-T media interface> PHY 1 on miibus0
rgephy0:  none, 10baseT, 10baseT-FDX, 10baseT-FDX-flow, 100baseTX, 100baseTX-FDX, 100baseTX-FDX-flow, 1000baseT, 1000baseT-master, 1000baseT-FDX, 1000baseT-FDX-master, 1000baseT-FDX-flow, 1000baseT-FDX-flow-master, auto, auto-flow
re0: Using defaults for TSO: 65518/35/2048
re0: Ethernet address: 40:62:31:03:e4:1e
re0: netmap queues/slots: TX 1/256, RX 1/256
pcib2: <ACPI PCI-PCI bridge> irq 17 at device 28.1 on pci0
pci2: <ACPI PCI bus> on pcib2
pcib3: <ACPI PCI-PCI bridge> irq 18 at device 28.2 on pci0
pci3: <ACPI PCI bus> on pcib3
re1: <RealTek 8168/8111 B/C/CP/D/DP/E/F/G PCIe Gigabit Ethernet> port 0xd000-0xd0ff mem 0xd0604000-0xd0604fff,0xd0600000-0xd0603fff irq 16 at device 0.0 on pci3
re1: Using 1 MSI message
re1: Chip rev. 0x2c800000
re1: MAC rev. 0x00100000
miibus1: <MII bus> on re1
rgephy1: <RTL8169S/8110S/8211 1000BASE-T media interface> PHY 1 on miibus1
rgephy1:  none, 10baseT, 10baseT-FDX, 10baseT-FDX-flow, 100baseTX, 100baseTX-FDX, 100baseTX-FDX-flow, 1000baseT, 1000baseT-master, 1000baseT-FDX, 1000baseT-FDX-master, 1000baseT-FDX-flow, 1000baseT-FDX-flow-master, auto, auto-flow
re1: Using defaults for TSO: 65518/35/2048
re1: Ethernet address: 40:62:31:03:e4:1f
re1: netmap queues/slots: TX 1/256, RX 1/256
pcib4: <ACPI PCI-PCI bridge> irq 19 at device 28.3 on pci0
pci4: <ACPI PCI bus> on pcib4
ehci0: <Intel BayTrail USB 2.0 controller> mem 0xd0815000-0xd08153ff irq 23 at device 29.0 on pci0
usbus1: EHCI version 1.0
usbus1 on ehci0
usbus1: 480Mbps High Speed USB v2.0
isab0: <PCI-ISA bridge> at device 31.0 on pci0
isa0: <ISA bus> on isab0
acpi_button0: <Power Button> on acpi0
acpi_button1: <Sleep Button> on acpi0
uart0: <16550 or compatible> port 0x3f8-0x3ff irq 4 flags 0x10 on acpi0
atkbdc0: <Keyboard controller (i8042)> port 0x60,0x64 irq 1 on acpi0
atkbd0: <AT Keyboard> irq 1 on atkbdc0
kbd0 at atkbd0
atkbd0: [GIANT-LOCKED]
sc0: <System console> at flags 0x100 on isa0
sc0: VGA <16 virtual consoles, flags=0x300>
sc0: non-PNP ISA device will be removed from GENERIC in FreeBSD 12.
vga0: <Generic ISA VGA> at port 0x3c0-0x3df iomem 0xa0000-0xbffff pnpid PNP0900 on isa0
est0: <Enhanced SpeedStep Frequency Control> on cpu0
ZFS filesystem version: 5
ZFS storage pool version: features support (5000)
Timecounters tick every 1.000 msec
hdacc0: <Realtek ALC662 HDA CODEC> at cad 0 on hdac0
hdaa0: <Realtek ALC662 Audio Function Group> at nid 1 on hdacc0
pcm0: <Realtek ALC662 (Analog 2.0+HP/2.0)> at nid 27,20 and 24,25 on hdaa0
hdacc1: <Intel (0x2882) HDA CODEC> at cad 2 on hdac0
hdaa1: <Intel (0x2882) Audio Function Group> at nid 1 on hdacc1
pcm1: <Intel (0x2882) (HDMI/DP 8ch)> at nid 4 on hdaa1
ugen1.1: <Intel EHCI root HUB> at usbus1
ugen0.1: <0x8086 XHCI root HUB> at usbus0
ada0 at ahcich1 bus 0 scbus0 target 0 lun 0
ada0: <Hoodisk SSD SBFM01.2> ACS-4 ATA SATA 3.x device
ada0: Serial Number K1DTC7A41233647
ada0: 300.000MB/s transfers (SATA 2.x, UDMA6, PIO 8192bytes)
ada0: Command Queueing enabled
ada0: 122104MB (250069680 512 byte sectors)
uhub0: <Intel EHCI root HUB, class 9/0, rev 2.00/1.00, addr 1> on usbus1
sysctl_warn_reuse: can't re-use a leaf (dev.uhub.%parent)!
Trying to mount root from zfs:zroot/ROOT/default []...
uhub1: <0x8086 XHCI root HUB, class 9/0, rev 3.00/1.00, addr 1> on usbus0
Root mount waiting for: usbus1 usbus0
uhub1: 7 ports with 7 removable, self powered
Root mount waiting for: usbus1
Root mount waiting for: usbus1
uhub0: 8 ports with 8 removable, self powered
Root mount waiting for: usbus1
ugen1.2: <vendor 0x8087 product 0x07e6> at usbus1
uhub2 on uhub0
uhub2: <vendor 0x8087 product 0x07e6, class 9/0, rev 2.00/0.14, addr 2> on usbus1
Root mount waiting for: usbus1
uhub2: 4 ports with 4 removable, self powered
ugen1.3: <SINO WEALTH USB KEYBOARD> at usbus1
ukbd0 on uhub2
ukbd0: <SINO WEALTH USB KEYBOARD, class 0/0, rev 1.10/1.00, addr 3> on usbus1
kbd2 at ukbd0
Root mount waiting for: usbus1
ugen1.4: <vendor 0x05e3 USB2.0 Hub> at usbus1
uhub3 on uhub2
uhub3: <vendor 0x05e3 USB2.0 Hub, class 9/0, rev 2.00/60.60, addr 4> on usbus1
uhub3: MTT enabled
Root mount waiting for: usbus1
uhub3: 4 ports with 4 removable, self powered
lo0: link state changed to UP
re0: link state changed to DOWN
re1: link state changed to DOWN
re0: link state changed to UP
uhid0 on uhub2
uhid0: <SINO WEALTH USB KEYBOARD, class 0/0, rev 1.10/1.00, addr 3> on usbus1


