From: James Snow
To: Marco Steinbach, Adam
Cc: freebsd-stable@freebsd.org
Date: Thu, 25 Jul 2019 16:32:39 +0000
Subject: Re: Random panics in 11.0 and 12.0 on J1900
Message-ID: <20190725163239.GS5965@teardrop.org>

Hi Marco and Adam,

Thanks for the responses. Answers to your questions are inline....

On Sat, Jul 20, 2019 at 06:56:19PM +0200, Marco Steinbach wrote:
> I've outfitted all of them with 4-port Intel PRO/1000 PCIe driven by
> igb(4), and am not using the onboard re(4) NICs.

We use the onboard re(4) NICs and they have been their own problem. It's
possible they are implicated here.

> I can't recall ever seeing a panic like you described. Could you share
> a full dmesg and what mainboard(s) you are using ?

/var/run/dmesg.boot from the 12.0 host that panicked is included below.
The board is a "Q1900G2-M V2.0".

On Wed, Jul 24, 2019 at 07:53:54PM -0500, Adam wrote:
> What is the size of this J1900 set?

Large enough that 1% panicking daily means I'm seeing multiple panics
per day.

> Do you also have J1900 which do not exhibit the problem?

I do have a small set which have not exhibited the problem. They are
about 2.5-3% of the fleet. What makes them unique is they are running
10.3. There are also some 11.0s which have not panicked, but given that
we've seen hosts go ~620 days before a panic, it's possible they just
haven't panicked yet; they are also a minority of the 11s. (It's also
possible the 10.3s just haven't panicked yet, but as they have been
deployed the longest, that seems less probable with each passing day.)

Personally, I believe this is a hardware problem, but these 10.3s that
don't panic are a big hole in that theory.

> memtest cannot conclusively confirm dimm is good, it is only conclusive on
> bad ones. You can find more info about others learning this lesson
> here(see extended comments):
>
> https://superuser.com/questions/547822/how-many-passes-are-enough-with-memtest
>
> > Two, a small number of systems on the same hardware are running
> > 10.3-RELEASE, and have experienced no panics in their history. Panics
> > have only happened on 11s, and now 12.
>
> Once upon a time in a hypothetical universe, I had a stick of ram which
> would run on Win98 for very long periods without issue. It wouldn't even
> boot with Win NT. After the manufacturer sent the same one back twice, I
> tased it and RMA'd again. This time, I got a new stick and all was good.
>
> The point is memory issues can be very subtle and replacing with known good
> modules is the easiest way to be sure.

Duly noted, and I don't disagree, but given your comments about memtest
and confirming memory to be good, how do you get to "known good?"

Thanks for the input. dmesg output follows below.


-Snow

---<>---
Copyright (c) 1992-2018 The FreeBSD Project.
Copyright (c) 1979, 1980, 1983, 1986, 1988, 1989, 1991, 1992, 1993, 1994
	The Regents of the University of California. All rights reserved.
FreeBSD is a registered trademark of The FreeBSD Foundation.
FreeBSD 12.0-RELEASE r341666 GENERIC amd64
FreeBSD clang version 6.0.1 (tags/RELEASE_601/final 335540) (based on LLVM 6.0.1)
CPU: Intel(R) Celeron(R) CPU J1900 @ 1.99GHz (2000.06-MHz K8-class CPU)
  Origin="GenuineIntel" Id=0x30678 Family=0x6 Model=0x37 Stepping=8
  Features=0xbfebfbff
  Features2=0x41d8e3bf
  AMD Features=0x28100800
  AMD Features2=0x101
  Structured Extended Features=0x2282
  Structured Extended Features3=0xc000000
  VT-x: PAT,HLT,MTF,PAUSE,EPT,UG,VPID
  TSC: P-state invariant, performance statistics
real memory = 8589934592 (8192 MB)
avail memory = 8089657344 (7714 MB)
Event timer "LAPIC" quality 600
ACPI APIC Table:
WARNING: L1 data cache covers fewer APIC IDs than a core (0 < 1)
FreeBSD/SMP: Multiprocessor System Detected: 4 CPUs
FreeBSD/SMP: 1 package(s) x 4 core(s)
random: unblocking device.
Firmware Warning (ACPI): 32/64X length mismatch in FADT/Gpe0Block: 128/32 (20181003/tbfadt-748)
ioapic0 irqs 0-86 on motherboard
Launching APs: 3 2 1
Timecounter "TSC" frequency 2000056560 Hz quality 1000
random: entropy device external interface
kbd1 at kbdmux0
netmap: loaded module
[ath_hal] loaded
random: registering fast source Intel Secure Key RNG
random: fast provider: "Intel Secure Key RNG"
nexus0
cryptosoft0: on motherboard
acpi0: on motherboard
acpi0: Power Button (fixed)
unknown: I/O range not supported
cpu0: on acpi0
atrtc0: port 0x70-0x77 on acpi0
atrtc0: Warning: Couldn't map I/O.
atrtc0: registered as a time-of-day clock, resolution 1.000000s
Event timer "RTC" frequency 32768 Hz quality 0
hpet0: iomem 0xfed00000-0xfed003ff irq 8 on acpi0
Timecounter "HPET" frequency 14318180 Hz quality 950
Event timer "HPET" frequency 14318180 Hz quality 450
Event timer "HPET1" frequency 14318180 Hz quality 440
Event timer "HPET2" frequency 14318180 Hz quality 440
attimer0: port 0x40-0x43,0x50-0x53 irq 0 on acpi0
Timecounter "i8254" frequency 1193182 Hz quality 0
Event timer "i8254" frequency 1193182 Hz quality 100
Timecounter "ACPI-safe" frequency 3579545 Hz quality 850
acpi_timer0: <24-bit timer at 3.579545MHz> port 0x408-0x40b on acpi0
pcib0: port 0xcf8-0xcff on acpi0
pci0: on pcib0
vgapci0: port 0xf080-0xf087 mem 0xd0000000-0xd03fffff,0xc0000000-0xcfffffff irq 16 at device 2.0 on pci0
vgapci0: Boot video device
ahci0: port 0xf070-0xf077,0xf060-0xf063,0xf050-0xf057,0xf040-0xf043,0xf020-0xf03f mem 0xd0816000-0xd08167ff irq 19 at device 19.0 on pci0
ahci0: AHCI v1.30 with 2 3Gbps ports, Port Multiplier not supported
ahcich1: at channel 1 on ahci0
xhci0: mem 0xd0800000-0xd080ffff irq 20 at device 20.0 on pci0
xhci0: 32 bytes context size, 64-bit DMA
xhci0: Port routing mask set to 0xffffffff
usbus0 on xhci0
usbus0: 5.0Gbps Super Speed USB v3.0
pci0: at device 26.0 (no driver attached)
hdac0: mem 0xd0810000-0xd0813fff irq 22 at device 27.0 on pci0
pcib1: irq 16 at device 28.0 on pci0
pci1: on pcib1
re0: port 0xe000-0xe0ff mem 0xd0704000-0xd0704fff,0xd0700000-0xd0703fff irq 16 at device 0.0 on pci1
re0: Using 1 MSI message
re0: Chip rev. 0x2c800000
re0: MAC rev. 0x00100000
miibus0: on re0
rgephy0: PHY 1 on miibus0
rgephy0: none, 10baseT, 10baseT-FDX, 10baseT-FDX-flow, 100baseTX, 100baseTX-FDX, 100baseTX-FDX-flow, 1000baseT, 1000baseT-master, 1000baseT-FDX, 1000baseT-FDX-master, 1000baseT-FDX-flow, 1000baseT-FDX-flow-master, auto, auto-flow
re0: Using defaults for TSO: 65518/35/2048
re0: Ethernet address: 40:62:31:03:e4:1e
re0: netmap queues/slots: TX 1/256, RX 1/256
pcib2: irq 17 at device 28.1 on pci0
pci2: on pcib2
pcib3: irq 18 at device 28.2 on pci0
pci3: on pcib3
re1: port 0xd000-0xd0ff mem 0xd0604000-0xd0604fff,0xd0600000-0xd0603fff irq 16 at device 0.0 on pci3
re1: Using 1 MSI message
re1: Chip rev. 0x2c800000
re1: MAC rev. 0x00100000
miibus1: on re1
rgephy1: PHY 1 on miibus1
rgephy1: none, 10baseT, 10baseT-FDX, 10baseT-FDX-flow, 100baseTX, 100baseTX-FDX, 100baseTX-FDX-flow, 1000baseT, 1000baseT-master, 1000baseT-FDX, 1000baseT-FDX-master, 1000baseT-FDX-flow, 1000baseT-FDX-flow-master, auto, auto-flow
re1: Using defaults for TSO: 65518/35/2048
re1: Ethernet address: 40:62:31:03:e4:1f
re1: netmap queues/slots: TX 1/256, RX 1/256
pcib4: irq 19 at device 28.3 on pci0
pci4: on pcib4
ehci0: mem 0xd0815000-0xd08153ff irq 23 at device 29.0 on pci0
usbus1: EHCI version 1.0
usbus1 on ehci0
usbus1: 480Mbps High Speed USB v2.0
isab0: at device 31.0 on pci0
isa0: on isab0
acpi_button0: on acpi0
acpi_button1: on acpi0
uart0: <16550 or compatible> port 0x3f8-0x3ff irq 4 flags 0x10 on acpi0
atkbdc0: port 0x60,0x64 irq 1 on acpi0
atkbd0: irq 1 on atkbdc0
kbd0 at atkbd0
atkbd0: [GIANT-LOCKED]
sc0: at flags 0x100 on isa0
sc0: VGA <16 virtual consoles, flags=0x300>
sc0: non-PNP ISA device will be removed from GENERIC in FreeBSD 12.
vga0: at port 0x3c0-0x3df iomem 0xa0000-0xbffff pnpid PNP0900 on isa0
est0: on cpu0
ZFS filesystem version: 5
ZFS storage pool version: features support (5000)
Timecounters tick every 1.000 msec
hdacc0: at cad 0 on hdac0
hdaa0: at nid 1 on hdacc0
pcm0: at nid 27,20 and 24,25 on hdaa0
hdacc1: at cad 2 on hdac0
hdaa1: at nid 1 on hdacc1
pcm1: at nid 4 on hdaa1
ugen1.1: at usbus1
ugen0.1: <0x8086 XHCI root HUB> at usbus0
ada0 at ahcich1 bus 0 scbus0 target 0 lun 0
ada0: ACS-4 ATA SATA 3.x device
ada0: Serial Number K1DTC7A41233647
ada0: 300.000MB/s transfers (SATA 2.x, UDMA6, PIO 8192bytes)
ada0: Command Queueing enabled
ada0: 122104MB (250069680 512 byte sectors)
uhub0: on usbus1
sysctl_warn_reuse: can't re-use a leaf (dev.uhub.%parent)!
Trying to mount root from zfs:zroot/ROOT/default []...
uhub1: <0x8086 XHCI root HUB, class 9/0, rev 3.00/1.00, addr 1> on usbus0
Root mount waiting for: usbus1 usbus0
uhub1: 7 ports with 7 removable, self powered
Root mount waiting for: usbus1
Root mount waiting for: usbus1
uhub0: 8 ports with 8 removable, self powered
Root mount waiting for: usbus1
ugen1.2: at usbus1
uhub2 on uhub0
uhub2: on usbus1
Root mount waiting for: usbus1
uhub2: 4 ports with 4 removable, self powered
ugen1.3: at usbus1
ukbd0 on uhub2
ukbd0: on usbus1
kbd2 at ukbd0
Root mount waiting for: usbus1
ugen1.4: at usbus1
uhub3 on uhub2
uhub3: on usbus1
uhub3: MTT enabled
Root mount waiting for: usbus1
uhub3: 4 ports with 4 removable, self powered
lo0: link state changed to UP
re0: link state changed to DOWN
re1: link state changed to DOWN
re0: link state changed to UP
uhid0 on uhub2
uhid0: on usbus1