Date: Mon, 19 Nov 2018 20:38:02 +0700
From: Eugene Grosbein <eugen@grosbein.net>
To: "Patrick M. Hausen" <hausen@punkt.de>, freebsd-stable@freebsd.org
Subject: Re: Memory error logged in /var/log/messages
Message-ID: <045b0969-2dbc-be3b-0688-246fe1e14f92@grosbein.net>
In-Reply-To: <04F0C04D-7DD7-4079-8D2E-9824B69573D3@punkt.de>
References: <04F0C04D-7DD7-4079-8D2E-9824B69573D3@punkt.de>
19.11.2018 20:10, Patrick M. Hausen wrote:

> Hi all,
>
> one of our production servers, 11.2p3 is logging this every couple of minutes:
>
> Nov 19 11:48:06 ph002 kernel: MCA: CPU 0 COR (5) OVER MS channel 3 memory error
> Nov 19 11:48:06 ph002 kernel: MCA: Address 0x1f709a48c0
> Nov 19 11:48:06 ph002 kernel: MCA: Misc 0x90010000040188c
> Nov 19 11:48:06 ph002 kernel: MCA: Bank 12, Status 0xcc00010c000800c3
> Nov 19 11:48:06 ph002 kernel: MCA: Global Cap 0x0000000007000c16, Status 0x0000000000000000
> Nov 19 11:48:06 ph002 kernel: MCA: Vendor "GenuineIntel", ID 0x406f1, APIC ID 0
>
> Address and core varies but it is always bank 12.
>
> It seems like applications are unaffected, we use, of course ECC memory.
>
> Is the OS able to work around these errors and just notifies us or is in-memory
> data already getting corrupted?
>
> I’m at a bit of a loss identifying which DIMM might be the cause so I contacted Supermicro
> support. They answered:
>
>> We can't really answer this, we do not know how various OS's map the memory slots.
>> Our advise is always to look at IPMI, but if that doesn't log any issues then we're not sure you're looking at a hardware issue.
>>
>> But assuming the OS looks at the ranks of a module as a bank and you use dual rank memory then it should logically point at DIMMC2.
>
> They are right on the IPMI (I told them when opening the case) - there’s nothing at all
> in the event log.
>
> Can they be correct that it might not even be a hardware issue?

Use the sysutils/mcelog port (or package) to decode such MCA logs with
the "mcelog --no-dmi --ascii" command. For your logs, it reports:

> Hardware event. This is not a software error.
> CPU 0 BANK 12
> MISC 0 ADDR 0
> MCG status:
> MemCtrl: Corrected patrol scrub error
> STATUS cc00010c000800c3 MCGSTATUS 0
> MCGCAP 7000c16 APICID 0 SOCKETID 0
> CPUID Vendor Intel Family 6 Model 79
> (Fields were incomplete)

This looks like a hardware memory error that was corrected by ECC, so there is no data corruption (yet).
You'd better replace a module in BANK 12 of CPU 0.
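
The main fields of the Status value can also be checked by hand. Below is a minimal Python sketch, assuming the architectural IA32_MCi_STATUS layout described in the Intel SDM (flag bits 63..57, MCA error code in bits 15..0, memory controller codes of the form 0000 0000 1MMM CCCC). The names decode_status, FLAGS and MEM_OPS are made up for illustration; they are not part of mcelog or the FreeBSD kernel.

```python
# Architectural flag bits of IA32_MCi_STATUS (bit position -> name).
FLAGS = {63: "VAL", 62: "OVER", 61: "UC", 60: "EN",
         59: "MISCV", 58: "ADDRV", 57: "PCC"}

# MMM field of a memory controller error code (0000 0000 1MMM CCCC).
MEM_OPS = {
    0: "generic undefined request",
    1: "memory read error",
    2: "memory write error",
    3: "address/command error",
    4: "memory scrubbing error",
}


def decode_status(status):
    """Decode a few architectural fields of a 64-bit MCi_STATUS value."""
    flags = [name for bit, name in sorted(FLAGS.items(), reverse=True)
             if (status >> bit) & 1]
    code = status & 0xFFFF  # MCA error code, bits 15..0
    result = {
        "flags": flags,
        "corrected": not ((status >> 61) & 1),  # UC clear means corrected
        "mca_code": code,
    }
    # Memory controller errors have the upper nine bits of the code
    # equal to 0000 0000 1.
    if (code & 0xFF80) == 0x0080:
        result["operation"] = MEM_OPS.get((code >> 4) & 0x7, "reserved")
        channel = code & 0xF
        # Channel value 1111 means "channel not specified".
        result["channel"] = None if channel == 0xF else channel
    return result


if __name__ == "__main__":
    for key, value in decode_status(0xCC00010C000800C3).items():
        print(key, value)
```

For the status above this gives the flags VAL, OVER, MISCV and ADDRV, a corrected (not UC) error, the operation "memory scrubbing error" and channel 3, which agrees with the "COR ... OVER MS channel 3 memory error" line in the kernel log and with the mcelog output.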
Want to link to this message? Use this URL: <https://mail-archive.FreeBSD.org/cgi/mid.cgi?045b0969-2dbc-be3b-0688-246fe1e14f92>
