From: Eugene Grosbein
To: "Patrick M. Hausen", freebsd-stable@freebsd.org
Subject: Re: Memory error logged in /var/log/messages
Date: Mon, 19 Nov 2018 20:38:02 +0700
Message-ID: <045b0969-2dbc-be3b-0688-246fe1e14f92@grosbein.net>
In-Reply-To: <04F0C04D-7DD7-4079-8D2E-9824B69573D3@punkt.de>
References: <04F0C04D-7DD7-4079-8D2E-9824B69573D3@punkt.de>
List-Id: Production branch of FreeBSD source code

19.11.2018 20:10, Patrick M.
Hausen wrote:

> Hi all,
>
> one of our production servers, 11.2p3 is logging this every couple of minutes:
>
> Nov 19 11:48:06 ph002 kernel: MCA: CPU 0 COR (5) OVER MS channel 3 memory error
> Nov 19 11:48:06 ph002 kernel: MCA: Address 0x1f709a48c0
> Nov 19 11:48:06 ph002 kernel: MCA: Misc 0x90010000040188c
> Nov 19 11:48:06 ph002 kernel: MCA: Bank 12, Status 0xcc00010c000800c3
> Nov 19 11:48:06 ph002 kernel: MCA: Global Cap 0x0000000007000c16, Status 0x0000000000000000
> Nov 19 11:48:06 ph002 kernel: MCA: Vendor "GenuineIntel", ID 0x406f1, APIC ID 0
>
> Address and core varies but it is always bank 12.
>
> It seems like applications are unaffected, we use, of course ECC memory.
>
> Is the OS able to work around these errors and just notifies us or is in-memory
> data already getting corrupted?
>
> I’m at a bit of a loss identifying which DIMM might be the cause so I contacted Supermicro
> support. They answered:
>
>> We can't really answer this, we do not know how various OS's map the memory slots.
>> Our advise is always to look at IPMI, but if that doesn't log any issues then we're not sure you're looking at a hardware issue.
>>
>> But assuming the OS looks at the ranks of a module as a bank and you use dual rank memory then it should logically point at DIMMC2.
>
> They are right on the IPMI (I told them when opening the case) - there’s nothing at all
> in the event log.
>
> Can they be correct that it might not even be a hardware issue?

Use the sysutils/mcelog port (or package) to decode such MCA logs with the
"mcelog --no-dmi --ascii" command. For your logs, it reports:

> Hardware event. This is not a software error.
> CPU 0 BANK 12
> MISC 0 ADDR 0
> MCG status:
> MemCtrl: Corrected patrol scrub error
> STATUS cc00010c000800c3 MCGSTATUS 0
> MCGCAP 7000c16 APICID 0 SOCKETID 0
> CPUID Vendor Intel Family 6 Model 79
> (Fields were incomplete)

This looks like a hardware memory error that was corrected by ECC, so there is
no data corruption (yet). You should replace the module in BANK 12 of CPU 0.
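
To illustrate what such a decode involves, here is a minimal Python sketch that
unpacks the Status value from the log above. It assumes the IA32_MCi_STATUS bit
layout and the memory-controller error code format (000F 0000 1MMM CCCC) as
documented in the Intel SDM. The function name decode_mca_status and the
returned dictionary layout are hypothetical, chosen for illustration only. This
is not how mcelog itself is implemented, and it covers only a small part of
what mcelog decodes.

```python
# Flag bit positions in IA32_MCi_STATUS (assumed from the Intel SDM).
FLAG_BITS = {
    "VAL": 63,    # register contains valid error information
    "OVER": 62,   # an earlier error was overwritten (overflow)
    "UC": 61,     # error was NOT corrected by hardware
    "EN": 60,     # error reporting was enabled
    "MISCV": 59,  # the MISC register is valid
    "ADDRV": 58,  # the ADDR register is valid
    "PCC": 57,    # processor context corrupt
}

# MMM field of a memory-controller error code.
MEM_REQUEST = {
    0: "generic",
    1: "read",
    2: "write",
    3: "address/command",
    4: "scrubbing",
}


def decode_mca_status(status):
    """Decode a few fields of a 64-bit MCA status value (illustrative only)."""
    flags = {name: bool((status >> bit) & 1) for name, bit in FLAG_BITS.items()}
    mcacod = status & 0xFFFF          # architectural error code, bits 15:0
    mscod = (status >> 16) & 0xFFFF   # model-specific error code, bits 31:16
    result = {
        "flags": flags,
        "mcacod": mcacod,
        "mscod": mscod,
        "corrected": flags["VAL"] and not flags["UC"],
        "memory": None,
    }
    # Memory-controller errors match 000F 0000 1MMM CCCC (F is ignored here).
    if (mcacod & 0xEF80) == 0x0080:
        request = (mcacod >> 4) & 0x7
        channel = mcacod & 0xF
        result["memory"] = {
            "request": MEM_REQUEST.get(request, "reserved"),
            # Channel value 15 means "channel not specified".
            "channel": None if channel == 0xF else channel,
        }
    return result


if __name__ == "__main__":
    info = decode_mca_status(0xCC00010C000800C3)
    print(info["corrected"], info["flags"]["OVER"], info["memory"])
```

For the Status value above, the sketch reports a valid, corrected (UC clear)
error with the overflow flag set, classified as a memory scrubbing error on
channel 3. That agrees with the kernel's "COR ... OVER MS channel 3 memory
error" line and with the "Corrected patrol scrub error" line from mcelog.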