From: Alfred Bartsch <bartsch@dssgmbh.de>
To: "Patrick M. Hausen", freebsd-stable@freebsd.org
Subject: Re: Memory error logged in /var/log/messages
Date: Tue, 20 Nov 2018 10:08:40 +0100
Message-ID: <5f345b71-fe30-6b7a-0581-30d0041ff065@dssgmbh.de>
In-Reply-To: <04F0C04D-7DD7-4079-8D2E-9824B69573D3@punkt.de>

On 19.11.18 at 14:10, Patrick M. Hausen wrote:
> Hi all,
>
> one of our production servers, 11.2p3 is logging this every couple of minutes:
>
> Nov 19 11:48:06 ph002 kernel: MCA: CPU 0 COR (5) OVER MS channel 3 memory error
> Nov 19 11:48:06 ph002 kernel: MCA: Address 0x1f709a48c0
> Nov 19 11:48:06 ph002 kernel: MCA: Misc 0x90010000040188c
> Nov 19 11:48:06 ph002 kernel: MCA: Bank 12, Status 0xcc00010c000800c3
> Nov 19 11:48:06 ph002 kernel: MCA: Global Cap 0x0000000007000c16, Status 0x0000000000000000
> Nov 19 11:48:06 ph002 kernel: MCA: Vendor "GenuineIntel", ID 0x406f1, APIC ID 0
>
> Address and core varies but it is always bank 12.
>
> It seems like applications are unaffected, we use, of course ECC memory.
>
> Is the OS able to work around these errors and just notifies us or is in-memory
> data already getting corrupted?
>
> I’m at a bit of a loss identifying which DIMM might be the cause so I contacted Supermicro
> support. They answered:
>
>> We can't really answer this, we do not know how various OS's map the memory slots.
>> Our advise is always to look at IPMI, but if that doesn't log any issues then we're not sure you're looking at a hardware issue.
>>
>> But assuming the OS looks at the ranks of a module as a bank and you use dual rank memory then it should logically point at DIMMC2.
>
> They are right on the IPMI (I told them when opening the case) - there’s nothing at all
> in the event log.
>
> Can they be correct that it might not even be a hardware issue?
> If not how can I be sure which DIMM is to blame? Spare parts are ready but I’d like to
> have a rather short maintenance break outside regular business hours.
>
> I’ll attach a dmesg.boot. HW is a X10DRW-NT mainboard, SYS-1028R-WTNRT server platform.
>
> Thanks for any hints,
> Patrick

Hi Patrick,

we had a similar experience with one of our servers (HP DL380 G7): tons of MCA errors concerning a single memory bank. This bank number did not correspond to a particular memory slot (HP labels them A to I for each CPU). Neither iLO nor the mcelog output was of any help to me.

We did not notice any data loss, but to get rid of these annoying messages, I did the following: After taking the server out of production, I removed pairs of memory modules until the MCA messages stopped. The last removed pair then contained the problematic module. Re-adding one module of that pair gave a 50 percent chance of picking the defective one: if the messages came back, it was the re-added module, otherwise the one left out. After replacing this module, the server no longer complained about memory problems.

There should definitely be a more sophisticated method to identify problematic memory modules. Perhaps someone on the list is able to shed some light on this kind of error.

-- 
Sincerely
Alfred Bartsch
Data-Service GmbH
Beethovenstr. 2A
23617 Stockelsdorf
phone: +49 451 490010 fax: +49 451 4900123
Registered at Amtsgericht Lübeck, HRB 318 BS
Managing directors: Wilfried Paepcke, Dr. Andreas Longwitz, Dr. Hans-Martin Rasch, Dr. Uwe Szyszka
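
For reference, the Status value in the quoted log can be decoded by hand, which may help with the question of whether data is already being corrupted. Below is a minimal sketch in Python, assuming the architectural IA32_MCi_STATUS bit layout and the memory-controller error code format (000F 0000 1MMM CCCC) described in the Intel SDM. The helper name decode_status and the returned dictionary keys are made up for illustration and are not part of FreeBSD or mcelog.

```python
# Flag bits of the architectural IA32_MCi_STATUS register (Intel SDM, vol. 3).
FLAGS = {63: "VAL", 62: "OVER", 61: "UC", 60: "EN",
         59: "MISCV", 58: "ADDRV", 57: "PCC"}

# MMM field of a memory-controller error code (000F 0000 1MMM CCCC).
MEM_OPS = {0: "generic", 1: "read", 2: "write",
           3: "address/command", 4: "scrub"}


def decode_status(status):
    """Hypothetical helper: split an MCi_STATUS value into readable fields."""
    flags = [name for bit, name in FLAGS.items() if (status >> bit) & 1]
    code = status & 0xFFFF  # low 16 bits hold the MCA error code
    result = {"flags": flags, "corrected": "UC" not in flags, "mca_code": code}
    # Memory-controller errors match the pattern 000F 0000 1MMM CCCC.
    if (code & 0xEF80) == 0x0080:
        result["operation"] = MEM_OPS.get((code >> 4) & 0x7, "reserved")
        channel = code & 0xF
        # Channel 0xF means "channel not specified".
        result["channel"] = None if channel == 0xF else channel
    return result


# Status value taken from the log quoted above.
print(decode_status(0xCC00010C000800C3))
```

For the value from the log, UC is clear and OVER is set, which fits the "COR ... OVER" in the kernel message: the error was corrected by the hardware, and at least one earlier record in that bank was overwritten before it could be read. The error code decodes to a scrub error on channel 3, matching "MS channel 3". Note that the MCA "bank" is a CPU error-reporting register bank, so under this reading "Bank 12" does not by itself name a DIMM slot.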