From: John Baldwin
To: freebsd-current@freebsd.org
Cc: Steven Hartland
Subject: Re: FreeBsd MCA Panic Crash !!
Date: Mon, 04 Jan 2016 07:10:18 -0800
Message-ID: <7090189.HS4ZXl3oYZ@ralph.baldwin.cx>
In-Reply-To: <568A7F0F.6060307@multiplay.co.uk>

On Monday, January 04, 2016 02:17:51 PM Steven Hartland wrote:
> Bank 5 seems to be common to all the crashes, which may suggest you have
> some dodgy ram or possibly the driving CPU's memory controller.

No, this has nothing to do with that.
"Bank 5" here refers to bank 5 of the machine check registers in the
processor (MC5_*), which is the bank reporting the errors. Different
"banks" of the MC registers handle errors for different parts of the
hardware (and this varies by CPU). For example, on Nehalem CPUs, the
memory controller logs errors (e.g. ECC errors) in bank 8, but that has
no correlation to the "bank" of DIMMs that the error occurred in. Later
Intel CPUs can log the same errors in register banks 8 through 12 (IIRC).
Depending on the CPU model, you can determine more info about the error
using the CPU manuals (for Intel, the SDM).

> As the error says this is a Hardware issue.

Well, mcelog has this hardcoded and prints it for every MCA as a matter
of course. It isn't selective; it assumes every machine check is a
hardware error. They all are, though some are warnings for corrected
events that you can ignore because the hardware hasn't degraded enough to
warrant replacement. However, corrected events don't generate panics,
just messages in the logs, and only a subset of corrected events include
the "yellow / green" indicators, of which you can ignore the "green"
events. Even with corrected ECC errors, I would ignore them if you only
get a few events with a count of 1 that don't recur.

-- 
John Baldwin
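
To make the "register bank" point concrete, here is a rough sketch of how a bank number maps to its MSRs and how an MCi_STATUS value splits into fields. It assumes the architectural layout from the Intel SDM (IA32_MC0_CTL at MSR 0x400, four MSRs per bank). The function names and the sample status value are made up for illustration and are not taken from the crash in this thread. The threshold and count fields are only meaningful when the CPU advertises them in MCG_CAP.

```python
# Sketch only: assumes the architectural IA32_MCi_* layout from the Intel SDM.
# bank_msrs() and decode_status() are hypothetical helper names.

MC0_CTL_MSR = 0x400  # IA32_MC0_CTL; each bank owns 4 consecutive MSRs


def bank_msrs(bank):
    """Return the MSR addresses of the CTL/STATUS/ADDR/MISC registers of a bank."""
    base = MC0_CTL_MSR + 4 * bank
    return {"ctl": base, "status": base + 1, "addr": base + 2, "misc": base + 3}


def decode_status(status):
    """Split a 64-bit MCi_STATUS value into its architectural fields."""
    def bit(n):
        return bool((status >> n) & 1)

    threshold = (status >> 53) & 0x3  # bits 54:53, valid only if MCG_CAP says so
    return {
        "valid": bit(63),
        "overflow": bit(62),
        "uncorrected": bit(61),
        "enabled": bit(60),
        "miscv": bit(59),
        "addrv": bit(58),
        "pcc": bit(57),
        "threshold": ("none", "green", "yellow", "reserved")[threshold],
        "corrected_count": (status >> 38) & 0x7FFF,      # bits 52:38
        "model_specific_code": (status >> 16) & 0xFFFF,  # bits 31:16
        "mca_code": status & 0xFFFF,                     # bits 15:0
    }


if __name__ == "__main__":
    print({name: hex(addr) for name, addr in bank_msrs(5).items()})
    # Made-up value: valid, corrected, "green", count of 1.
    sample = (1 << 63) | (1 << 53) | (1 << 38) | 0x009F
    print(decode_status(sample))
```

Note that the bank index only selects which set of MSRs is read. What that bank actually monitors still has to be looked up per CPU model in the vendor manual.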