From: Jeremy Chadwick
To: Dan Langille
Cc: freebsd-stable
Date: Tue, 24 Aug 2010 16:38:49 -0700
Subject: Re: kernel MCA messages
Message-ID: <20100824233849.GA35100@icarus.home.lan>
In-Reply-To: <4C745213.3050004@langille.org>
References: <4C71CC62.6060803@langille.org> <4C745213.3050004@langille.org>

On Tue, Aug 24, 2010 at 07:13:23PM -0400, Dan Langille wrote:
> On 8/22/2010 9:18 PM, Dan Langille wrote:
> >What does this mean?
> >
> >kernel: MCA: Bank 4, Status 0x940c4001fe080813
> >kernel: MCA: Global Cap 0x0000000000000105, Status 0x0000000000000000
> >kernel: MCA: Vendor "AuthenticAMD", ID 0xf5a, APIC ID 0
> >kernel: MCA: CPU 0 COR BUSLG Source RD Memory
> >kernel: MCA: Address 0x7ff6b0
> >
> >FreeBSD 7.3-STABLE #1: Sun Aug 22 23:16:43
>
> FYI, these are occurring every hour, almost to the second. e.g.
> xx:56:yy, where yy is 09, 10, or 11.
>
> Checking logs, I don't see anything that correlates with this point
> in the hour (i.e 56 minutes past) that doesn't also occur at other
> times.
>
> It seems very odd to occur so regularly.

1) Why haven't you replaced the DIMM in Bank 4 -- or better yet, all the
DIMMs just to be sure? Do this and see if the problem goes away. If
not, no harm done, and you've narrowed it down.

2) What exact manufacturer and model of motherboard is this? If you
can provide a link to a User Manual that would be great.

3) Please go into your system BIOS and find where "ECC ChipKill" options
are available (likely under a Memory, Chipset, or Northbridge section).
Please write down and provide here all of the options and what their
currently selected values are.

4) Please make sure you're running the latest system BIOS. I've seen
on certain Rackable AMD-based systems where Northbridge-related features
don't work quite right (at least with Solaris), resulting in atrocious
memory performance on the system. A BIOS upgrade solved the problem.

There's a ChipKill feature called "ECC BG Scrubbing" that's vague in
definition, given that it's a "background memory scrub" that happens at
intervals which are unknown to me. Maybe 60 minutes? I don't know.
This is why I ask question #3.

For John and other devs: I assume the decoded MCA messages indicate with
absolute certainty that the ECC error is coming from external DRAM and
not, say, bad L1 or L2 cache?
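
For anyone who wants to check the decoded line against the raw Status
value, here is a minimal sketch in Python. It is not the FreeBSD kernel's
decoder. The bit positions are assumptions taken from the architectural
MCi_STATUS layout as I understand it (VAL/OVER/UC/EN/MISCV/ADDRV/PCC in
bits 63-57, a 16-bit error code in the low bits, bus errors encoded as
0000 1PPT RRRR IILL), and the name decode_status is made up for
illustration. Also keep in mind that "Bank" in the MCA output names a
machine-check register bank, which is not necessarily a DIMM slot.

```python
# Sketch only: field positions are assumed from the architectural
# MCi_STATUS layout, not copied from the FreeBSD source.
STATUS_FLAGS = {63: "VAL", 62: "OVER", 61: "UC", 60: "EN",
                59: "MISCV", 58: "ADDRV", 57: "PCC"}

PARTICIPATION = ["SRC", "RES", "OBS", "GEN"]            # PP field
REQUEST = ["GEN", "RD", "WR", "DRD", "DWR", "IRD",
           "PREFETCH", "EVICT", "SNOOP"]                # RRRR field
TARGET = ["MEM", "RESERVED", "IO", "GEN"]               # II field
LEVEL = ["L0", "L1", "L2", "LG"]                        # LL field


def decode_status(status):
    """Split a 64-bit MCA status value into flags and error-code fields."""
    flags = [name for bit, name in sorted(STATUS_FLAGS.items(), reverse=True)
             if (status >> bit) & 1]
    code = status & 0xFFFF
    result = {
        "flags": flags,
        "corrected": not (status >> 61) & 1,   # UC clear means corrected
        "error_code": code,
        "bus": None,
    }
    # Bus/interconnect errors: 0000 1PPT RRRR IILL
    if code & 0xF800 == 0x0800:
        req = (code >> 4) & 0xF
        result["bus"] = {
            "participation": PARTICIPATION[(code >> 9) & 0x3],
            "timeout": bool((code >> 8) & 1),
            "request": REQUEST[req] if req < len(REQUEST) else "UNKNOWN",
            "target": TARGET[(code >> 2) & 0x3],
            "level": LEVEL[code & 0x3],
        }
    return result


if __name__ == "__main__":
    # Status value from the log lines quoted above.
    print(decode_status(0x940c4001fe080813))
```

With the Status value quoted above, the UC bit comes out clear and the
low 16 bits are 0x0813, which under these assumptions decodes to a bus
error with SRC participation, an RD request, a MEM target and the
generic (LG) level. That lines up with the kernel's "COR BUSLG Source RD
Memory" line, but it only restates what the register says; it does not
by itself tell you which DIMM is involved.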

-- 
| Jeremy Chadwick                                   jdc@parodius.com |
| Parodius Networking                       http://www.parodius.com/ |
| UNIX Systems Administrator                  Mountain View, CA, USA |
| Making life hard for others since 1977.              PGP: 4BD6C0CB |