Date: Wed, 25 Aug 2010 10:11:29 +0300
From: Andriy Gapon <avg@icyb.net.ua>
To: Jeremy Chadwick <freebsd@jdc.parodius.com>
Cc: freebsd-stable <freebsd-stable@freebsd.org>, Dan Langille <dan@langille.org>
Subject: Re: kernel MCA messages
Message-ID: <4C74C221.5020702@icyb.net.ua>
In-Reply-To: <20100824233849.GA35100@icarus.home.lan>
References: <4C71CC62.6060803@langille.org> <4C745213.3050004@langille.org> <20100824233849.GA35100@icarus.home.lan>

on 25/08/2010 02:38 Jeremy Chadwick said the following:
> On Tue, Aug 24, 2010 at 07:13:23PM -0400, Dan Langille wrote:
>> On 8/22/2010 9:18 PM, Dan Langille wrote:
>>> What does this mean?
>>>
>>> kernel: MCA: Bank 4, Status 0x940c4001fe080813
>>> kernel: MCA: Global Cap 0x0000000000000105, Status 0x0000000000000000
>>> kernel: MCA: Vendor "AuthenticAMD", ID 0xf5a, APIC ID 0
>>> kernel: MCA: CPU 0 COR BUSLG Source RD Memory
>>> kernel: MCA: Address 0x7ff6b0
>>>
>>> FreeBSD 7.3-STABLE #1: Sun Aug 22 23:16:43
>>
>> FYI, these are occurring every hour, almost to the second. e.g.
>> xx:56:yy, where yy is 09, 10, or 11.
>>
>> Checking logs, I don't see anything that correlates with this point
>> in the hour (i.e 56 minutes past) that doesn't also occur at other
>> times.
>>
>> It seems very odd to occur so regularly.

I still think that everything of essence has already been said in this thread.

> 1) Why haven't you replaced the DIMM in Bank 4 -- or better yet, all

Bank 4 here is an MCA reporting bank; it has nothing to do with RAM slots.
Currently on FreeBSD we don't have a standard tool to map a physical address
to a DRAM module, but I am sure that there are ways to do it.

> the DIMMs just to be sure? Do this and see if the problem goes
> away. If not, no harm done, and you've narrowed it down.
>
> 2) What exact manufacturer and model of motherboard is this? If
> you can provide a link to a User Manual that would be great.
>
> 3) Please go into your system BIOS and find where "ECC ChipKill"
> options are available (likely under a Memory, Chipset, or
> Northbridge section). Please write down and provide here all
> of the options and what their currently selected values are.
>
> 4) Please make sure you're running the latest system BIOS. I've seen
> on certain Rackable AMD-based systems where Northbridge-related
> features don't work quite right (at least with Solaris), resulting
> in atrocious memory performance on the system. A BIOS upgrade
> solved the problem.
>
> There's a ChipKill feature called "ECC BG Scrubbing" that's vague in
> definition, given that it's a "background memory scrub" that happens at
> intervals which are unknown to me. Maybe 60 minutes? I don't know.
> This is why I ask question #3.
>
> For John and other devs: I assume the decoded MCA messages indicate with
> absolute certainty that the ECC error is coming from external DRAM and
> not, say, bad L1 or L2 cache?

Have you read the decoded message?
Please re-read it.

I still recommend reading at least the summary of the RAM ECC research article
to make your own judgment about the need to replace the DRAM.

-- 
Andriy Gapon
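
The decoded line in the quoted log ("COR BUSLG Source RD Memory") can be reproduced by hand from the raw status word 0x940c4001fe080813. What follows is a minimal sketch in Python, assuming the architectural MCi_STATUS flag bits (63 down to 57) and the bus/interconnect compound error code layout 0000 1PPT RRRR IILL in the low 16 bits. The function names, table names, and output wording are illustrative only and are not taken from the FreeBSD kernel source.

```python
# Architectural MCi_STATUS flag bits (positions in the 64-bit status word).
FLAG_BITS = {
    "valid": 63, "overflow": 62, "uncorrected": 61, "enabled": 60,
    "misc_valid": 59, "addr_valid": 58, "context_corrupt": 57,
}

# Field encodings of a bus/interconnect error code: 0000 1PPT RRRR IILL.
LEVELS = ("L0", "L1", "L2", "LG")                               # LL
PARTICIPANTS = ("Source", "Responder", "Observer", "Generic")   # PP
REQUESTS = ("ERR", "RD", "WR", "DRD", "DWR", "IRD",
            "PREFETCH", "EVICT", "SNOOP")                       # RRRR
TARGETS = ("Memory", "Reserved", "I/O", "Generic")              # II


def decode_status(status):
    """Split a raw MCi_STATUS value into flags and bus-error fields."""
    info = {name: bool((status >> bit) & 1) for name, bit in FLAG_BITS.items()}
    code = status & 0xFFFF
    info["error_code"] = code
    # A bus/interconnect error has bits 15..11 equal to 0b00001.
    info["is_bus_error"] = (code & 0xF800) == 0x0800
    if info["is_bus_error"]:
        rrrr = (code >> 4) & 0xF
        info["participant"] = PARTICIPANTS[(code >> 9) & 0x3]
        info["timeout"] = bool((code >> 8) & 0x1)
        info["request"] = REQUESTS[rrrr] if rrrr < len(REQUESTS) else "???"
        info["target"] = TARGETS[(code >> 2) & 0x3]
        info["level"] = LEVELS[code & 0x3]
    return info


def summarize(info):
    """Render the decoded fields in a form similar to the kernel log line."""
    severity = "UNCOR" if info["uncorrected"] else "COR"
    if not info["is_bus_error"]:
        return "%s code 0x%04x" % (severity, info["error_code"])
    text = "%s BUS%s %s %s %s" % (severity, info["level"], info["participant"],
                                  info["request"], info["target"])
    return text + (" timed out" if info["timeout"] else "")


if __name__ == "__main__":
    print(summarize(decode_status(0x940C4001FE080813)))
    # prints: COR BUS LG fields as "COR BUSLG Source RD Memory"
```

For the status word in this thread the sketch reports a corrected error with a valid address, classed as a bus/interconnect error on a memory read at the generic level. The bank number (4) plays no part in the decoding; it only identifies which set of MCA registers reported the event.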