From: Dan Langille
Organization: The FreeBSD Diary
Date: Tue, 24 Aug 2010 20:41:52 -0400
To: Jeremy Chadwick
Cc: freebsd-stable
Subject: Re: kernel MCA messages
Message-ID: <4C7466D0.7080200@langille.org>
In-Reply-To: <20100824233849.GA35100@icarus.home.lan>
References: <4C71CC62.6060803@langille.org> <4C745213.3050004@langille.org> <20100824233849.GA35100@icarus.home.lan>

On 8/24/2010 7:38 PM, Jeremy Chadwick wrote:
> On Tue, Aug 24, 2010
> at 07:13:23PM -0400, Dan Langille wrote:
>> On 8/22/2010 9:18 PM, Dan Langille wrote:
>>> What does this mean?
>>>
>>> kernel: MCA: Bank 4, Status 0x940c4001fe080813
>>> kernel: MCA: Global Cap 0x0000000000000105, Status 0x0000000000000000
>>> kernel: MCA: Vendor "AuthenticAMD", ID 0xf5a, APIC ID 0
>>> kernel: MCA: CPU 0 COR BUSLG Source RD Memory
>>> kernel: MCA: Address 0x7ff6b0
>>>
>>> FreeBSD 7.3-STABLE #1: Sun Aug 22 23:16:43
>>
>> FYI, these are occurring every hour, almost to the second. e.g.
>> xx:56:yy, where yy is 09, 10, or 11.
>>
>> Checking logs, I don't see anything that correlates with this point
>> in the hour (i.e 56 minutes past) that doesn't also occur at other
>> times.
>>
>> It seems very odd to occur so regularly.
>
> 1) Why haven't you replaced the DIMM in Bank 4 -- or better yet, all
>    the DIMMs just to be sure? Do this and see if the problem goes
>    away. If not, no harm done, and you've narrowed it down.

For good reason: time and distance. I've not had the time or opportunity to buy new RAM. Today is Tuesday. The problem appeared about 48 hours ago after upgrading to 8.1 stable from 7.x. The box is in Austin. I'm in Philadelphia. You know the math. ;)

When I can get the time to fly to Austin, I will if required.

I'm sorry, I'm not meaning to be flippant. I'm just glad I documented as much as I could 4 years ago.

> 2) What exact manufacturer and model of motherboard is this? If
>    you can provide a link to a User Manual that would be great.

This is a box from iXsystems that I obtained back when 6.1-RELEASE was the latest. I know it has four sticks of 2GB.

http://www.freebsddiary.org/dual-opteron.php

Sadly, many of the links are now invalid. The board is an AccelerTech ATO2161-DC, also known as a RioWorks HDAMA-G.
See also: http://www.freebsddiary.org/dual-opteron-dmidecode.txt

And we have a close up of the RAM and the m/b:

http://www.freebsddiary.org/showpicture.php?id=85
http://www.freebsddiary.org/showpicture.php?id=84

I am quite sure it's very close to this:

http://www.accelertech.com/2007/amd_mb/opteron/ato2161i-dc_pic.php

With the manual here:

http://www.accelertech.com/2007/amd_mb/opteron/ato2161i-dc_manual.php

> 3) Please go into your system BIOS and find where "ECC ChipKill"
>    options are available (likely under a Memory, Chipset, or
>    Northbridge section). Please write down and provide here all
>    of the options and what their currently selected values are.
>
> 4) Please make sure you're running the latest system BIOS. I've seen
>    on certain Rackable AMD-based systems where Northbridge-related
>    features don't work quite right (at least with Solaris), resulting
>    in atrocious memory performance on the system. A BIOS upgrade
>    solved the problem.

3 & 4 are just as hard as #1 at the moment.

> There's a ChipKill feature called "ECC BG Scrubbing" that's vague in
> definition, given that it's a "background memory scrub" that happens at
> intervals which are unknown to me. Maybe 60 minutes? I don't know.
> This is why I ask question #3.
>
> For John and other devs: I assume the decoded MCA messages indicate with
> absolute certainty that the ECC error is coming from external DRAM and
> not, say, bad L1 or L2 cache?

Nice question.

-- 
Dan Langille - http://langille.org/
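
For reference, here is a rough sketch of how the status word in the log above (0x940c4001fe080813) breaks down. It is not the kernel's own decoder. It only looks at the architectural x86 machine-check fields: the flag bits 63..57 and the low 16-bit error code in the bus/interconnect format "0000 1PPT RRRR IILL". The names decode_status and summarize are made up for this example, the vendor-specific bits in between are ignored, and the summary string only loosely mirrors the "COR BUSLG Source RD Memory" line the kernel printed.

```python
# Sketch only: decodes the architectural fields of an x86 MCi_STATUS value.
# Function names are hypothetical; vendor-specific bits are not interpreted.

# Architectural flag bits in the upper part of MCi_STATUS.
STATUS_FLAGS = {
    63: "VAL",    # register contains a valid error
    62: "OVER",   # a previous error was overwritten
    61: "UC",     # error was NOT corrected by hardware
    60: "EN",     # error reporting was enabled for this error
    59: "MISCV",  # MCi_MISC holds additional information
    58: "ADDRV",  # MCi_ADDR holds a valid address
    57: "PCC",    # processor context may be corrupt
}

# Sub-fields of the bus/interconnect error code: 0000 1PPT RRRR IILL
PARTICIPATION = ["Source", "Responder", "Observer", "Generic"]
REQUEST = ["ERR", "RD", "WR", "DRD", "DWR", "IRD", "PREFETCH", "EVICT", "SNOOP"]
MEM_OR_IO = ["Memory", "Reserved", "I/O", "Generic"]
LEVEL = ["L0", "L1", "L2", "LG"]


def decode_status(status):
    """Return the flag names and, for bus errors, the decoded sub-fields."""
    flags = [name for bit, name in sorted(STATUS_FLAGS.items(), reverse=True)
             if (status >> bit) & 1]
    code = status & 0xFFFF
    result = {"flags": flags, "corrected": "UC" not in flags,
              "error_code": code, "bus": None}
    if code & 0xF800 == 0x0800:  # top five bits are 0000 1 -> bus error
        rrrr = (code >> 4) & 0xF
        result["bus"] = {
            "participation": PARTICIPATION[(code >> 9) & 0x3],
            "timeout": bool((code >> 8) & 0x1),
            "request": REQUEST[rrrr] if rrrr < len(REQUEST) else "unknown",
            "mem_or_io": MEM_OR_IO[(code >> 2) & 0x3],
            "level": LEVEL[code & 0x3],
        }
    return result


def summarize(status):
    """One-line summary, loosely modelled on the kernel's log line."""
    d = decode_status(status)
    head = "COR" if d["corrected"] else "UNCOR"
    bus = d["bus"]
    if bus is None:
        return "%s error code 0x%04x" % (head, d["error_code"])
    return "%s BUS%s %s %s %s" % (head, bus["level"], bus["participation"],
                                  bus["request"], bus["mem_or_io"])


if __name__ == "__main__":
    status = 0x940C4001FE080813  # value from the log above
    print(decode_status(status)["flags"])
    print(summarize(status))
```

Read this way, the UC bit is clear (a corrected error), ADDRV is set (which is why an Address line was logged), and the error code is a bus-type read to memory at the generic "LG" level rather than L1 or L2. Whether that settles the DRAM-versus-cache question depends on the vendor-specific bits this sketch skips.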