From owner-freebsd-alpha  Mon Apr  1 15:19:29 2002
Delivered-To: freebsd-alpha@freebsd.org
Received: from harrier.prod.itd.earthlink.net (harrier.mail.pas.earthlink.net [207.217.120.12])
	by hub.freebsd.org (Postfix) with ESMTP id 7834937B417
	for <freebsd-alpha@freebsd.org>; Mon,  1 Apr 2002 15:19:19 -0800 (PST)
Received: from pool0467.cvx40-bradley.dialup.earthlink.net ([216.244.43.212] helo=mindspring.com)
	by harrier.prod.itd.earthlink.net with esmtp (Exim 3.33 #1)
	id 16sB52-0007Dh-00; Mon, 01 Apr 2002 15:19:17 -0800
Message-ID: <3CA8EADE.C11C8DF7@mindspring.com>
Date: Mon, 01 Apr 2002 15:18:54 -0800
From: Terry Lambert <tlambert2@mindspring.com>
X-Mailer: Mozilla 4.7 [en]C-CCK-MCD {Sony}  (Win98; U)
X-Accept-Language: en
MIME-Version: 1.0
To: Christian Weisgerber <naddy@mips.inka.de>
Cc: freebsd-alpha@freebsd.org
Subject: Re: Source of "processor correctable error"?
References: <a89rrl$2vek$1@kemoauc.mips.inka.de>
Content-Type: text/plain; charset=us-ascii
Content-Transfer-Encoding: 7bit
Sender: owner-freebsd-alpha@FreeBSD.ORG
Precedence: bulk
List-ID: <freebsd-alpha.FreeBSD.ORG>
List-Archive: <http://docs.freebsd.org/mail/> (Web Archive)
List-Help: <mailto:majordomo@FreeBSD.ORG?subject=help> (List Instructions)
List-Subscribe: <mailto:majordomo@FreeBSD.ORG?subject=subscribe%20freebsd-alpha>
List-Unsubscribe: <mailto:majordomo@FreeBSD.ORG?subject=unsubscribe%20freebsd-alpha>
X-Loop: FreeBSD.org

Christian Weisgerber wrote:
> 
> Since the weekend my PC164 has taken to almost continuously spewing
> gobs of
> 
> Warning: received processor correctable error.
> 
> In fact I first noticed this because writing the error messages to
> the serial console took so much time the machine became sluggish.
> I've switched to a graphics console now.
> 
> Anyway, is there a way to narrow down the source of the underlying
> hardware problem?  What are the candidates anyway?  On-chip cache,
> off-chip cache, main memory?

FWIW, if they are correctable, then it's complaining about
memory errors that can be corrected using the ECC bits of
ECC memory.
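
To illustrate what "correctable" means here, the following is a
toy sketch of a single-error-correcting Hamming(7,4) code. It is
an assumption-laden illustration only: real ECC memory uses wider
codes implemented in hardware, and the function names below are
made up for this example.

```python
def hamming74_encode(data_bits):
    """Encode 4 data bits as a 7-bit codeword laid out as
    positions 1..7 = p1 p2 d1 p3 d2 d3 d4."""
    d1, d2, d3, d4 = data_bits
    p1 = d1 ^ d2 ^ d4  # parity over positions 1, 3, 5, 7
    p2 = d1 ^ d3 ^ d4  # parity over positions 2, 3, 6, 7
    p3 = d2 ^ d3 ^ d4  # parity over positions 4, 5, 6, 7
    return [p1, p2, d1, p3, d2, d3, d4]


def hamming74_correct(codeword):
    """Return (corrected codeword, position of the flipped bit).

    The position is 1-based; 0 means no error was detected.
    Only a single flipped bit can be corrected by this code.
    """
    c = list(codeword)
    s1 = c[0] ^ c[2] ^ c[4] ^ c[6]
    s2 = c[1] ^ c[2] ^ c[5] ^ c[6]
    s3 = c[3] ^ c[4] ^ c[5] ^ c[6]
    position = s1 + 2 * s2 + 4 * s3  # syndrome names the bad bit
    if position:
        c[position - 1] ^= 1  # flip it back: a "corrected" error
    return c, position


stored = hamming74_encode([1, 0, 1, 1])
damaged = list(stored)
damaged[4] ^= 1  # simulate one bit flipping in memory
repaired, where = hamming74_correct(damaged)
```

A nonzero syndrome is the analogue of the event being reported:
the data was repaired, but something flipped a bit.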

There are generally three causes of this problem which I have
seen in nature:

1)	Cooling of the system is insufficient, which introduces
	thermally induced errors (fix: better cooling).

2)	The memory was being clocked faster than the speed it
	was rated to run at (fix: clock it slower or buy more
	expensive memory).

3)	The "ECC" memory was fake ECC instead of real ECC, so
	the correction codes were incorrect, either as a result
	of a cheap vendor ripping a buyer off, or a cheap buyer
	not jumpering the system to not use ECC... or the system
	not having the option to be jumpered that way (fix: use
	real ECC memory, and not forgeries).

It's always possible that you have bad RAM, or that the PCI
bus-on time is set too high in the PCI chipset for the amount
of RAM in the system, such that the DRAM refresh is delayed
enough under load that your memory starts losing bits, etc.

But while there are other possibilities, I have never seen them
personally in nature (with ECC; I've seen the DRAM refresh
starvation with an improperly BIOS-programmed Cyrix MediaGX
chipset [5532?]).

-- Terry
