From: Andriy Gapon
Date: Tue, 24 Aug 2010 23:03:26 +0300
To: Artem Belevich
Cc: Ronald Klop, freebsd-stable@freebsd.org
Subject: Re: kernel MCA messages
Message-ID: <4C74258E.2060403@icyb.net.ua>
References: <4C71CC62.6060803@langille.org> <4C71D756.5080205@langille.org>
 <4C7218D6.6090408@icyb.net.ua> <201008230820.35260.jhb@freebsd.org>
 <4C737F85.5010804@icyb.net.ua>
List-Id: Production branch of FreeBSD source code

on 24/08/2010 22:51 Artem Belevich said the following:
> IMHO the key here is whether hardware is broken or not. The only case
> where correctable ECC errors are OK is when a bit gets flipped by a
> high-energy particle.
> That's a normal but fairly rare event. If you
> get bit flips often enough that you can recall details of more than
> one of them on the same hardware, my guess would be that you're
> dealing with something else -- bad/marginal memory, signal integrity
> issues, power issues, overheating... The list continues.. In all those
> cases hardware does *not* work correctly. Whether you can (or want to)
> keep running stuff on the hardware that is broken is another question.

Have you read the article? :)
If not, read at least the summary.

> On Tue, Aug 24, 2010 at 1:15 AM, Andriy Gapon wrote:
>> on 24/08/2010 09:14 Ronald Klop said the following:
>>>
>>> A little off topic, but what is 'a low rate of corrected ECC errors'? At work
>>> one machine has them like once per day, but runs ok. Is once per day much?
>>
>> That's up to your judgment. It's like after how many remapped sectors do you
>> replace HDD.
>> You may find this interesting:
>> http://www.cs.toronto.edu/~bianca/papers/sigmetrics09.pdf
>>
>> --
>> Andriy Gapon

--
Andriy Gapon