Date: Tue, 27 Jun 2006 02:21:28 +0300 (EEST)
From: Dmitry Pryanishnikov
To: "M.Hirsch"
Cc: freebsd-stable@freebsd.org
Subject: Re: FreeBSD 6.x CVSUP today crashes with zero load ...
In-Reply-To: <44A068A7.3090403@hirsch.it>
Message-ID: <20060627020819.L3403@atlantis.atlantis.dp.ua>

On Tue, 27 Jun 2006, M.Hirsch wrote:

>> If you're using hardware w/o ECC, it just can't tell whether error present
>> or absent. So ECC _is_ the way to detect (not mask) broken hardware.
>>
> Ok, thanks.
> I think I understand the meaning of ECC now.
> So, unlike my supplier claims, ECC is not supposed to help against hardware
> failures.
> But it is the way to detect them, right?

ECC stands for Error Checking and Correction. It's a hardware feature, and
its primary task is Checking (that is, detection) of errors. It just happens
that the number of additional bits which carry the checking code is sufficient
to correct _any_ _single-bit_ data error (not mask it, but really correct it),
and to detect any double-bit error and most multi-bit errors (w/o correction).

>> Intel's ECC-capable chipset allows it. But if we're speaking about
>> production environment, such behaviour (abnormal termination on _corrected_
>> error) is unacceptable.
>
> "abnormal termination" is not only acceptable for me, it is what I am looking
> for.
> Make the node crash completely, so one of the others can take over its
> task(s).

Again, when single-bit correction has happened, it's not fake, the result is
actually correct. Why panic the machine immediately if all the data is OK?

Sincerely, Dmitry
-- 
Atlantis ISP, System Administrator
e-mail: dmitry@atlantis.dp.ua
nic-hdl: LYNX-RIPE
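
The "correct any single-bit error, detect any double-bit error" behaviour
described above can be illustrated with a toy extended Hamming(8,4) code.
This is only a sketch: the function names encode() and decode() are made up
for the example, and it protects 4 data bits with 4 check bits, whereas real
ECC memory typically uses a wider code (commonly 8 check bits per 64 data
bits) implemented in hardware.

```python
from itertools import combinations

def encode(nibble):
    """Encode 4 data bits as an 8-bit codeword (list of bits).

    Index 0 holds the overall parity bit; indices 1, 2 and 4 hold the
    Hamming check bits; indices 3, 5, 6 and 7 hold the data bits.
    """
    code = [0] * 8
    code[3], code[5], code[6], code[7] = [(nibble >> i) & 1 for i in range(4)]
    # Check bit p covers every position whose index has bit p set.
    for p in (1, 2, 4):
        code[p] = sum(code[i] for i in range(1, 8) if i & p) % 2
    # Overall parity over positions 1..7 makes the whole word even.
    code[0] = sum(code[1:]) % 2
    return code

def decode(code):
    """Return (status, nibble); status is 'ok', 'corrected' or 'uncorrectable'."""
    code = list(code)
    # The syndrome is the XOR of the indices of all set bits in 1..7;
    # it is zero for a valid codeword and equals the position of a single flip.
    syndrome = 0
    for i in range(1, 8):
        if code[i]:
            syndrome ^= i
    overall_odd = sum(code) % 2 == 1
    if syndrome == 0 and not overall_odd:
        status = "ok"
    elif overall_odd:
        # Odd number of flips: treat as a single-bit error and repair it.
        # syndrome == 0 here means the overall parity bit itself flipped.
        code[syndrome] ^= 1
        status = "corrected"
    else:
        # Non-zero syndrome with even overall parity: two bits flipped.
        return "uncorrectable", None
    nibble = code[3] | (code[5] << 1) | (code[6] << 2) | (code[7] << 3)
    return status, nibble

if __name__ == "__main__":
    word = encode(0b1011)
    word[6] ^= 1                      # simulate one flipped bit
    print(decode(word))
    word[2] ^= 1                      # simulate a second flipped bit
    print(decode(word))
```

After a single flipped bit, decode() hands back the original data with the
status "corrected", so the result really is right, which is the point made
above. After two flipped bits it can only report that the word is bad.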