Date: Fri, 20 Jul 2012 15:22:44 -0700
From: James Snow
To: Dr Josef Karthauser
Cc: "freebsd-stable@freebsd.org"
Subject: Re: Checksum errors across ZFS array
Message-ID: <20120720222244.GA18627@teardrop.org>
References: <20120719152909.GL32960@teardrop.org> <002D6A20-D2A4-4909-B2EA-3DB562326050@tao.org.uk> <20120719171548.GM32960@teardrop.org>

On Fri, Jul 20, 2012 at 04:09:28PM +0100, Dr Josef Karthauser wrote:
> Take care though, my system which had been working fine for about
> a year when I noticed the ZFS rot (which all appears to be recent
> in time). I ran memcheck+ on it for 8 hours or so, and it showed no
> errors at all. However, when I replaced the memory with a different
> vendor the problems went away. (Reboots and power off/on restarts
> hadn't fixed the problem before!).
>
> So, take care if the memory doesn't report any failures, it might
> still be faulty.

I've run memtest for about 20 hours now (13 hours in one pass, 7 and
counting on the second) and seen no errors.
Hrm.

> p.s. It was my fault that I wasn't running ECC memory on the system!

I am running ECC memory, though. If you'd had ECC memory to start with,
do you think you might have seen a different result?

In my case, replacing all the RAM and getting a second controller cost
almost the same. Since a second controller will give me the best
visibility, or long-term expandability if it turns out not to be the
controller, I've gone ahead and ordered one.

If I move half the disks to the new controller and continue to see the
problems only on the old controller, I'll know it's the controller or
the slot on the motherboard. If the problem continues without any
change, I can replace the RAM, and then the motherboard.

-Snow