From: Stephen McKay
To: Mike Tancsa
Cc: freebsd-fs@freebsd.org, Stephen McKay
Date: Fri, 11 Mar 2011 09:02:43 +1000
Subject: Re: Constant minor ZFS corruption
Message-Id: <201103102302.p2AN2hNB002016@dungeon.home>
In-Reply-To: <4D7788D9.50808@sentex.net> from Mike Tancsa at "Wed, 09 Mar 2011 09:04:09 -0500"
References: <201103081425.p28EPQtM002115@dungeon.home> <201103091241.p29CfUM1003302@dungeon.home> <4D7788D9.50808@sentex.net>

On Wednesday, 9th March 2011, Mike Tancsa wrote:
>On 3/9/2011 7:41 AM, Stephen McKay wrote:
>> Of the 12 disks, only 1 has been error-free. I've been doing this for
>> about 10 days now and there is no pattern that I can see in the errors.
>
>After adding a larger case for future expansion, we found the next day
>we were seeing all sorts of random errors
>
>Like
>
>Mar  3 05:34:47 offsite kernel: ad1: FAILURE - WRITE_DMA48
>status=51 error=10 LBA=2281852580
>
>and
>
>Mar  4 08:56:15 offsite kernel: siisch1: siis_timeout is 00040000 ss
>04000000 rs 04000000 es 00000000 sts 801e2000 serr 00000000

Our system does not report any driver errors or disk errors. We see
checksum errors from ZFS (mostly in scrubs). It's like there's an
invisible pixie sprinkling bad data on our disks while we sleep.

>We narrowed it down to 2 problems. Failing / Marginal power supply and
>bad SATA cables. After changing the power supply, we still had a few
>disks errors.

If either of these were the cause of our problem, we'd see errors logged,
right? Not just invisible corruption? We will probably swap the power
supply and cables anyway soon, just to see what happens, but on other
machines where cables or power was the problem I saw errors (just like
yours) in the logs.

>After almost 5 days of uptime, no problems at all now. Not one error.

Well, we've got something to aim for, eh? :-)

Cheers,
Stephen.
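For anyone wondering how ZFS can flag corruption that no driver ever logs, here is a toy model in Python. It is not ZFS code: the block size, the use of SHA-256, and every function name are illustrative assumptions. The idea it shows is that each block's checksum is stored apart from the block itself, so a bit that silently changes on disk raises no I/O error at all, and only a scrub that re-reads and re-verifies every block notices.

```python
import hashlib
import random

BLOCK_SIZE = 4096  # toy block size (assumption, not a ZFS default)


def checksum(data: bytes) -> bytes:
    # SHA-256 stands in for whatever checksum the pool uses (assumption).
    return hashlib.sha256(data).digest()


def write_pool(nblocks, seed=0):
    # Fill some toy "disk" blocks with random data.
    rng = random.Random(seed)
    disk = [bytearray(rng.getrandbits(8) for _ in range(BLOCK_SIZE))
            for _ in range(nblocks)]
    # Checksums live apart from the data they cover, loosely like ZFS
    # keeping a block's checksum in the pointer that references it.
    sums = [checksum(bytes(b)) for b in disk]
    return disk, sums


def silent_bitflip(disk, block, byte, bit):
    # Corrupt data with no error reported anywhere: no driver message,
    # no failed I/O, just different bytes on the disk.
    disk[block][byte] ^= 1 << bit


def scrub(disk, sums):
    # Re-read every block and return the indices whose data no longer
    # matches the stored checksum.
    return [i for i, b in enumerate(disk) if checksum(bytes(b)) != sums[i]]


if __name__ == "__main__":
    disk, sums = write_pool(8)
    print(scrub(disk, sums))       # [] : nothing wrong yet
    silent_bitflip(disk, 5, 100, 3)
    print(scrub(disk, sums))       # [5] : only the scrub notices
```

In this model a flaky cable or power supply would show up as failed reads or writes with log messages, like the `WRITE_DMA48` and `siis_timeout` lines quoted above, whereas the silent flip is visible only to the checksum comparison.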