From owner-freebsd-fs@FreeBSD.ORG  Wed Mar  9 14:04:12 2011
Return-Path: <owner-freebsd-fs@FreeBSD.ORG>
Delivered-To: freebsd-fs@freebsd.org
Received: from mx1.freebsd.org (mx1.freebsd.org [IPv6:2001:4f8:fff6::34])
	by hub.freebsd.org (Postfix) with ESMTP id C632B106564A;
	Wed,  9 Mar 2011 14:04:12 +0000 (UTC) (envelope-from mike@sentex.net)
Received: from smarthost1.sentex.ca (smarthost1-6.sentex.ca
	[IPv6:2607:f3e0:0:1::12])
	by mx1.freebsd.org (Postfix) with ESMTP id 7D0EA8FC12;
	Wed,  9 Mar 2011 14:04:12 +0000 (UTC)
Received: from [IPv6:2607:f3e0:0:4:4433:c074:8d7b:b33d]
	([IPv6:2607:f3e0:0:4:4433:c074:8d7b:b33d])
	by smarthost1.sentex.ca (8.14.4/8.14.4) with ESMTP id p29E4Alk016380
	(version=TLSv1/SSLv3 cipher=DHE-RSA-CAMELLIA256-SHA bits=256 verify=NO);
	Wed, 9 Mar 2011 09:04:10 -0500 (EST) (envelope-from mike@sentex.net)
Message-ID: <4D7788D9.50808@sentex.net>
Date: Wed, 09 Mar 2011 09:04:09 -0500
From: Mike Tancsa <mike@sentex.net>
Organization: Sentex Communications
User-Agent: Mozilla/5.0 (Windows; U; Windows NT 6.1; en-US;
	rv:1.9.2.13) Gecko/20101207 Thunderbird/3.1.7
MIME-Version: 1.0
To: Stephen McKay <mckay@freebsd.org>
References: <201103081425.p28EPQtM002115@dungeon.home>	<BEBC15BA440AB24484C067A3A9D38D7E014DA66584F0@server7.acsi.ca>
	<201103091241.p29CfUM1003302@dungeon.home>
In-Reply-To: <201103091241.p29CfUM1003302@dungeon.home>
X-Enigmail-Version: 1.1.1
Content-Type: text/plain; charset=ISO-8859-1
Content-Transfer-Encoding: 7bit
X-Scanned-By: MIMEDefang 2.67 on IPv6:2607:f3e0:0:1::12
Cc: freebsd-fs@freebsd.org
Subject: Re: Constant minor ZFS corruption

On 3/9/2011 7:41 AM, Stephen McKay wrote:
> On Tuesday, 8th March 2011, Chris Forgeron wrote:
> 
>> Have you make sure it's not always the same drives with the checksum
>> errors? It make take a few days to know for sure..
> 
> Of the 12 disks, only 1 has been error-free.  I've been doing this for
> about 10 days now and there is no pattern that I can see in the errors.
> 

We went through something similar on our offsite/DR backup server just
last week. I don't have as many disks as you, but here is the pool:

0(offsite)# zpool status
  pool: tank1
 state: ONLINE
 scrub: none requested
config:

        NAME        STATE     READ WRITE CKSUM
        tank1       ONLINE       0     0     0
          raidz1    ONLINE       0     0     0
            ad0     ONLINE       0     0     0
            ada4    ONLINE       0     0     0
            ad4     ONLINE       0     0     0
            ad6     ONLINE       0     0     0
          raidz1    ONLINE       0     0     0
            ada0    ONLINE       0     0     0
            ada1    ONLINE       0     0     0
            ada2    ONLINE       0     0     0
            ada3    ONLINE       0     0     0
          raidz1    ONLINE       0     0     0
            ada5    ONLINE       0     0     0
            ada8    ONLINE       0     0     0
            ada7    ONLINE       0     0     0
            ada6    ONLINE       0     0     0

errors: No known data errors
0(offsite)#
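
For anyone who wants to keep an eye on those counters from a script,
here is a rough sketch in Python. It only parses text laid out like the
"zpool status" config table above and flags any row that is not ONLINE
or has a non-zero READ/WRITE/CKSUM count. The names parse_zpool_status,
rows_with_errors and SAMPLE are made up for illustration, and it assumes
the counters are plain integers (rows with abbreviated counts would be
skipped), so treat it as a starting point rather than a finished tool.

```python
import re

# One row of the config table: name, state, then READ WRITE CKSUM counters.
# Assumption: counters are plain integers, as in the output shown above.
ROW = re.compile(
    r"^\s*(\S+)\s+(ONLINE|DEGRADED|FAULTED|OFFLINE|UNAVAIL|REMOVED)"
    r"\s+(\d+)\s+(\d+)\s+(\d+)\s*$"
)


def parse_zpool_status(text):
    """Return a list of (name, state, read, write, cksum) tuples.

    A list is used rather than a dict because vdev names such as
    "raidz1" can repeat within one pool.
    """
    rows = []
    for line in text.splitlines():
        m = ROW.match(line)
        if m:
            name, state, rd, wr, ck = m.groups()
            rows.append((name, state, int(rd), int(wr), int(ck)))
    return rows


def rows_with_errors(rows):
    """Keep rows that are not ONLINE or have any non-zero counter."""
    return [r for r in rows if r[1] != "ONLINE" or any(r[2:])]


# A few lines copied from the status output above.
SAMPLE = """\
        NAME        STATE     READ WRITE CKSUM
        tank1       ONLINE       0     0     0
          raidz1    ONLINE       0     0     0
            ad0     ONLINE       0     0     0
            ada4    ONLINE       0     0     0
"""

if __name__ == "__main__":
    rows = parse_zpool_status(SAMPLE)
    print(len(rows), "rows;", len(rows_with_errors(rows)), "with errors")
```

In practice the text would come from running "zpool status" and
capturing its output; the sample here is just pasted in so the sketch
runs on its own.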


After adding a larger case for future expansion, we found the next day
that we were seeing all sorts of random errors, like

Mar  3 05:34:47 offsite kernel: ad1: FAILURE - WRITE_DMA48
status=51<READY,DSC,ERROR> error=10<NID_NOT_FOUND> LBA=2281852580
Mar  3 06:11:59 offsite kernel: ad1: TIMEOUT - WRITE_DMA48 retrying (1
retry left) LBA=2292675553
Mar  3 06:11:59 offsite kernel: ad1: FAILURE - WRITE_DMA48
status=51<READY,DSC,ERROR> error=10<NID_NOT_FOUND> LBA=2292675553
Mar  3 06:23:54 offsite kernel: ad1: TIMEOUT - WRITE_DMA48 retrying (1
retry left) LBA=2292734035
Mar  3 06:23:54 offsite kernel: ad1: FAILURE - WRITE_DMA48
status=51<READY,DSC,ERROR> error=10<NID_NOT_FOUND> LBA=2292734035

and

Mar  4 08:56:15 offsite kernel: siisch1: siis_timeout is 00040000 ss
04000000 rs 04000000 es 00000000 sts 801e2000 serr 00000000
Mar  4 09:18:33 offsite kernel: siisch1: Timeout on slot 26
Mar  4 09:18:33 offsite kernel: siisch1: siis_timeout is 00040000 ss
04000000 rs 04000000 es 00000000 sts 801b2000 serr 00000000
Mar  4 09:21:09 offsite kernel: siisch1: Timeout on slot 26
Mar  4 09:21:09 offsite kernel: siisch1: siis_timeout is 00040000 ss
04000000 rs 04000000 es 00000000 sts 801d2000 serr 00000000
Mar  4 09:22:44 offsite kernel: siisch1: Timeout on slot 26
Mar  4 09:22:44 offsite kernel: siisch1: siis_timeout is 00040000 ss
04000000 rs 04000000 es 00000000 sts 801d2000 serr 00000000
Mar  4 09:23:16 offsite kernel: siisch1: Timeout on slot 30
Mar  4 09:23:16 offsite kernel: siisch1: siis_timeout is 00040000 ss
40000000 rs 40000000 es 00000000 sts 801a2000 serr 00000000

These showed up on multiple disks and on multiple controllers. I have
disks off the motherboard and off 2 port multipliers (PMPs) on an
sil3124 controller.
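
To check whether errors like these are confined to one device or spread
across several, a quick tally per device helps. The following is a
minimal Python sketch, assuming each kernel message sits on a single
line (as it would in the log file, not wrapped as in this mail) and
looks like the ones quoted above. The function name tally_errors and
the LINES sample are only for illustration.

```python
import re
from collections import Counter

# Matches "kernel: <device>: FAILURE|TIMEOUT|Timeout ..." as in the
# messages above. The "siis_timeout is ..." register dumps are
# deliberately not counted, since each follows a "Timeout on slot" line.
EVENT = re.compile(r"kernel: (\w+): (FAILURE|TIMEOUT|Timeout)\b")


def tally_errors(lines):
    """Count FAILURE/TIMEOUT events per device name."""
    counts = Counter()
    for line in lines:
        m = EVENT.search(line)
        if m:
            counts[m.group(1)] += 1
    return counts


# Messages taken from the logs above, unwrapped to one line each.
LINES = [
    "Mar  3 05:34:47 offsite kernel: ad1: FAILURE - WRITE_DMA48 "
    "status=51<READY,DSC,ERROR> error=10<NID_NOT_FOUND> LBA=2281852580",
    "Mar  3 06:11:59 offsite kernel: ad1: TIMEOUT - WRITE_DMA48 retrying "
    "(1 retry left) LBA=2292675553",
    "Mar  4 09:18:33 offsite kernel: siisch1: Timeout on slot 26",
    "Mar  4 09:18:33 offsite kernel: siisch1: siis_timeout is 00040000 ss "
    "04000000 rs 04000000 es 00000000 sts 801b2000 serr 00000000",
]

if __name__ == "__main__":
    for device, count in sorted(tally_errors(LINES).items()):
        print(device, count)
```

If the counts are spread over many devices on different controllers,
that points away from a single bad disk and toward something shared,
such as power or cabling.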

We narrowed it down to 2 problems: a failing/marginal power supply and
bad SATA cables. After changing the power supply, we still had a few
disk errors.

smartctl reported no errors on any of the disks. We then changed the
SATA cables, and the remaining errors went away too.

After almost 5 days of uptime, no problems at all now.  Not one error.

	---Mike


-------------------
Mike Tancsa, tel +1 519 651 3400
Sentex Communications, mike@sentex.net
Providing Internet services since 1994 www.sentex.net
Cambridge, Ontario Canada   http://www.tancsa.com/