From: Mike Tancsa
Organization: Sentex Communications
Date: Wed, 09 Mar 2011 09:04:09 -0500
To: Stephen McKay
Cc: freebsd-fs@freebsd.org
Subject: Re: Constant minor ZFS corruption
Message-ID: <4D7788D9.50808@sentex.net>
In-Reply-To: <201103091241.p29CfUM1003302@dungeon.home>

On 3/9/2011 7:41 AM, Stephen McKay wrote:
> On Tuesday, 8th March 2011, Chris Forgeron wrote:
>
>> Have you made sure it's not always the same drives with the checksum
>> errors? It may take a few days to know for sure.
>
> Of the 12 disks, only 1 has been error-free.
> I've been doing this for about 10 days now and there is no pattern
> that I can see in the errors.
>

We went through something similar on our offsite/DR backup server just
last week. I don't have as many disks as you, but

0(offsite)# zpool status
  pool: tank1
 state: ONLINE
 scrub: none requested
config:

        NAME        STATE     READ WRITE CKSUM
        tank1       ONLINE       0     0     0
          raidz1    ONLINE       0     0     0
            ad0     ONLINE       0     0     0
            ada4    ONLINE       0     0     0
            ad4     ONLINE       0     0     0
            ad6     ONLINE       0     0     0
          raidz1    ONLINE       0     0     0
            ada0    ONLINE       0     0     0
            ada1    ONLINE       0     0     0
            ada2    ONLINE       0     0     0
            ada3    ONLINE       0     0     0
          raidz1    ONLINE       0     0     0
            ada5    ONLINE       0     0     0
            ada8    ONLINE       0     0     0
            ada7    ONLINE       0     0     0
            ada6    ONLINE       0     0     0

errors: No known data errors
0(offsite)#

After adding a larger case for future expansion, we found the next day
that we were seeing all sorts of random errors, like

Mar 3 05:34:47 offsite kernel: ad1: FAILURE - WRITE_DMA48 status=51 error=10 LBA=2281852580
Mar 3 06:11:59 offsite kernel: ad1: TIMEOUT - WRITE_DMA48 retrying (1 retry left) LBA=2292675553
Mar 3 06:11:59 offsite kernel: ad1: FAILURE - WRITE_DMA48 status=51 error=10 LBA=2292675553
Mar 3 06:23:54 offsite kernel: ad1: TIMEOUT - WRITE_DMA48 retrying (1 retry left) LBA=2292734035
Mar 3 06:23:54 offsite kernel: ad1: FAILURE - WRITE_DMA48 status=51 error=10 LBA=2292734035

and

Mar 4 08:56:15 offsite kernel: siisch1: siis_timeout is 00040000 ss 04000000 rs 04000000 es 00000000 sts 801e2000 serr 00000000
Mar 4 09:18:33 offsite kernel: siisch1: Timeout on slot 26
Mar 4 09:18:33 offsite kernel: siisch1: siis_timeout is 00040000 ss 04000000 rs 04000000 es 00000000 sts 801b2000 serr 00000000
Mar 4 09:21:09 offsite kernel: siisch1: Timeout on slot 26
Mar 4 09:21:09 offsite kernel: siisch1: siis_timeout is 00040000 ss 04000000 rs 04000000 es 00000000 sts 801d2000 serr 00000000
Mar 4 09:22:44 offsite kernel: siisch1: Timeout on slot 26
Mar 4 09:22:44 offsite kernel: siisch1: siis_timeout is 00040000 ss 04000000 rs 04000000 es 00000000 sts 801d2000 serr 00000000
Mar 4 09:23:16 offsite kernel: siisch1: Timeout on slot 30
Mar 4 09:23:16 offsite kernel: siisch1: siis_timeout is 00040000 ss 40000000 rs 40000000 es 00000000 sts 801a2000 serr 00000000

on multiple disks and on multiple controllers. I have disks off the
motherboard and off 2 port multipliers (PMPs) on an sil3124 controller.

We narrowed it down to 2 problems: a failing/marginal power supply and
bad SATA cables. After changing the power supply, we still had a few
disk errors, even though smartctl reported no errors on any of the
disks. We changed the SATA cables, and those errors went away too.
After almost 5 days of uptime, there are no problems at all now. Not
one error.

        ---Mike

-------------------
Mike Tancsa, tel +1 519 651 3400
Sentex Communications, mike@sentex.net
Providing Internet services since 1994 www.sentex.net
Cambridge, Ontario Canada   http://www.tancsa.com/