From: Dr Joe Karthauser
Date: Thu, 19 Jul 2012 18:05:32 +0100
To: James Snow
Cc: "freebsd-stable@freebsd.org"
Subject: Re: Checksum errors across ZFS array
Message-Id: <002D6A20-D2A4-4909-B2EA-3DB562326050@tao.org.uk>
In-Reply-To: <20120719152909.GL32960@teardrop.org>

Hi James,

It's almost definitely a memory problem. I'd change it ASAP if I were you.
I lost about 70 MB from my ZFS pool for this very reason just a few weeks
ago. Luckily I had enough snapshots from before the rot set in to recover
most of what I lost.

Joe

-- 
Dr Joe Karthauser

On 19 Jul 2012, at 16:29, James Snow wrote:

> I have a ZFS server on which I've seen periodic checksum errors on
> almost every drive. While scrubbing the pool last night, it began to
> report unrecoverable data errors on a single file.
>
> I compared an md5 of the supposedly corrupted file to an md5 of the
> original copy, stored on different media. They were the same, suggesting
> no corruption.
>
> A large file was being written to the pool while the scrub was in
> progress, and the entire array became unresponsive. The OS was still up,
> but 'zpool status' showed the scrub progress stuck at the same spot,
> with the throughput rate falling. 'shutdown -r now' stalled. Eventually
> I hard power cycled the system.
>
> Now, attempting to read the file that ZFS reports errors on yields
> "Input/output error." The scrub completed, with the following result:
>
>         NAME         STATE     READ WRITE CKSUM
>         tank         ONLINE       0     0     7
>           mirror-0   ONLINE       0     0     0
>             aacd0p1  ONLINE       0     0     0
>             aacd4p1  ONLINE       0     0     1
>           mirror-1   ONLINE       0     0     0
>             aacd1p1  ONLINE       0     0     0
>             aacd5p1  ONLINE       0     0     0
>           mirror-2   ONLINE       0     0    14
>             aacd2p1  ONLINE       0     0    14
>             aacd6p1  ONLINE       0     0    14
>           mirror-3   ONLINE       0     0     0
>             aacd3p1  ONLINE       0     0     0
>             aacd7p1  ONLINE       0     0     0
>
> The system configuration is as follows:
>
> Controller:  Adaptec 2805
> Motherboard: Supermicro X8STE
> Drive Cage:  2x Supermicro CSE-M35T-1
> Memory:      2x Kingston 12GB ECC (KVR1066D3E7SK3/12G)
> PSU:         Nexus RX-7000
> OS:          9.0-RELEASE-p3
> ZFS:         ZFS filesystem version 5, ZFS storage pool version 28
>
>
> The Adaptec card has 2 ports, each of which uses a 4-port fan-out cable.
> The cables are routed as shown:
>
>          /--- aacd0 (ST1000DM003-9YN1 CC4D)
>         / /-- aacd1 (ST1000DM003-9YN1 CC4D)
> p1-----
>         \ \-- aacd2 (WDC WD1001FALS-0 05.0)
>          \--- aacd3 (WDC WD1001FALS-0 05.0)
>
>          /--- aacd4 (ST1000DM003-9YN1 CC4D)
>         / /-- aacd5 (ST1000DM003-9YN1 CC4D)
> p2-----
>         \ \-- aacd6 (WDC WD1002FAEX-0 05.0)
>          \--- aacd7 (WDC WD1002FAEX-0 05.0)
>
> You can see that each ZFS mirror device is comprised of one drive from
> each drive carrier, on separate ports, on separate cables.
>
> Since I have seen periodic checksum errors on almost every drive but the
> only common components are the Adaptec controller and the motherboard, I
> suspect the controller. (Or the motherboard, but I'm starting with the
> controller since it's much simpler to swap out.)
>
> Could it be something else? What else should I be looking at? Any input
> greatly appreciated.
>
>
> -Snow
>
> _______________________________________________
> freebsd-stable@freebsd.org mailing list
> http://lists.freebsd.org/mailman/listinfo/freebsd-stable
> To unsubscribe, send any mail to "freebsd-stable-unsubscribe@freebsd.org"
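
For anyone wanting to track which devices are accumulating checksum
errors between scrubs, the 'zpool status' table quoted above can be
filtered mechanically. What follows is only a rough sketch: the names
SAMPLE, ROW and cksum_errors are made up for illustration, the sample
text is copied from the quoted output, and the pattern assumes the
counters are printed as plain integers (it would need extending if they
appear in abbreviated form).

```python
import re

# Sample copied from the 'zpool status' table quoted in this thread.
SAMPLE = """\
NAME         STATE     READ WRITE CKSUM
tank         ONLINE       0     0     7
  mirror-0   ONLINE       0     0     0
    aacd0p1  ONLINE       0     0     0
    aacd4p1  ONLINE       0     0     1
  mirror-1   ONLINE       0     0     0
    aacd1p1  ONLINE       0     0     0
    aacd5p1  ONLINE       0     0     0
  mirror-2   ONLINE       0     0    14
    aacd2p1  ONLINE       0     0    14
    aacd6p1  ONLINE       0     0    14
  mirror-3   ONLINE       0     0     0
    aacd3p1  ONLINE       0     0     0
    aacd7p1  ONLINE       0     0     0
"""

# Assumed row layout: name, state, then three plain-integer counters.
# The header row does not match because its last columns are not digits.
ROW = re.compile(r"^\s*(\S+)\s+(\S+)\s+(\d+)\s+(\d+)\s+(\d+)\s*$")


def cksum_errors(status_text):
    """Return {name: count} for every table row with a nonzero CKSUM."""
    errors = {}
    for line in status_text.splitlines():
        match = ROW.match(line)
        if match is None:
            continue  # header or any other non-table line
        name, _state, _read, _write, cksum = match.groups()
        if int(cksum) > 0:
            errors[name] = int(cksum)
    return errors


if __name__ == "__main__":
    for name, count in cksum_errors(SAMPLE).items():
        print(f"{name}: {count}")
```

On the quoted table this lists the pool itself, aacd4p1, mirror-2 and
both of its members. aacd2 and aacd6 hang off different controller ports
in the cabling diagram, so the errors are not confined to one cable.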