Date: Thu, 19 Jul 2012 18:05:32 +0100
From: Dr Joe Karthauser <joe@tao.org.uk>
To: James Snow <snow@teardrop.org>
Cc: "freebsd-stable@freebsd.org" <freebsd-stable@freebsd.org>
Subject: Re: Checksum errors across ZFS array
Message-ID: <002D6A20-D2A4-4909-B2EA-3DB562326050@tao.org.uk>
In-Reply-To: <20120719152909.GL32960@teardrop.org>
References: <20120719152909.GL32960@teardrop.org>

Hi James,

It's almost certainly a memory problem. I'd replace it ASAP if I were you.

I lost about 70 MB from my ZFS pool for this very reason just a few weeks ago.
Luckily I had enough snapshots from before the rot set in to recover most of
what I lost.

Joe

-- 
Dr Joe Karthauser

On 19 Jul 2012, at 16:29, James Snow <snow@teardrop.org> wrote:

> I have a ZFS server on which I've seen periodic checksum errors on
> almost every drive. While scrubbing the pool last night, it began to
> report unrecoverable data errors on a single file.
>
> I compared an md5 of the supposedly corrupted file to an md5 of the
> original copy, stored on different media. They were the same, suggesting
> no corruption.
>
> A large file was being written to the pool while the scrub was in
> progress, and the entire array became unresponsive. The OS was still up,
> but 'zpool status' showed the scrub progress stuck at the same spot,
> with the throughput rate falling. 'shutdown -r now' stalled. Eventually
> I hard power cycled the system.
>
> Now, attempting to read the file that ZFS reports errors on yields
> "Input/output error." The scrub completed, with the following result:
>
>         NAME         STATE     READ WRITE CKSUM
>         tank         ONLINE       0     0     7
>           mirror-0   ONLINE       0     0     0
>             aacd0p1  ONLINE       0     0     0
>             aacd4p1  ONLINE       0     0     1
>           mirror-1   ONLINE       0     0     0
>             aacd1p1  ONLINE       0     0     0
>             aacd5p1  ONLINE       0     0     0
>           mirror-2   ONLINE       0     0    14
>             aacd2p1  ONLINE       0     0    14
>             aacd6p1  ONLINE       0     0    14
>           mirror-3   ONLINE       0     0     0
>             aacd3p1  ONLINE       0     0     0
>             aacd7p1  ONLINE       0     0     0
>
> The system configuration is as follows:
>
>         Controller:  Adaptec 2805
>         Motherboard: Supermicro X8STE
>         Drive Cage:  2x Supermicro CSE-M35T-1
>         Memory:      2x Kingston 12GB ECC (KVR1066D3E7SK3/12G)
>         PSU:         Nexus RX-7000
>         OS:          9.0-RELEASE-p3
>         ZFS:         ZFS filesystem version 5, ZFS storage pool version 28
>
>
> The Adaptec card has 2 ports, each of which uses a 4-port fan-out cable.
> The cables are routed as shown:
>
>          /--- aacd0 (ST1000DM003-9YN1 CC4D)
>         / /-- aacd1 (ST1000DM003-9YN1 CC4D)
>   p1-----
>         \ \-- aacd2 (WDC WD1001FALS-0 05.0)
>          \--- aacd3 (WDC WD1001FALS-0 05.0)
>
>          /--- aacd4 (ST1000DM003-9YN1 CC4D)
>         / /-- aacd5 (ST1000DM003-9YN1 CC4D)
>   p2-----
>         \ \-- aacd6 (WDC WD1002FAEX-0 05.0)
>          \--- aacd7 (WDC WD1002FAEX-0 05.0)
>
> You can see that each ZFS mirror device is comprised of one drive from
> each drive carrier, on separate ports, on separate cables.
>
> Since I have seen periodic checksum errors on almost every drive but the
> only common components are the Adaptec controller and the motherboard, I
> suspect the controller. (Or the motherboard, but I'm starting with the
> controller since it's much simpler to swap out.)
>
> Could it be something else? What else should I be looking at? Any input
> greatly appreciated.
>
>
> -Snow
>
> _______________________________________________
> freebsd-stable@freebsd.org mailing list
> http://lists.freebsd.org/mailman/listinfo/freebsd-stable
> To unsubscribe, send any mail to "freebsd-stable-unsubscribe@freebsd.org"
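
The md5 comparison described in the quoted message (hashing the suspect file and the original copy on other media, then comparing the digests) can be sketched in Python as below. This is a minimal illustration, not what the poster actually ran: the function names `file_digest` and `same_content` are hypothetical, and the chunk size is an arbitrary choice. Reading a file that ZFS has flagged may fail with an I/O error, which here would surface as an `OSError` from `read`.

```python
import hashlib


def file_digest(path, algorithm="md5", chunk_size=1024 * 1024):
    """Return the hex digest of a file, read in fixed-size chunks.

    Chunked reads keep memory use flat for large files. An unreadable
    block (e.g. "Input/output error") raises OSError to the caller.
    """
    h = hashlib.new(algorithm)
    with open(path, "rb") as f:
        # iter() with a sentinel stops when read() returns b"" at EOF.
        for chunk in iter(lambda: f.read(chunk_size), b""):
            h.update(chunk)
    return h.hexdigest()


def same_content(path_a, path_b, algorithm="md5"):
    """True if both files hash to the same digest (hypothetical helper)."""
    return file_digest(path_a, algorithm) == file_digest(path_b, algorithm)
```

A call such as `same_content("/tank/data/file", "/mnt/backup/file")` (both paths are placeholders) returns `True` when the digests match. A match only shows that the bytes returned by that particular read were identical to the reference copy.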