Date: Thu, 19 Jul 2012 08:29:09 -0700
From: James Snow
To: freebsd-stable@freebsd.org
Message-ID: <20120719152909.GL32960@teardrop.org>
Subject: Checksum errors across ZFS array

I have a ZFS server on which I've seen periodic checksum errors on
almost every drive. While scrubbing the pool last night, it began to
report unrecoverable data errors on a single file.

I compared an md5 of the supposedly corrupted file to an md5 of the
original copy, stored on different media. They were the same,
suggesting no corruption.

A large file was being written to the pool while the scrub was in
progress, and the entire array became unresponsive. The OS was still
up, but 'zpool status' showed the scrub progress stuck at the same
spot, with the throughput rate falling. 'shutdown -r now' stalled.
Eventually I hard power cycled the system.

Now, attempting to read the file that ZFS reports errors on yields
"Input/output error."
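
For reference, the md5 comparison described above can be sketched roughly as follows. This is only an illustration, not the commands actually run on this system; the function names and the example paths are hypothetical.

```python
import hashlib


def md5_of(path, chunk_size=1 << 20):
    """Return the hex MD5 digest of a file, reading it in 1 MiB chunks
    so that large files do not have to fit in memory."""
    digest = hashlib.md5()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def same_content(path_a, path_b):
    """True if both files hash to the same MD5 digest."""
    return md5_of(path_a) == md5_of(path_b)


# Hypothetical usage (paths are placeholders, not from the original report):
#   same_content("/tank/data/file.bin", "/mnt/backup/file.bin")
```

A matching digest only says that the bytes read back at that moment were the same as the reference copy; it does not show that every copy of every block on the mirror is intact.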

The scrub completed, with the following result:

        NAME         STATE     READ WRITE CKSUM
        tank         ONLINE       0     0     7
          mirror-0   ONLINE       0     0     0
            aacd0p1  ONLINE       0     0     0
            aacd4p1  ONLINE       0     0     1
          mirror-1   ONLINE       0     0     0
            aacd1p1  ONLINE       0     0     0
            aacd5p1  ONLINE       0     0     0
          mirror-2   ONLINE       0     0    14
            aacd2p1  ONLINE       0     0    14
            aacd6p1  ONLINE       0     0    14
          mirror-3   ONLINE       0     0     0
            aacd3p1  ONLINE       0     0     0
            aacd7p1  ONLINE       0     0     0

The system configuration is as follows:

Controller:  Adaptec 2805
Motherboard: Supermicro X8STE
Drive Cage:  2x Supermicro CSE-M35T-1
Memory:      2x Kingston 12GB ECC (KVR1066D3E7SK3/12G)
PSU:         Nexus RX-7000
OS:          9.0-RELEASE-p3
ZFS:         ZFS filesystem version 5, ZFS storage pool version 28

The Adaptec card has 2 ports, each of which uses a 4-port fan-out
cable. The cables are routed as shown:

         /--- aacd0 (ST1000DM003-9YN1 CC4D)
        / /-- aacd1 (ST1000DM003-9YN1 CC4D)
p1-----
        \ \-- aacd2 (WDC WD1001FALS-0 05.0)
         \--- aacd3 (WDC WD1001FALS-0 05.0)

         /--- aacd4 (ST1000DM003-9YN1 CC4D)
        / /-- aacd5 (ST1000DM003-9YN1 CC4D)
p2-----
        \ \-- aacd6 (WDC WD1002FAEX-0 05.0)
         \--- aacd7 (WDC WD1002FAEX-0 05.0)

You can see that each ZFS mirror device is composed of one drive from
each drive carrier, on separate ports, on separate cables.

Since I have seen periodic checksum errors on almost every drive, and
the only components common to all of them are the Adaptec controller
and the motherboard, I suspect the controller. (Or the motherboard,
but I'm starting with the controller since it's much simpler to swap
out.)

Could it be something else? What else should I be looking at?

Any input greatly appreciated.


-Snow