From: Bakul Shah
To: Joe Peterson
Cc: freebsd-fs@freebsd.org
Subject: Re: Forcing full file read in ZFS even when checksum error encountered
In-reply-to: Your message of "Tue, 05 Feb 2008 10:38:23 MST." <47A89F0F.1030505@skyrush.com>
Date: Tue, 05 Feb 2008 11:09:45 -0800
Message-Id: <20080205190946.3D69C5B59@mail.bitblocks.com>

> I've checked SMART - no [unrecoverable] errors and no additional sector
> reallocations, and I've done a SeaTools long test - no problems found.
>
> But I do not understand: in zpool status, there are stats on read errors in
> addition to checksum errors. If I understand correctly, a read error would be
> the system/HW reporting an error on read, whereas the whole idea of the
> checksums in ZFS is to catch errors that are *not* reported as read errors
> (i.e. silent bit changes that normal filesystems would never catch). What I
> seem to be seeing is a case in which ZFS says the checksum is wrong. There
> are only counts in the CKSUM col, not the other cols in the status, so I do
> not think this is a "read error" - it is ZFS's last line of defense (the
> checksum) reporting a mismatch.
>
> In other words, I assume the read would complete if ZFS did not catch the
> checksum mismatch, and what I'd like to do is let it complete so I can see for
> myself where these bit errors are by comparing the read file to a known good
> copy (that I have). If there are no mismatches, it would mean there is a
> metadata error or ZFS bug.

It could also be a memory error of some sort. Does your system have
ECC memory? Also note that standalone tests do not seem to catch all
the errors that heavy use of Unix can sometimes trigger on a
marginal system.

But I agree with you that it would be useful to have a debug mode
where you can get at the data even if it is bad (and a test mode
where you can write bad data on purpose :-).

[A long rant on writing testable code deleted]

You have access to the ZFS sources! At the very least you can add
code to report the bad checksum & offset and see if it matches the
checksum of the same block(s) in your known good copy.
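
Short of patching ZFS itself, the comparison against the known good copy
can be roughed out in userland. The sketch below is only an illustration,
not anything from the ZFS sources: the function name compare_blocks is made
up, the 128 KiB block size assumes the default ZFS recordsize, and it
compares raw bytes rather than ZFS's own block checksums. It walks both
files block by block and reports the offset of every block that either
fails to read or differs, along with the number of flipped bits.

```python
import os

# Assumption: 128 KiB matches the dataset's recordsize (the ZFS default).
BLOCK_SIZE = 128 * 1024


def compare_blocks(suspect_path, good_path, block_size=BLOCK_SIZE):
    """Compare two files block by block.

    Returns a list of (offset, status, detail) tuples, one per bad block:
      (offset, "read-error", message)  the suspect block could not be read
      (offset, "mismatch", nbits)      the block differs in nbits bits
    Only the overlapping bytes are counted if a block is short.
    """
    report = []
    fd_suspect = os.open(suspect_path, os.O_RDONLY)
    fd_good = os.open(good_path, os.O_RDONLY)
    try:
        size = os.fstat(fd_good).st_size
        for offset in range(0, size, block_size):
            good = os.pread(fd_good, block_size, offset)
            try:
                suspect = os.pread(fd_suspect, block_size, offset)
            except OSError as exc:
                # Record the failing offset and move on to the next block.
                report.append((offset, "read-error", exc.strerror))
                continue
            if suspect != good:
                flipped = sum(bin(a ^ b).count("1")
                              for a, b in zip(suspect, good))
                report.append((offset, "mismatch", flipped))
    finally:
        os.close(fd_suspect)
        os.close(fd_good)
    return report
```

If the pool refuses to return a damaged block, the read of that block
should fail with an error and show up here as a "read-error" entry, which
at least pins down the offset. A "mismatch" entry with a count of 1 or 2
bits would point toward memory or a marginal bus rather than a bad sector.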