From owner-freebsd-fs@FreeBSD.ORG  Tue Feb  5 19:31:35 2008
Return-Path: <owner-freebsd-fs@FreeBSD.ORG>
Delivered-To: freebsd-fs@freebsd.org
Received: from mx1.freebsd.org (mx1.freebsd.org [IPv6:2001:4f8:fff6::34])
	by hub.freebsd.org (Postfix) with ESMTP id 36F2D16A418
	for <freebsd-fs@freebsd.org>; Tue,  5 Feb 2008 19:31:35 +0000 (UTC)
	(envelope-from bakul@bitblocks.com)
Received: from mail.bitblocks.com (ns1.bitblocks.com [64.142.15.60])
	by mx1.freebsd.org (Postfix) with ESMTP id 136F413C467
	for <freebsd-fs@freebsd.org>; Tue,  5 Feb 2008 19:31:34 +0000 (UTC)
	(envelope-from bakul@bitblocks.com)
Received: from bitblocks.com (localhost.bitblocks.com [127.0.0.1])
	by mail.bitblocks.com (Postfix) with ESMTP id 3D69C5B59;
	Tue,  5 Feb 2008 11:09:46 -0800 (PST)
To: Joe Peterson <joe@skyrush.com>
In-reply-to: Your message of "Tue, 05 Feb 2008 10:38:23 MST."
	<47A89F0F.1030505@skyrush.com> 
Date: Tue, 05 Feb 2008 11:09:45 -0800
From: Bakul Shah <bakul@bitblocks.com>
Message-Id: <20080205190946.3D69C5B59@mail.bitblocks.com>
Cc: freebsd-fs@freebsd.org
Subject: Re: Forcing full file read in ZFS even when checksum error
	encountered 
X-BeenThere: freebsd-fs@freebsd.org
X-Mailman-Version: 2.1.5
Precedence: list
List-Id: Filesystems <freebsd-fs.freebsd.org>
List-Unsubscribe: <http://lists.freebsd.org/mailman/listinfo/freebsd-fs>,
	<mailto:freebsd-fs-request@freebsd.org?subject=unsubscribe>
List-Archive: <http://lists.freebsd.org/pipermail/freebsd-fs>
List-Post: <mailto:freebsd-fs@freebsd.org>
List-Help: <mailto:freebsd-fs-request@freebsd.org?subject=help>
List-Subscribe: <http://lists.freebsd.org/mailman/listinfo/freebsd-fs>,
	<mailto:freebsd-fs-request@freebsd.org?subject=subscribe>
X-List-Received-Date: Tue, 05 Feb 2008 19:31:35 -0000

> I've checked SMART - no [unrecoverable] errors and no additional sector
> reallocations, and I've done a SeaTools long test - no problems found.
> 
> But I do not understand: in zpool status, there are stats on read errors in
> addition to checksum errors.  If I understand correctly, a read error would be
> the system/HW reporting an error on read, whereas the whole idea of the
> checksums in ZFS is to catch errors that are *not* reported as read errors
> (i.e. silent bit changes that normal filesystems would never catch).  What I
> seem to be seeing is a case in which ZFS says the checksum is wrong.  There
> are only counts in the CKSUM col, not the other cols in the status, so I do
> not think this is a "read error" - it is ZFS's last line of defense (the
> checksum) reporting a mismatch.
> 
> In other words, I assume the read would complete if ZFS did not catch the
> checksum mismatch, and what I'd like to do is let it complete so I can see for
> myself where these bit errors are by comparing the read file to a known good
> copy (that I have).  If there are no mismatches, it would mean there is a
> metadata error or a ZFS bug.

It could also be a memory error of some sort.  Does your
system have ECC memory?  Also note that standalone tests do
not seem to catch all sorts of errors that heavy use of Unix
can sometimes trigger on a marginal system.

But I agree with you that it would be useful to have a debug
mode where you can get at the data even if it is bad (and a
test mode where you can write bad data on purpose:-). [A
long rant on writing testable code deleted]

You have access to the zfs sources! At the very least you can
add code to report the bad checksum & offset and see if it
matches the checksum of the same block(s) in your known good
copy.
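
For the userland half of that comparison, here is a minimal
sketch, not anything taken from the zfs sources.  It reads both
files block by block, records which offsets fail to read (for
example with EIO after a checksum mismatch), and lists the offsets
whose contents differ from the known good copy.  The names
scan_blocks and compare are made up for illustration, the 128 KiB
block size is an assumption about the dataset's recordsize, and
SHA-256 is only a stand-in digest: it is not the checksum ZFS
stores on disk, so these digests can be compared between the two
files but not against values printed by the kernel.

```python
import hashlib
import os

BLOCK_SIZE = 128 * 1024  # assumption: matches the dataset's recordsize


def scan_blocks(path, block_size=BLOCK_SIZE):
    """Map each block offset to its SHA-256 hex digest, or None if the read failed."""
    result = {}
    fd = os.open(path, os.O_RDONLY)
    try:
        size = os.fstat(fd).st_size
        for offset in range(0, size, block_size):
            try:
                # pread reads at an explicit offset, so one failed block
                # does not stop us from trying the next one.
                data = os.pread(fd, block_size, offset)
            except OSError:
                result[offset] = None  # e.g. EIO from a failed checksum
                continue
            result[offset] = hashlib.sha256(data).hexdigest()
    finally:
        os.close(fd)
    return result


def compare(suspect, known_good, block_size=BLOCK_SIZE):
    """Return (unreadable, differing) lists of block offsets in the suspect file."""
    bad = scan_blocks(suspect, block_size)
    good = scan_blocks(known_good, block_size)
    unreadable = {o for o, digest in bad.items() if digest is None}
    differing = sorted(
        o for o in good.keys() | bad.keys()
        if o not in unreadable and good.get(o) != bad.get(o)
    )
    return sorted(unreadable), differing
```

Calling compare("/tank/suspect.bin", "/backup/good.bin") with
hypothetical paths like these returns two lists of byte offsets:
blocks that could not be read at all, and blocks that were read
but do not match.  The first list tells you which offsets to look
for in whatever the kernel-side debug code reports.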