From: Joe Peterson
Date: Mon, 11 Feb 2008 12:39:08 -0700
To: Gavin Atkinson
Cc: freebsd-fs@freebsd.org, freebsd-stable@freebsd.org
Subject: Re: Analysis of disk file block with ZFS checksum error

Gavin Atkinson wrote:
> Are the datestamps (Thu Jan 24 23:20:58 2008) found within the corrupt
> block before or after the datestamp of the file it was found within?
> i.e. was the corrupt block on the disk before or after the mp3 was
> written there?
Hi Gavin,

Those dates are later than the original copy (I do not have the file timestamps to prove this, but going by my email record, I am pretty sure of it). So the corrupt block is later than the original write. If this is the case, I assume that the block got written, by mistake, into the middle of the mp3 file.

Someone else suggested that it could be caused by a bad transfer block number or bad drive command (corrupted on the way to the drive, since these are not checksummed in the hardware). If the block went to the wrong place, AND if it was a HW glitch, I suppose the best ZFS could then do is retry the write (if the failure was even detected - I am still not sure whether ZFS re-checks the on-disk data checksum after a write), not knowing until the later scrub that the block had corrupted a file.

I think that anything is possible, but I know I was getting periodic DMA timeouts, etc. around that time. I hesitate, although it is tempting, to use this evidence to put the blame purely on bad HW, given that others seem to be seeing DMA problems too, and there is reasonable doubt whether their problems are HW related or not.

In my case, I have been free of DMA errors (fingers crossed) since I re-installed FreeBSD completely (giving it a larger boot partition and redoing the ZFS slice too); before that, I had changed the IDE cable just to eliminate one more variable. So there are too many variables to reach a firm conclusion: even if the cable was "bad", I never saw one DMA error or other indication of anything wrong with the HW from the Linux side (and I've been using that HW with both Linux and FreeBSD 6.2 for months now - no apparent flakiness of any kind on either system). So either it *was* bad and FreeBSD 7.0 was being more "honest", or FreeBSD's drivers and/or ZFS were stressing the HW and revealing weaknesses in the cable, or it was a SW issue that got cleared somehow when I re-installed.
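
To make the misdirected-write scenario concrete, here is a toy sketch in Python. It is not ZFS code: ToyDisk, ToyFS, and the misdirect_to parameter are all made up for illustration. It assumes only that the filesystem keeps each block's checksum apart from the block itself, and that nothing is read back after a write.

```python
import hashlib
from typing import Dict, List, Optional

BLOCK_SIZE = 16  # toy block size in bytes; real filesystems use far larger blocks


def checksum(data: bytes) -> str:
    """Checksum of one block's contents."""
    return hashlib.sha256(data).hexdigest()


class ToyDisk:
    """A plain array of blocks with no integrity checking of its own."""

    def __init__(self, nblocks: int) -> None:
        self.blocks: List[bytes] = [bytes(BLOCK_SIZE) for _ in range(nblocks)]

    def write(self, lba: int, data: bytes, misdirect_to: Optional[int] = None) -> None:
        # misdirect_to (hypothetical) simulates a block number that was
        # corrupted on its way to the drive: the data lands somewhere else
        # and the write still appears to succeed.
        target = lba if misdirect_to is None else misdirect_to
        self.blocks[target] = data

    def read(self, lba: int) -> bytes:
        return self.blocks[lba]


class ToyFS:
    """Remembers the expected checksum of every block it has written."""

    def __init__(self, disk: ToyDisk) -> None:
        self.disk = disk
        self.expected: Dict[int, str] = {}  # block number -> checksum

    def write(self, lba: int, data: bytes, misdirect_to: Optional[int] = None) -> None:
        # The checksum is recorded for the block the FS *meant* to write.
        self.expected[lba] = checksum(data)
        self.disk.write(lba, data, misdirect_to)

    def scrub(self) -> List[int]:
        """Re-read every known block and report those whose checksum is wrong."""
        return sorted(
            lba
            for lba, want in self.expected.items()
            if checksum(self.disk.read(lba)) != want
        )


fs = ToyFS(ToyDisk(8))
fs.write(2, b"mp3 audio data..")                  # original file block
fs.write(5, b"log: Jan 24 2008", misdirect_to=2)  # later write lands on block 2
bad = fs.scrub()
print(bad)  # [2, 5]
```

In this toy model neither write reports an error; only the scrub notices. It flags block 2 (the mp3 data was overwritten) and also block 5 (the log data never arrived there), which is why the damage would only show up later.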
Is it possible that the problem lies in the ATA drivers in FreeBSD or even in ZFS and just looks like HW issues? I do not have enough info/expertise to know. If not, then it may very well be true that HW problems are pretty widespread (and that disk HW cannot, in fact, be trusted), and there really *is* a strong need for ZFS *now* to protect our data.

If there is a possibility that SW could be involved, any hints on how to further debug this would be of great help to those still experiencing recent DMA errors. I just want to be more sure one way or the other, but I know this issue is not an easy one (however, it's the kind of problem that should receive the highest priority, IMHO).

-Joe