Message-ID: <47CD4DCF.5070505@freebsd.org>
Date: Tue, 04 Mar 2008 07:25:35 -0600
From: Eric Anderson
To: Joe Peterson
Cc: freebsd-fs@freebsd.org, freebsd-stable@freebsd.org
Subject: Re: Analysis of disk file block with ZFS checksum error
In-Reply-To: <47B0A45C.4090909@skyrush.com>

Joe Peterson wrote:
> Gavin Atkinson wrote:
>> Are the datestamps (Thu Jan 24 23:20:58 2008) found within
>> the corrupt block before or after the datestamp of the file it was
>> found within? i.e. was the corrupt block on the disk before or after
>> the mp3 was written there?
>
> Hi Gavin, those dates are later than the original copy (I do not have
> the file timestamps to prove this, but according to my email record, I
> am pretty sure of this). So the corrupt block is later than the
> original write.
>
> If this is the case, I assume that the block got written, by mistake,
> into the middle of the mp3 file. Someone else suggested that it could
> be caused by a bad transfer block number or bad drive command (corrupted
> on the way to the drive, since these are not checksummed in the
> hardware). If the block went to the wrong place, AND if it was a HW
> glitch, I suppose the best ZFS could then do is retry the write (if its
> failure was even detected - still not sure if ZFS does a re-check of the
> disk data checksum after the disk write), not knowing until the later
> scrub that the block had corrupted a file.
>
> I think that anything is possible, but I know I was getting periodic DMA
> timeouts, etc. around that time. I hesitate, although it is tempting,
> to use this evidence to focus blame purely on bad HW, given that others
> seem to be seeing DMA problems too, and there is reasonable doubt
> whether their problems are HW related or not. In my case, I have been
> free of DMA errors (cross your fingers) after re-installing FreeBSD
> completely (giving it a larger boot partition and redoing the ZFS slice
> too), and before this, I changed the IDE cable just to eliminate one
> more variable. Therefore, there are too many variables to reach a firm
> conclusion, since even if the cable was "bad", I never saw one DMA error
> or other indication of anything wrong with HW from the Linux side (and
> I've been using that HW with both Linux and FreeBSD 6.2 for months now -
> no apparent flakiness of any kind on either system).
> So either it *was* bad and FreeBSD 7.0 was being more "honest",
> FreeBSD's drivers and/or ZFS were stressing the HW and revealing
> weaknesses in the cable, or it was a SW issue that got cleared somehow
> when I re-installed.
>
> Is it possible that the problem lies in the ATA drivers in FreeBSD or
> even in ZFS and just looks like HW issues? I do not have enough
> info/expertise to know. If not, then it may very well be true that HW
> problems are pretty widespread (and that disk HW cannot, in fact, be
> trusted), and there really *is* a strong need for ZFS *now* to protect
> our data. If there is a possibility that SW could be involved, any
> hints on how to further debug this would be of great help to those still
> experiencing recent DMA errors. I just want to be more sure one way or
> the other, but I know this issue is not an easy one (however, it's the
> kind of problem that should receive the highest priority, IMHO).

I'm not sure what happened to this thread, but I also had a lot of
similar issues. I was using a mirrored pair of brand-new SATA drives.
It was suggested that my controller was junk. I'm starting to think
there is a timing issue or some such problem with ZFS, since I can use
the same drives in a gmirror with UFS and never have any data problems
(md5 checksums confirm it over and over).

I highly doubt that everyone is seeing similar issues just because ZFS
is so intense. I've had plenty of systems under severe disk load that
have never exhibited corrupt files because of something like this.

I wish we could get a handle on this issue. One common thread seems to
be ATA/SATA disks.

Is your setup running 32-bit or 64-bit FreeBSD? (If you already
mentioned it, I'm sorry, I missed it.)

Eric
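
For anyone wanting to repeat the kind of md5 comparison mentioned above, here
is a minimal sketch in Python. It is not the procedure actually used in this
thread; the function names (md5_of_file, write_and_verify) and the default
sizes are hypothetical, chosen only for illustration. It writes random data to
a file, forces it out with fsync, then re-reads the file several times and
reports any pass whose digest differs from the expected one.

```python
import hashlib
import os
import tempfile


def md5_of_file(path, chunk_size=1 << 20):
    """Return the hex MD5 digest of the file at `path`, read in chunks."""
    digest = hashlib.md5()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def write_and_verify(directory, size_bytes=1 << 20, passes=3):
    """Write random data into `directory`, then re-read it `passes` times.

    Returns the list of pass numbers (0-based) whose digest did not match
    the digest of the data as it was held in memory before the write.
    """
    data = os.urandom(size_bytes)
    expected = hashlib.md5(data).hexdigest()
    fd, path = tempfile.mkstemp(dir=directory)
    try:
        with os.fdopen(fd, "wb") as f:
            f.write(data)
            f.flush()
            os.fsync(f.fileno())  # ask the OS to push the data to the device
        return [i for i in range(passes) if md5_of_file(path) != expected]
    finally:
        os.remove(path)  # clean up the test file
```

Calling write_and_verify on a directory that lives on the filesystem under
test returns an empty list when every re-read matches. Note that re-reads
may be served from the OS cache rather than from the disk, so a meaningful
test would need files larger than RAM, or an unmount/remount between the
write and the reads.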