From owner-freebsd-fs@FreeBSD.ORG  Sun Nov 17 16:53:50 2013
Return-Path: <owner-freebsd-fs@FreeBSD.ORG>
Delivered-To: freebsd-fs@freebsd.org
In-Reply-To: <9CB46A22C0BE40029652144B2586462A@d40>
References: <9CB46A22C0BE40029652144B2586462A@d40>
Date: Sun, 17 Nov 2013 10:53:49 -0600
Message-ID: <CA+tpaK3fdZ1fZ+GXTVV1XLf6+S=HMVvp8fBp3R=X7W-nt-_szw@mail.gmail.com>
Subject: Re: rare, random issue with read(), mmap() failing to read entire file
From: Adam Vande More <amvandemore@gmail.com>
To: John Refling <netbsdrat@gmail.com>
Content-Type: text/plain; charset=ISO-8859-1
X-Content-Filtered-By: Mailman/MimeDel 2.1.16
Cc: freebsd-fs <freebsd-fs@freebsd.org>,
 FreeBSD Questions <freebsd-questions@freebsd.org>

On Fri, Nov 15, 2013 at 8:56 PM, John Refling <netbsdrat@gmail.com> wrote:

>
> I'm having some very insidious issues with copying and verifying
> (identical) data from several hard disks.  This might be a hardware issue
> or something very deep in the disk / filesystem code.  I have verified
> this with several disks and motherboards.  It corrupts 0.0096% of my
> files, different files each time!
>
> Background:
>
>
>
> 1.  I have a 500 GB USB hard disk (the new 4,096 [4k] sector size) which I
> have been using to store a master archive of over 70,000 files.
>
>
>
> 2.  To make a backup of the USB disk, I copied everything over to a 500 GB
> SATA hard disk.  [Various combinations of `cp -r', `scp -r', `tar -cf - . |
> rsh ... tar -xf -', etc.]
>
>
>
> 3.  To verify that the copy was correct, I did sha256 sums of all files on
> both disks.
>
>
>
> 4.  When comparing the sha256 sums on both drives, I discovered that 6 or
> so files did not compare OK from one drive to the other.
>
>
>
> 5.  When I checked the files individually, the files compared OK, and even
> when I recomputed their individual sha256 sums, I got DIFFERENT sha256 sums
> which were correct this time!
>
>
>
> The above led me to investigate further, and using ONLY the USB disk, I
> recomputed the sha256 sums for all files ON THAT DISK.  A small number
> (6-12) of files ON THE SAME DISK had different sha256 sums than previously
> computed!  The disk is read-only so nothing could have changed.
>
> To try to get to the bottom of this, I took the sha256 code and put it in
> my own file reading routine, which reads in data from the file using
> read().  On summing up the total bytes read in the read() loop, I
> discovered that on the files that failed to compare, the read() returned
> EOF before the actual EOF.  According to the manual page this is
> impossible.  I compared the total number of bytes read by the read() loop
> to the stat() file length value, and they were different!  Obviously, the
> sha256 sum will be different since not all of the file is read.
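
The check described above could be sketched roughly as follows.  This is
only a minimal sketch under assumptions, not the routine from the original
post: the function name check_full_read() and the 64 KB buffer size are
made up for illustration.  It treats only a return value of 0 from read()
as end-of-file (a count smaller than the buffer is legal and is not EOF),
retries on EINTR, and compares the total against st_size from fstat().

```c
#include <assert.h>
#include <errno.h>
#include <fcntl.h>
#include <sys/stat.h>
#include <sys/types.h>
#include <unistd.h>

/*
 * Hypothetical helper (not from the original post): read a file to EOF
 * with read(2) and compare the byte count with st_size from fstat(2).
 * Returns 0 if the counts match, 1 if they differ, -1 on a system error.
 */
int
check_full_read(const char *path, off_t *expected, off_t *got)
{
    char buf[65536];            /* buffer size is an arbitrary choice */
    struct stat sb;
    off_t total = 0;
    ssize_t n;
    int fd, saved;

    assert(path != NULL && expected != NULL && got != NULL);

    fd = open(path, O_RDONLY);
    if (fd == -1)
        return -1;
    if (fstat(fd, &sb) == -1) {
        saved = errno;
        close(fd);
        errno = saved;
        return -1;
    }
    for (;;) {
        n = read(fd, buf, sizeof(buf));
        if (n > 0) {
            total += n;         /* a partial read is fine; keep going */
            continue;
        }
        if (n == 0)
            break;              /* the kernel reported end-of-file */
        if (errno == EINTR)
            continue;           /* interrupted by a signal; retry */
        saved = errno;          /* real I/O error: preserve errno */
        close(fd);
        errno = saved;
        return -1;
    }
    close(fd);
    *expected = sb.st_size;
    *got = total;
    return (total == sb.st_size) ? 0 : 1;
}
```

Calling it once per file and logging every path where it returns 1 (with
both counts) or -1 (with errno) would at least separate "read() reported
EOF early" from "read() reported an error", which the sha256 mismatch alone
cannot do.
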
>
>
>
> This happens consistently on 6 to 12 files out of 70,000+ *every* time, and
> on DIFFERENT files *every* time.  So things work 99.9904% of the time.
>
>
>
> But something fails 0.0096% (one hundredth of one percent) of the time,
> which with a large number of files is significant!
>
>
>
> Instead of read(), I tried mmap()ing chunks of the file.  Using mmap() to
> access the data in the file instead of read() resulted in a (different)
> sha256 sum than the read() version!  The mmap() version was correct, except
> in ONE case where BOTH versions were WRONG, when compared to a 3rd and 4th
> run!
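
For comparison, reading the file through mmap() in chunks might look
something like the sketch below.  Again this is an illustration under
assumptions, not the code from the original post: mmap_checksum() is a
made-up name, the 1 MiB window is an arbitrary size chosen so that every
offset stays page-aligned, and a plain byte sum stands in for SHA-256.

```c
#include <assert.h>
#include <fcntl.h>
#include <stdint.h>
#include <sys/mman.h>
#include <sys/stat.h>
#include <sys/types.h>
#include <unistd.h>

#define CHUNK ((off_t)1 << 20)  /* 1 MiB window, a multiple of the page size */

/*
 * Hypothetical helper: walk a file through mmap(2) windows and fold every
 * byte into an additive checksum (a stand-in for SHA-256).
 * Returns 0 on success, -1 on error.
 */
int
mmap_checksum(const char *path, uint64_t *sum, off_t *nbytes)
{
    struct stat sb;
    off_t off;
    int fd;

    assert(path != NULL && sum != NULL && nbytes != NULL);
    *sum = 0;
    *nbytes = 0;

    fd = open(path, O_RDONLY);
    if (fd == -1)
        return -1;
    if (fstat(fd, &sb) == -1) {
        close(fd);
        return -1;
    }
    for (off = 0; off < sb.st_size; off += CHUNK) {
        off_t left = sb.st_size - off;
        size_t len = (size_t)(left < CHUNK ? left : CHUNK);
        unsigned char *p = mmap(NULL, len, PROT_READ, MAP_PRIVATE, fd, off);

        if (p == MAP_FAILED) {
            close(fd);
            return -1;
        }
        for (size_t i = 0; i < len; i++)
            *sum += p[i];       /* touching each byte faults the page in */
        *nbytes += (off_t)len;
        munmap(p, len);
    }
    close(fd);
    return 0;
}
```

Note that this loop is bounded by st_size, so the byte count always equals
the stat() length; unlike the read() loop, it has no way to report an early
EOF.  Comparing its checksum against one computed via read() on the same
file is what would expose a difference between the two paths.
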
>
>
>
> Using `diff -rq disk1 disk2` resulted in similar issues.  There were always
> a few files that failed to compare.  Doing another `diff -rq disk1 disk2`
> resulted in a few *other* files that failed to compare, while the ones that
> didn't compare OK the first time, DID compare OK the second time.  This
> happened to 6-12 files out of 70,000+.
>
>
>
> Whatever is affecting my use of read() in my sha256 routine seems to also
> affect system utilities such as diff!
>
>
>
> This gets really insidious because I don't know if the original
> `cp -r disk1 disk2` did these short reads on a few files while copying the
> files, thus corrupting my archive backup (on 6-12 files)!
>
>
>
> Some of the files that fail are small (10KB) and some are huge (8GB).
>
>
>
> HELP!
>
>
>
> It takes 7 hours to recompute the sha256 sums of the files on the disk so
> random experiments are time consuming, but I'm willing to try things that
> are suggested.
>
>
>
> System details:
>
>
>
> This is observed with the following disks:
>
>
>
> Western Digital 500GB SATA 512 byte sectors
>
> Hitachi 500GB SATA 512 byte sectors
>
> Iomega RPHD-UG3 500GB USB 4096 byte sectors
>
>
>
> in combination with these motherboards:
>
>
>
> P4M800Pro-M V2.0: Pentium D 2.66 GHz, 2GB memory
>
> HP/Compaq Evo: Pentium 4, 2.8 GHz, 2GB memory
>
>
>
> Operating system version:
>
> FreeBSD 9.1-RELEASE #0
>
>
>
> No hardware errors were noted in /var/log/messages during the file
> reading.
>
> Ran SpinRite on the disks to freshen (re-read/write) all sectors, with no
> errors.
>
>
>
> The file systems were built using:
>
>
>
> dd if=/dev/zero of=/dev/xxx bs=2m
>
> newfs -m0 /dev/xxx
>
>
>
> Looked through the mailing lists and bug reports but can't see anything
> similar.
>
>
>
> Thanks for your help,
>
>
>
> John Refling
>

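Try recoverdisk(1).  It reads the whole device and retries the ranges that
fail, so it should show whether the disks themselves return read errors.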



-- 
Adam