Date:      Fri, 05 Jan 2001 21:08:28 +0100
From:      Martin Birgmeier <Martin.Birgmeier@aon.at>
To:        FreeBSD-gnats-submit@FreeBSD.ORG
Subject:   kern/24092: Disk data corruption using FreeBSD_4_2_0_RELEASE
Message-ID:  <3A5629BC.8147CE64@aon.at>
Resent-Message-ID: <200101052010.f05KA2G65273@freefall.freebsd.org>

>Number:         24092
>Category:       kern
>Synopsis:       Disk data corruption using FreeBSD_4_2_0_RELEASE
>Confidential:   no
>Severity:       critical
>Priority:       high
>Responsible:    freebsd-bugs
>State:          open
>Quarter:        
>Keywords:       
>Date-Required:
>Class:          sw-bug
>Submitter-Id:   current-users
>Arrival-Date:   Fri Jan 05 12:10:01 PST 2001
>Closed-Date:
>Last-Modified:
>Originator:     Martin.Birgmeier@aon.at (Martin Birgmeier)
>Release:        FreeBSD 4.2-RELEASE i386
>Organization:
MBi at home
>Environment:

ASUS A7V, 256 MB main memory
disks as shown below

atapci0: <VIA 82C686 ATA66 controller> port 0xd800-0xd80f at device 4.1 on pci0
ata0: at 0x1f0 irq 14 on atapci0
ata1: at 0x170 irq 15 on atapci0
atapci1: <Promise ATA100 controller> port 0x8000-0x803f,0x8400-0x8403,0x8800-0x8807,0x9000-0x9003,0x9400-0x9407 mem 0xd5000000-0xd501ffff irq 10 at device 17.0 on pci0
ad0: 27199MB <ST328040A> [55262/16/63] at ata0-master UDMA33
ad3: 29188MB <ST330630A> [59303/16/63] at ata1-slave UDMA66
acd0: CDROM <TOSHIBA CD-ROM XM-6602B> at ata0-slave using UDMA33
acd1: CD-RW <PLEXTOR CD-R PX-W1210A> at ata1-master using WDMA2
Mounting root from ufs:/dev/ad0s4a

df output:

Filesystem               1K-blocks     Used    Avail Capacity  Mounted on
/dev/ad0s4a                  79359    40819    32192    56%    /
devfs                           16       16        0   100%    dummy_mount
/dev/ad0s4e                  39647     3138    33338     9%    /var
/dev/ad0s4f                8825163  6923321  1195829    85%    /usr
/dev/ad0s4g               13605124  8565455  3951260    68%    /d/5s4g
/dev/ad3s1a                  79359    53254    19757    73%    /d/6s1a
/dev/ad3s4e               22491456  8855727 11836413    43%    /d/6s4e
mfs:34                      127023       16   116846     0%    /tmp
procfs                           4        4        0   100%    /proc
devfs                           16       16        0   100%    /devs
<above>:/usr/X11R6.local  17650326 15748484  1195829    93%    /usr/X11R6.4
pid147@gandalf:/vol              0        0        0   100%    /vol
pid147@gandalf:/users            0        0        0   100%    /users
pid147@gandalf:/srcs             0        0        0   100%    /srcs
pid147@gandalf:/opt              0        0        0   100%    /opt
pid147@gandalf:/d/auto           0        0        0   100%    /d/auto

- Sources obtained via CTM
- checkout via `cvs -R co -rRELENG_4_2_0_RELEASE src'
- buildworld, installworld

(In fact, buildworld stopped on a corrupted source file, which was one of
the earliest hints I got that something was very wrong.)

>Description:

See How-To-Repeat below. Running cmp -x on the files just copied
shows that blocks of size 2 ** n, with n between 6 and 12 inclusive,
are corrupt (sometimes more than one such block in the same
file).
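
The block sizes above can be checked mechanically. What follows is only
a sketch, not the procedure used for this report: it assumes the sizes
are counted in bytes, and diff_runs is a hypothetical helper name. It
groups the byte numbers printed by `cmp -l' into runs of consecutive
differing bytes and prints the 0-based start offset, the run length,
and whether that length is a power of two.

```sh
#! /bin/sh

# diff_runs FILE1 FILE2 (hypothetical helper, not part of the report)
# Prints one line per run of consecutive differing bytes:
#   <0-based start offset> <length in bytes> <pow2|other>
diff_runs() {
        cmp -l "$1" "$2" | awk '
                # True if n is 1, 2, 4, 8, ...
                function is_pow2(n) {
                        while (n > 1 && n % 2 == 0)
                                n /= 2
                        return n == 1
                }
                # cmp -l numbers bytes from 1; convert to a 0-based offset.
                function report(first, last,    len) {
                        len = last - first + 1
                        print first - 1, len, (is_pow2(len) ? "pow2" : "other")
                }
                # A gap in the byte numbers ends the current run.
                NR > 1 && $1 != prev + 1 { report(start, prev) }
                NR == 1 || $1 != prev + 1 { start = $1 }
                { prev = $1 }
                END { if (NR > 0) report(start, prev) }
        '
}
```

Usage would be something like `diff_runs /d/5s4g/fileX /d/6s4e/file1'.
Note that cmp -l lists only bytes that actually differ, so a corrupt
block that happens to contain a few matching bytes shows up as several
shorter runs.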

Fortunately, mostly (but not only!) long files are affected.

>How-To-Repeat:

Use the following shell script. The file named by SRC contains about 1 GB of data:
% ls -l /d/5s4g/fileX
-rw-r--r--  1 root  wheel  1083285504 Jan  2 14:24 .../fileX
%

----------------------------------------------------------------------
#! /bin/sh

SRC=/d/5s4g/fileX
DST=/d/6s4e/file

for i in 1 2 3 4 5 6 7 8
do
        echo "*** $i ***"
        dd if="$SRC" of="${DST}$i" bs=102400k || break
        for j in 1 2 3
        do
                cmp "$SRC" "${DST}$i" && break
        done
done
----------------------------------------------------------------------

What happens is that in about 50% of the cases, the compare does not
succeed (I once had a case where a compare failed on the first try,
but later succeeded; hence the triple comparison).

This happens most often on large(r) files, which is exactly the reason
why I am using a file of about 1 GB for testing.
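
Because a compare once failed and then succeeded on a retry, it can
help to record the result of every pass instead of stopping at the
first success. The following is a minimal sketch under that
assumption; cmp_passes is a hypothetical helper name, and the idea that
mixed results point at the read path rather than at the data on disk
is an interpretation, not something established above.

```sh
#! /bin/sh

# cmp_passes FILE1 FILE2 [COUNT] (hypothetical helper)
# Compares the two files COUNT times (default 3) and prints one word
# per pass: "ok" if cmp exits 0, "FAIL" for any other exit status.
cmp_passes() {
        passes=${3:-3}
        result=
        n=0
        while [ "$n" -lt "$passes" ]
        do
                if cmp -s "$1" "$2"
                then
                        result="$result ok"
                else
                        result="$result FAIL"
                fi
                n=$((n + 1))
        done
        # Strip the leading space before printing.
        echo "${result# }"
}
```

In the script above, `cmp_passes "$SRC" "${DST}$i"' could stand in for
the inner loop. A line such as "FAIL ok ok" would correspond to the
case described here, while "FAIL FAIL FAIL" would indicate a copy that
miscompares consistently.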

Notes: I tested copying a file of 800 MB under Win98 twice - no
problems.  In addition, I installed Suse Linux 7.0 on ad0s3, and
tested copying a file of about 400 MB four times using the above
shell script (as in `for i in 1 2 3 4'...). No problems. The file
sizes are somewhat smaller because I don't have much disk space
devoted to the other environments.

>Fix:

Unknown.

However, I tried the following, without any improvement:

- In /sys/dev/ata/ata-all.c, made ata_umode() return -1 always. As a
  result, the disks used WDMA2.

- In /sys/dev/ata/ata-disk.c, ad_attach() (one of the following
  changes at a time):
  . fixed adp->transfersize at DEV_BSIZE
  . disabled write caching

With all this, I am pretty sure that the problem lies not with my
hardware, but within the vm/buffer subsystem and its interaction
with some other service, possibly malloc (corruptions seem to
always be powers of two in length, see above).

-- 
Martin Birgmeier

Vienna
Austria

>Release-Note:
>Audit-Trail:
>Unformatted:

