From owner-freebsd-bugs Fri Jan 5 12:10:12 2001
From owner-freebsd-bugs@FreeBSD.ORG Fri Jan 5 12:10:02 2001
Return-Path:
Delivered-To: freebsd-bugs@hub.freebsd.org
Received: from freefall.freebsd.org (freefall.FreeBSD.org [216.136.204.21])
	by hub.freebsd.org (Postfix) with ESMTP id 98AAA37B404
	for ; Fri, 5 Jan 2001 12:10:02 -0800 (PST)
Received: (from gnats@localhost)
	by freefall.freebsd.org (8.11.1/8.11.1) id f05KA2G65273;
	Fri, 5 Jan 2001 12:10:02 -0800 (PST)
	(envelope-from gnats)
Resent-Date: Fri, 5 Jan 2001 12:10:02 -0800 (PST)
Resent-Message-Id: <200101052010.f05KA2G65273@freefall.freebsd.org>
Resent-From: gnats-admin@FreeBSD.org (GNATS Management)
Resent-To: freebsd-bugs@FreeBSD.org
Resent-Reply-To: gnats-admin@FreeBSD.org, Martin Birgmeier
Received: from email02.aon.at (WARSL401PIP3.highway.telekom.at [195.3.96.75])
	by hub.freebsd.org (Postfix) with SMTP id DC86037B402
	for ; Fri, 5 Jan 2001 12:08:49 -0800 (PST)
Received: (qmail 699608 invoked from network); 5 Jan 2001 20:08:43 -0000
Received: from n807p016.dipool.highway.telekom.at (HELO aon.at) ([212.183.110.208])
	(envelope-sender )
	by qmail2.highway.telekom.at (qmail-ldap-1.03) with SMTP
	for ; 5 Jan 2001 20:08:43 -0000
Message-Id: <3A5629BC.8147CE64@aon.at>
Date: Fri, 05 Jan 2001 21:08:28 +0100
From: Martin Birgmeier
Sender: martin@FreeBSD.ORG
To: FreeBSD-gnats-submit@FreeBSD.ORG
Subject: kern/24092: Disk data corruption using FreeBSD_4_2_0_RELEASE
Resent-Sender: gnats@FreeBSD.org
Sender: owner-freebsd-bugs@FreeBSD.ORG
Precedence: bulk
X-Loop: FreeBSD.org

>Number:         24092
>Category:       kern
>Synopsis:       Disk data corruption using FreeBSD_4_2_0_RELEASE
>Confidential:   no
>Severity:       critical
>Priority:       high
>Responsible:    freebsd-bugs
>State:          open
>Quarter:
>Keywords:
>Date-Required:
>Class:          sw-bug
>Submitter-Id:   current-users
>Arrival-Date:   Fri Jan 05 12:10:01 PST 2001
>Closed-Date:
>Last-Modified:
>Originator:     Martin.Birgmeier@aon.at (Martin Birgmeier)
>Release:        FreeBSD 4.2-RELEASE i386
>Organization:
MBi at home
>Environment:

ASUS A7V, 256 MB main memory
disks as shown below

atapci0: port 0xd800-0xd80f at device 4.1 on pci0
ata0: at 0x1f0 irq 14 on atapci0
ata1: at 0x170 irq 15 on atapci0
atapci1: port 0x8000-0x803f,0x8400-0x8403,0x8800-0x8807,0x9000-0x9003,0x9400-0x9407 mem 0xd5000000-0xd501ffff irq 10 at device 17.0 on pci0
ad0: 27199MB [55262/16/63] at ata0-master UDMA33
ad3: 29188MB [59303/16/63] at ata1-slave UDMA66
acd0: CDROM at ata0-slave using UDMA33
acd1: CD-RW at ata1-master using WDMA2
Mounting root from ufs:/dev/ad0s4a

df output:

Filesystem             1K-blocks     Used    Avail Capacity  Mounted on
/dev/ad0s4a                79359    40819    32192    56%    /
devfs                         16       16        0   100%    dummy_mount
/dev/ad0s4e                39647     3138    33338     9%    /var
/dev/ad0s4f              8825163  6923321  1195829    85%    /usr
/dev/ad0s4g             13605124  8565455  3951260    68%    /d/5s4g
/dev/ad3s1a                79359    53254    19757    73%    /d/6s1a
/dev/ad3s4e             22491456  8855727 11836413    43%    /d/6s4e
mfs:34                    127023       16   116846     0%    /tmp
procfs                         4        4        0   100%    /proc
devfs                         16       16        0   100%    /devs
:/usr/X11R6.local       17650326 15748484  1195829    93%    /usr/X11R6.4
pid147@gandalf:/vol            0        0        0   100%    /vol
pid147@gandalf:/users          0        0        0   100%    /users
pid147@gandalf:/srcs           0        0        0   100%    /srcs
pid147@gandalf:/opt            0        0        0   100%    /opt
pid147@gandalf:/d/auto         0        0        0   100%    /d/auto

- Sources obtained via CTM
- checkout via `cvs -R co -rRELENG_4_2_0_RELEASE src'
- buildworld, installworld

(In fact, buildworld stopped on a corrupted source file, which was
one of the earliest hints I got that something was very wrong.)

>Description:

See below. Running cmp -x on the files just copied shows that
blocks of size 2 ** n, with n between 6 and 12 inclusive (that is,
64 to 4096 bytes), are corrupt. Sometimes there is more than one
such block in the same file. Fortunately, mostly (but not only!)
long files are affected.

>How-To-Repeat:

Use the following shell script. The file "SRC" contains data:

% ls -l /d/5s4g/fileX
-rw-r--r--  1 root  wheel  1083285504 Jan  2 14:24 .../fileX
%

----------------------------------------------------------------------
#! /bin/sh

SRC=/d/5s4g/fileX
DST=/d/6s4e/file

# Copy the source file eight times, stopping early if dd fails.
for i in 1 2 3 4 5 6 7 8
do
	echo "*** $i ***"
	dd if="$SRC" of="${DST}$i" bs=102400k || break
	# Compare up to three times; stop at the first successful compare.
	for j in 1 2 3
	do
		cmp "$SRC" "${DST}$i" && break
	done
done
----------------------------------------------------------------------

In about 50% of the cases, the compare does not succeed. (I once had
a case where a compare failed on the first try but later succeeded,
hence the triple comparison.)

This happens most often on large(r) files, which is exactly why I am
using a file of about 1 GB for testing.

Notes:

I tested copying a file of 800 MB under Win98 twice - no problems.

In addition, I installed Suse Linux 7.0 on ad0s3 and tested copying
a file of about 400 MB four times using the above shell script (as
in `for i in 1 2 3 4'...). No problems.

The file sizes are somewhat smaller there because I don't have much
disk space devoted to the other environments.

>Fix:

Unknown. However, I tried the following, without any improvement:

- In /sys/dev/ata/ata-all.c, made ata_umode() return -1 always. As a
  result, the disks used WDMA2.
- In /sys/dev/ata/ata-disk.c, ad_attach() (only one of the following
  items at a time):
  . fixed adp->transfersize at DEV_BSIZE
  . disabled write caching

With all this, I am pretty sure that the problem lies not with my
hardware, but within the vm/buffer subsystem and its interaction
with some other service, possibly malloc (the corrupted regions
always seem to be powers of two in length, see above).

--
Martin Birgmeier
Vienna
Austria

>Release-Note:
>Audit-Trail:
>Unformatted:
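
The description says the corrupt regions are 2 ** n bytes long, with n between 6 and 12. One way to check that observation on a failed copy might be to summarise the byte-by-byte differences into contiguous runs. The sketch below is not part of the original report: the function name diff_runs and its optional max_gap argument are hypothetical, and it assumes the POSIX `cmp -l` output format (decimal, 1-based byte offset in the first column) rather than the `cmp -x` output mentioned above.

```shell
#!/bin/sh
# Hypothetical helper, not from the original report.
# Reads `cmp -l` output on stdin and prints one line per contiguous run
# of differing bytes: "<start offset> <length in bytes>".
# The optional argument is the number of matching bytes tolerated inside
# a run (default 0), since bytes within a corrupt block can match by chance.
diff_runs() {
    awk -v gap="${1:-0}" '
        # $1 is the 1-based byte offset printed by cmp -l.
        NR == 1 { start = $1; prev = $1; next }
        # A jump larger than the tolerated gap closes the current run.
        $1 - prev > gap + 1 {
            print start, prev - start + 1
            start = $1
        }
        { prev = $1 }
        # Emit the final run, if there was any input at all.
        END { if (NR > 0) print start, prev - start + 1 }
    '
}

# Example use with the paths from the script above:
#   cmp -l /d/5s4g/fileX /d/6s4e/file1 | diff_runs
```

With the default gap of 0, a corrupt block in which a few bytes happen to match the original is reported as several shorter runs, so lengths slightly below a power of two, or several adjacent runs, would still be consistent with the description. Passing a small gap value merges such fragments.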