Date: Wed, 14 Feb 2001 14:35:47 -0800 (PST)
From: mjh@aciri.org
To: freebsd-gnats-submit@FreeBSD.org
Subject: kern/25104: file corruption with Adaptec 29160 SCSI adapter
Message-ID: <200102142235.f1EMZlP79180@freefall.freebsd.org>

>Number:         25104
>Category:       kern
>Synopsis:       file corruption with Adaptec 29160 SCSI adapter
>Confidential:   no
>Severity:       critical
>Priority:       high
>Responsible:    freebsd-bugs
>State:          open
>Quarter:
>Keywords:
>Date-Required:
>Class:          sw-bug
>Submitter-Id:   current-users
>Arrival-Date:   Wed Feb 14 14:40:01 PST 2001
>Closed-Date:
>Last-Modified:
>Originator:     Mark Handley
>Release:        4.2-RELEASE
>Organization:
ACIRI
>Environment:
gaur.aciri.org: uname -a
FreeBSD gaur.aciri.org 4.2-RELEASE FreeBSD 4.2-RELEASE #1: Sat Jan 20 20:49:54 PST 2001 root@gaur.aciri.org:/usr/src/sys/compile/ACIRI-4.2-USB i386

>Description:
I've got five 1.1GHz Athlon systems, running FreeBSD 4.2R with 512MB
RAM, Asus A7V motherboards, Adaptec 29160 U160 SCSI adaptors, and
SEAGATE ST318451LW 18GB drives.

The problem is that I'm seeing file corruption when I write large
(approx. 512MB or larger) files, especially when I write them rapidly.
I can't guarantee it doesn't happen with smaller files, but I wrote a
thousand 100MB files, and not one of them was corrupted.

The corrupted files get 64-byte chunks (usually 64, sometimes smaller)
of other data in the middle of them.

I first noticed the problem with scp, but the problem also happens
with moderate repeatability when simply rapidly writing a big file by
redirecting stdout. Here's the quick-hack test program:

    #include<stdio.h>
    #define FSIZE 1000*1024*1024
    main() {
      int i,j;
      int buf[1024];   /* one 4096-byte block of 32-bit words */
      j=0;
      for(i=0;i<FSIZE/4;i++) {
        buf[j]=i;      /* word i of the file holds the value i */
        if (j==1023) {
          /* buffer full: write it out as 4 items of 1024 bytes */
          fwrite(buf, 1024, 4, stdout);
          j=0;
        } else {
          j++;
        }
      }
    }

Basically it's writing 1000MB to stdout, writing incrementing values
to each 32-bit word. I direct stdout to a file. The MD5 checksum of
the output file should be 1da068574fdb3e3b9ffc3b2022cca171, but
sometimes (somewhere between 1-in-3 and 1-in-10 tries) the file gets
corrupted.

The program to read this back is:

    #include <stdio.h>
    #define FSIZE 1000*1024*1024
    main() {
      int i;
      int j, prev;
      int mode=0;      /* 0 = data matches, 1 = inside a corrupt run */
      for(i=0;i<FSIZE/4;i++) {
        fread(&j, 1, 4, stdin);
        if (mode==0) {
          if (i!=j) {
            /* word i should hold the value i: a corrupt run starts */
            printf("-----------------------------\n");
            printf("problem start at word: %d\n", i);
            printf("got value %d instead of %d\n", j, i);
            mode=1;
          }
        } else {
          if (i==j) {
            /* data matches again: report the last bad word */
            printf("-----------------------------\n");
            printf("last word of problem : %d\n", i-1);
            printf("got value %d instead of %d\n", prev, i-1);
            mode=0;
          }
        }
        prev=j;
      }
    }

Here's one sample output, where there are two separate corruptions:

gaur.aciri.org: ./unfoo3 < t4
-----------------------------
problem start at word: 114561360
got value 909456435 instead of 114561360
got value 171522103 instead of 114561361
got value 875770417 instead of 114561362
got value 943142453 instead of 114561363
got value 842074681 instead of 114561364
got value 909456435 instead of 114561365
got value 171522103 instead of 114561366
got value 875770417 instead of 114561367
got value 943142453 instead of 114561368
got value 842074681 instead of 114561369
got value 909456435 instead of 114561370
got value 171522103 instead of 114561371
got value 875770417 instead of 114561372
got value 943142453 instead of 114561373
got value 842074681 instead of 114561374
got value 909456435 instead of 114561375
-----------------------------
last word of problem : 114561375
got value 909456435 instead of 114561375
-----------------------------
problem start at word: 237338864
got value 112460016 instead of 237338864
got value 112460017 instead of 237338865
got value 112460018 instead of 237338866
got value 112460019 instead of 237338867
got value 112460020 instead of 237338868
got value 112460021 instead of 237338869
got value 112460022 instead of 237338870
got value 112460023 instead of 237338871
got value 112460024 instead of 237338872
got value 112460025 instead of 237338873
got value 112460026 instead of 237338874
got value 112460027 instead of 237338875
got value 112460028 instead of 237338876
got value 112460029 instead of 237338877
got value 112460030 instead of 237338878
got value 112460031 instead of 237338879
-----------------------------
last word of problem : 237338879
got value 112460031 instead of 237338879

In this case, there are two corruptions. The first corruption seems
to be some random chunk of data; the second (more typical) corruption
seems to be a copy of an earlier piece of the file.

In most cases, the corruption is a 64-byte chunk of the file replaced
with some other data, typically (but not always) an earlier chunk of
the same file. I've never seen more than 64 bytes corrupted, but on
one of the machines I've seen smaller corruptions.

I originally thought this was a hardware problem, but I've reproduced
it on the three identical machines I've tried, so if it is a hardware
fault, it's in the whole batch. I've also tried to reproduce it on an
additional 1GHz Athlon/A7V machine with an Adaptec 2940 SCSI adaptor,
but that machine doesn't suffer from the same problem, so I'm
beginning to suspect an interaction between the Adaptec 29160 driver
and the filesystem when writing large files as a possible cause.

Here's the dmesg.boot from one of the problem machines in case it
helps.

gaur.aciri.org: more /var/run/dmesg.boot
Copyright (c) 1992-2000 The FreeBSD Project.
Copyright (c) 1979, 1980, 1983, 1986, 1988, 1989, 1991, 1992, 1993, 1994
        The Regents of the University of California. All rights reserved.
FreeBSD 4.2-RELEASE #1: Sat Jan 20 20:49:54 PST 2001
    root@gaur.aciri.org:/usr/src/sys/compile/ACIRI-4.2-USB
Timecounter "i8254" frequency 1193182 Hz
CPU: AMD Athlon(tm) Processor (1109.89-MHz 686-class CPU)
  Origin = "AuthenticAMD" Id = 0x642 Stepping = 2
  Features=0x183f9ff<FPU,VME,DE,PSE,TSC,MSR,PAE,MCE,CX8,SEP,MTRR,PGE,MCA,CMOV,PAT,PSE36,MMX,FXSR>
  AMD Features=0xc0440000<<b18>,AMIE,DSP,3DNow!>
real memory = 536788992 (524208K bytes)
avail memory = 518864896 (506704K bytes)
Preloaded elf kernel "kernel" at 0xc03c8000.
Pentium Pro MTRR support enabled
md0: Malloc disk
npx0: <math processor> on motherboard
npx0: INT 16 interface
pcib0: <Host to PCI bridge> on motherboard
pci0: <PCI bus> on pcib0
pcib2: <PCI to PCI bridge (vendor=1106 device=8305)> at device 1.0 on pci0
pci1: <PCI bus> on pcib2
isab0: <VIA 82C686 PCI-ISA bridge> at device 4.0 on pci0
isa0: <ISA bus> on isab0
atapci0: <VIA 82C686 ATA66 controller> port 0xd800-0xd80f at device 4.1 on pci0
ata1: at 0x170 irq 15 on atapci0
pci0: <VIA 83C572 USB controller> at 4.2 irq 12
pci0: <VIA 83C572 USB controller> at 4.3 irq 12
fxp0: <Intel Pro 10/100B/100+ Ethernet> port 0xa400-0xa43f mem 0xd6800000-0xd68fffff,0xd7000000-0xd7000fff irq 10 at device 11.0 on pci0
fxp0: Ethernet address 00:02:b3:10:b4:67
pci0: <3D Labs model 000a graphics accelerator> at 12.0 irq 11
ahc0: <Adaptec 29160 Ultra160 SCSI adapter> port 0xa000-0xa0ff mem 0xd5800000-0xd5800fff irq 12 at device 13.0 on pci0
aic7892: Wide Channel A, SCSI Id=7, 32/255 SCBs
atapci1: <Promise ATA100 controller> port 0x8400-0x843f,0x8800-0x8803,0x9000-0x9007,0x9400-0x9403,0x9800-0x9807 mem 0xd5000000-0xd501ffff irq 10 at device 17.0 on pci0
pcib1: <Host to PCI bridge> on motherboard
pci2: <PCI bus> on pcib1
fdc0: <NEC 72065B or clone> at port 0x3f0-0x3f5,0x3f7 irq 6 drq 2 on isa0
fdc0: FIFO enabled, 8 bytes threshold
fd0: <1440-KB 3.5" drive> on fdc0 drive 0
atkbdc0: <Keyboard controller (i8042)> at port 0x60,0x64 on isa0
vga0: <Generic ISA VGA> at port 0x3c0-0x3df iomem 0xa0000-0xbffff on isa0
sc0: <System console> at flags 0x100 on isa0
sc0: VGA <16 virtual consoles, flags=0x300>
sio0 at port 0x3f8-0x3ff irq 4 flags 0x10 on isa0
sio0: type 16550A
sio1 at port 0x2f8-0x2ff irq 3 on isa0
sio1: type 16550A
DUMMYNET initialized (000608)
IP packet filtering initialized, divert disabled, rule-based forwarding disabled, default to deny, logging disabled
acd0: CDROM <SONY CDU4811> at ata1-master using PIO4
Waiting 5 seconds for SCSI devices to settle
Mounting root from ufs:/dev/da0s1a
da0 at ahc0 bus 0 target 0 lun 0
da0: <SEAGATE ST318451LW 0003> Fixed Direct Access SCSI-3 device
da0: 160.000MB/s transfers (80.000MHz, offset 63, 16bit), Tagged Queueing Enabled
da0: 17501MB (35843671 512 byte sectors: 255H 63S/T 2231C)

>How-To-Repeat:
Write several very large files rapidly (see above). Some fraction of
them will be corrupted (I see between 5% and 25% of 512MB files get
corrupted).

>Fix:

>Release-Note:
>Audit-Trail:
>Unformatted:

To Unsubscribe: send mail to majordomo@FreeBSD.org
with "unsubscribe freebsd-bugs" in the body of the message