Date: Wed, 31 Jan 2001 22:53:10 +0000
From: Ian Dowse <iedowse@maths.tcd.ie>
To: scsi@freebsd.org
Cc: iedowse@maths.tcd.ie
Subject: Corruption on ahc reads - seems PCI latency related
Message-ID: <200101312253.aa86550@salmon.maths.tcd.ie>

We have a heavily loaded 4.2-STABLE NFS fileserver machine that has
recently developed a file corruption problem. The corruption seems to
be occurring during reads from one SCSI disk (da0). It appears that
small regions (usually 18 bytes) of a read are 'missed', so the buffer
cache ends up with mostly the new data, but some bytes are from
whatever happened to be in the buffer cache before the read.

Here's an example of the corruption on an executable:

--- good Fri Jan 19 21:46:18 2001
+++ corrupted Fri Jan 19 21:46:18 2001
@@ -5468,4 +5468,4 @@
 000155d0 25 73 00 43 6f 75 6c 64 20 6e 6f 74 20 6f 70 65 |%s.Could not ope|
-000155e0 6e 20 68 6f 73 74 00 61 50 3a 75 3a 70 3a 4a 3a |n host.aP:u:p:J:|
-000155f0 72 64 3a 67 3a 00 4d 61 6c 66 6f 72 6d 65 64 20 |rd:g:.Malformed |
+000155e0 6e 20 68 6f 73 74 00 61 50 3a 75 3a 70 3a 6d 40 |n host.aP:u:p:m@|
+000155f0 2f 36 df 7e 01 d9 6d 40 e0 ef 12 20 fd ce 6d 40 |/6.~..m@... ..m@|
 00015600 55 52 4c 3a 20 25 73 0a 00 00 00 00 00 00 00 00 |URL: %s.........|

All the examples I have seen involve the last few bytes of a 512-byte
block. Sample offsets are 0x1dee, 0x15ee, 0x1df0, 0x15f0, 0x195ee. In
the above example, the junk in place of the real data happens to be
from a Matlab data file that was written from an NFS client to a
different local disk (da2). No corruption was seen in the Matlab data
file.

I am able to repeat this corruption by doing the following:

	# clear out buffer cache
	perl -e '$_ = "x" x 12800000'

	# start a continuous write from an NFS client to da2
	rsh client "cat hugefile > /server_da2/file"

	# /usr/local is on da0
	md5 /usr/local/bin/* | diff /tmp/good_md5.out -

	# examine resulting differences

The odd thing is that we can only reproduce the corruption when
reading from da0 (Quantum 9Gb), while writing over NFS to another disk
(I have only tried da2). Swapping out da0 with another similar disk
did not help.
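
The pattern described above (short runs of stale bytes ending at a 512-byte block boundary) can be checked mechanically by comparing a known-good copy of a file with the suspect one. What follows is only a minimal sketch in Python, assuming both copies fit in memory; the names `diff_runs` and `report` and the command-line usage are hypothetical and not part of the original report. A stale byte that happens to equal the correct byte will split one damaged region into two reported runs.

```python
import sys
from typing import List, Tuple

BLOCK_SIZE = 512  # sector size reported for the disks in the boot messages


def diff_runs(good: bytes, suspect: bytes) -> List[Tuple[int, int]]:
    """Return (start_offset, length) for each run of differing bytes."""
    runs: List[Tuple[int, int]] = []
    start = None
    limit = min(len(good), len(suspect))
    for i in range(limit):
        if good[i] != suspect[i]:
            if start is None:
                start = i          # a new run of differences begins here
        elif start is not None:
            runs.append((start, i - start))
            start = None
    if start is not None:          # run extends to the end of the data
        runs.append((start, limit - start))
    return runs


def report(runs: List[Tuple[int, int]]) -> List[str]:
    """Describe each run relative to its 512-byte block."""
    lines = []
    for start, length in runs:
        end = start + length
        boundary = ("ends on a block boundary"
                    if end % BLOCK_SIZE == 0
                    else "does not end on a block boundary")
        lines.append("offset 0x%x: %d bytes, at 0x%x within its block, %s"
                     % (start, length, start % BLOCK_SIZE, boundary))
    return lines


# Hypothetical usage: python diffruns.py good_copy suspect_copy
if __name__ == "__main__" and len(sys.argv) == 3:
    with open(sys.argv[1], "rb") as f:
        good_data = f.read()
    with open(sys.argv[2], "rb") as f:
        suspect_data = f.read()
    for line in report(diff_runs(good_data, suspect_data)):
        print(line)
```

For the example shown in the diff, the damaged run starts at 0x155ee, which is 0x1ee within its block, and its 18 bytes end exactly at the block boundary 0x15600.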

Anyway, today I tried fiddling with the PCI latency timer settings,
and it seems that reducing the value of the ahc PCI latency timer
makes the corruption go away. On this motherboard (Supermicro with
onboard SCSI) the default PCI latency timer value on all devices is
0x40. If I reduce this to 0x20 on ahc0, ahc1, fxp0, fxp1 and pcib1,
then I can't repeat the corruption. When I put it back to 0x40 on ahc0
and ahc1 the corruption returns.

Has anyone any ideas on what this might mean? If a FIFO somewhere is
filling or a DMA is failing, shouldn't an error get back to the driver
or OS somehow? Or is this just a sign of dying hardware? There were no
hardware or software changes to the machine around the time that the
corruption first appeared, but there could have been an increase in
NFS load.

Ian

Copyright (c) 1992-2000 The FreeBSD Project.
Copyright (c) 1979, 1980, 1983, 1986, 1988, 1989, 1991, 1992, 1993, 1994
	The Regents of the University of California. All rights reserved.
FreeBSD 4.2-STABLE #0: Wed Jan 31 21:04:21 GMT 2001
    iedowse@gosset.maths.tcd.ie:/mnt/obj/usr/src/sys/MACCULLAGH
Timecounter "i8254" frequency 1193182 Hz
Timecounter "TSC" frequency 451024893 Hz
CPU: Pentium III/Pentium III Xeon/Celeron (451.02-MHz 686-class CPU)
  Origin = "GenuineIntel" Id = 0x672 Stepping = 2
  Features=0x387fbff<FPU,VME,DE,PSE,TSC,MSR,PAE,MCE,CX8,APIC,SEP,MTRR,PGE,MCA,CMOV,PAT,PSE36,PN,MMX,FXSR,SSE>
real memory = 268369920 (262080K bytes)
avail memory = 257765376 (251724K bytes)
Preloaded elf kernel "kernel" at 0xc0352000.
Preloaded elf module "green_saver.ko" at 0xc035209c.
ccd0-3: Concatenated disk drivers
Pentium Pro MTRR support enabled
md0: Malloc disk
npx0: <math processor> on motherboard
npx0: INT 16 interface
pcib0: <Intel 82443BX (440 BX) host to PCI bridge> on motherboard
pci0: <PCI bus> on pcib0
pcib1: <Intel 82443BX (440 BX) PCI-PCI (AGP) bridge> at device 1.0 on pci0
pci1: <PCI bus> on pcib1
isab0: <Intel 82371AB PCI to ISA bridge> at device 7.0 on pci0
isa0: <ISA bus> on isab0
atapci0: <Intel PIIX4 ATA33 controller> port 0xf000-0xf00f at device 7.1 on pci0
ata0: at 0x1f0 irq 14 on atapci0
ata1: at 0x170 irq 15 on atapci0
pci0: <Intel 82371AB/EB (PIIX4) USB controller> at 7.2
chip1: <Intel 82371AB Power management controller> port 0x5000-0x500f at device 7.3 on pci0
pci0: <S3 ViRGE DX/GX graphics accelerator> at 15.0 irq 11
ahc0: <Adaptec 2940 Ultra2 SCSI adapter> port 0xd400-0xd4ff mem 0xed200000-0xed200fff irq 10 at device 16.0 on pci0
aic7890/91: Wide Channel A, SCSI Id=7, 32/255 SCBs
fxp0: <Intel Pro 10/100B/100+ Ethernet> port 0xd800-0xd81f mem 0xed000000-0xed0fffff,0xed201000-0xed201fff irq 12 at device 18.0 on pci0
fxp0: Ethernet address 00:90:27:12:56:c5
fxp1: <Intel Pro 10/100B/100+ Ethernet> port 0xdc00-0xdc1f mem 0xed100000-0xed1fffff,0xed203000-0xed203fff irq 5 at device 19.0 on pci0
fxp1: Ethernet address 00:90:27:1d:1f:0b
ahc1: <Adaptec aic7890/91 Ultra2 SCSI adapter> port 0xe000-0xe0ff mem 0xed202000-0xed202fff irq 5 at device 20.0 on pci0
aic7890/91: Wide Channel A, SCSI Id=7, 32/255 SCBs
fdc0: <NEC 72065B or clone> at port 0x3f0-0x3f5,0x3f7 irq 6 drq 2 on isa0
fdc0: FIFO enabled, 8 bytes threshold
fd0: <1440-KB 3.5" drive> on fdc0 drive 0
atkbdc0: <Keyboard controller (i8042)> at port 0x60,0x64 on isa0
atkbd0: <AT Keyboard> flags 0x1 irq 1 on atkbdc0
kbd0 at atkbd0
vga0: <Generic ISA VGA> at port 0x3c0-0x3df iomem 0xa0000-0xbffff on isa0
sc0: <System console> at flags 0x100 on isa0
sc0: VGA <16 virtual consoles, flags=0x300>
sio0 at port 0x3f8-0x3ff irq 4 flags 0x10 on isa0
sio0: type 16550A
sio1 at port 0x2f8-0x2ff irq 3 on isa0
sio1: type 16550A
ppc0: <Parallel port> at port 0x378-0x37f irq 7 on isa0
ppc0: SMC-like chipset (ECP/EPP/PS2/NIBBLE) in COMPATIBLE mode
ppc0: FIFO with 16/16/16 bytes threshold
plip0: <PLIP network interface> on ppbus0
lpt0: <Printer> on ppbus0
lpt0: Interrupt-driven port
ppi0: <Parallel I/O> on ppbus0
IP packet filtering initialized, divert disabled, rule-based forwarding enabled, default to accept, unlimited logging
acd0: CDROM <BCD-40XH CD-ROM> at ata0-master using PIO4
Waiting 5 seconds for SCSI devices to settle
Mounting root from ufs:/dev/da0s1a
da0 at ahc0 bus 0 target 0 lun 0
da0: <QUANTUM QM39100TD-SW N491> Fixed Direct Access SCSI-2 device
da0: 80.000MB/s transfers (40.000MHz, offset 31, 16bit), Tagged Queueing Enabled
da0: 8683MB (17783250 512 byte sectors: 255H 63S/T 1106C)
da3 at ahc1 bus 0 target 3 lun 0
da3: <IBM DMVS18V 0250> Fixed Direct Access SCSI-3 device
da3: 80.000MB/s transfers (40.000MHz, offset 31, 16bit), Tagged Queueing Enabled
da3: 17501MB (35843670 512 byte sectors: 255H 63S/T 2231C)
da1 at ahc0 bus 0 target 1 lun 0
da1: <IBM DMVS18V 0250> Fixed Direct Access SCSI-3 device
da1: 80.000MB/s transfers (40.000MHz, offset 31, 16bit), Tagged Queueing Enabled
da1: 17501MB (35843670 512 byte sectors: 255H 63S/T 2231C)
da4 at ahc1 bus 0 target 4 lun 0
da4: <IBM DMVS18V 0250> Fixed Direct Access SCSI-3 device
da4: 80.000MB/s transfers (40.000MHz, offset 31, 16bit), Tagged Queueing Enabled
da4: 17501MB (35843670 512 byte sectors: 255H 63S/T 2231C)
da2 at ahc0 bus 0 target 2 lun 0
da2: <IBM DMVS18V 0250> Fixed Direct Access SCSI-3 device
da2: 80.000MB/s transfers (40.000MHz, offset 31, 16bit), Tagged Queueing Enabled
da2: 17501MB (35843670 512 byte sectors: 255H 63S/T 2231C)

To Unsubscribe: send mail to majordomo@FreeBSD.org
with "unsubscribe freebsd-scsi" in the body of the message