From owner-freebsd-scsi Wed Jan 31 14:53:39 2001
Delivered-To: freebsd-scsi@freebsd.org
Received: from salmon.maths.tcd.ie (salmon.maths.tcd.ie [134.226.81.11])
    by hub.freebsd.org (Postfix) with SMTP id 548A437B491
    for ; Wed, 31 Jan 2001 14:53:12 -0800 (PST)
Received: from walton.maths.tcd.ie by salmon.maths.tcd.ie with SMTP
    id ; 31 Jan 2001 22:53:10 +0000 (GMT)
To: scsi@freebsd.org
Cc: iedowse@maths.tcd.ie
Subject: Corruption on ahc reads - seems PCI latency related
Date: Wed, 31 Jan 2001 22:53:10 +0000
From: Ian Dowse
Message-ID: <200101312253.aa86550@salmon.maths.tcd.ie>
Sender: owner-freebsd-scsi@FreeBSD.ORG
Precedence: bulk
X-Loop: FreeBSD.org

We have a heavily loaded 4.2-STABLE NFS fileserver machine that has
recently developed a file corruption problem. The corruption seems to
be occurring during reads from one SCSI disk (da0). It appears that
small regions (usually 18 bytes) of a read are 'missed', so the buffer
cache ends up with mostly the new data, but some bytes are from
whatever happened to be in the buffer cache before the read.

Here's an example of the corruption on an executable:

--- good        Fri Jan 19 21:46:18 2001
+++ corrupted   Fri Jan 19 21:46:18 2001
@@ -5468,4 +5468,4 @@
 000155d0  25 73 00 43 6f 75 6c 64  20 6e 6f 74 20 6f 70 65  |%s.Could not ope|
-000155e0  6e 20 68 6f 73 74 00 61  50 3a 75 3a 70 3a 4a 3a  |n host.aP:u:p:J:|
-000155f0  72 64 3a 67 3a 00 4d 61  6c 66 6f 72 6d 65 64 20  |rd:g:.Malformed |
+000155e0  6e 20 68 6f 73 74 00 61  50 3a 75 3a 70 3a 6d 40  |n host.aP:u:p:m@|
+000155f0  2f 36 df 7e 01 d9 6d 40  e0 ef 12 20 fd ce 6d 40  |/6.~..m@... ..m@|
 00015600  55 52 4c 3a 20 25 73 0a  00 00 00 00 00 00 00 00  |URL: %s.........|

All the examples I have seen involve the last few bytes of a 512-byte
block. Sample offsets are 0x1dee, 0x15ee, 0x1df0, 0x15f0, 0x195ee. In
the above example, the junk in place of the real data happens to be
from a Matlab data file that was written from an NFS client to a
different local disk (da2).
No corruption was seen in the Matlab data file.

I am able to repeat this corruption by doing the following:

    # clear out buffer cache
    perl -e '$_ = "x" x 12800000'
    # start a continuous write from an NFS client to da2
    rsh client "cat hugefile > /server_da2/file"
    # /usr/local is on da0
    md5 /usr/local/bin/* | diff /tmp/good_md5.out -
    # examine resulting differences

The odd thing is that we can only reproduce the corruption when
reading from da0 (Quantum 9Gb), while writing over NFS to another
disk (I have only tried da2). Swapping out da0 with another similar
disk did not help.

Anyway, today I tried fiddling with the PCI latency timer settings,
and it seems that reducing the value of the ahc PCI latency timer
makes the corruption go away. On this motherboard (Supermicro with
onboard SCSI) the default PCI latency timer value on all devices is
0x40. If I reduce this to 0x20 on ahc0, ahc1, fxp0, fxp1 and pcib1,
then I can't repeat the corruption. When I put it back to 0x40 on ahc0
and ahc1 the corruption returns.

Has anyone any ideas on what this might mean? If a FIFO somewhere is
filling or a DMA is failing, shouldn't an error get back to the driver
or OS somehow? Or is this just a sign of dying hardware? There were no
hardware or software changes to the machine around the time that the
corruption first appeared, but there could have been an increase in
NFS load.

Ian

Copyright (c) 1992-2000 The FreeBSD Project.
Copyright (c) 1979, 1980, 1983, 1986, 1988, 1989, 1991, 1992, 1993, 1994
    The Regents of the University of California. All rights reserved.
FreeBSD 4.2-STABLE #0: Wed Jan 31 21:04:21 GMT 2001
    iedowse@gosset.maths.tcd.ie:/mnt/obj/usr/src/sys/MACCULLAGH
Timecounter "i8254" frequency 1193182 Hz
Timecounter "TSC" frequency 451024893 Hz
CPU: Pentium III/Pentium III Xeon/Celeron (451.02-MHz 686-class CPU)
  Origin = "GenuineIntel"  Id = 0x672  Stepping = 2
  Features=0x387fbff
real memory  = 268369920 (262080K bytes)
avail memory = 257765376 (251724K bytes)
Preloaded elf kernel "kernel" at 0xc0352000.
Preloaded elf module "green_saver.ko" at 0xc035209c.
ccd0-3: Concatenated disk drivers
Pentium Pro MTRR support enabled
md0: Malloc disk
npx0: on motherboard
npx0: INT 16 interface
pcib0: on motherboard
pci0: on pcib0
pcib1: at device 1.0 on pci0
pci1: on pcib1
isab0: at device 7.0 on pci0
isa0: on isab0
atapci0: port 0xf000-0xf00f at device 7.1 on pci0
ata0: at 0x1f0 irq 14 on atapci0
ata1: at 0x170 irq 15 on atapci0
pci0: at 7.2
chip1: port 0x5000-0x500f at device 7.3 on pci0
pci0: at 15.0 irq 11
ahc0: port 0xd400-0xd4ff mem 0xed200000-0xed200fff irq 10 at device 16.0 on pci0
aic7890/91: Wide Channel A, SCSI Id=7, 32/255 SCBs
fxp0: port 0xd800-0xd81f mem 0xed000000-0xed0fffff,0xed201000-0xed201fff irq 12 at device 18.0 on pci0
fxp0: Ethernet address 00:90:27:12:56:c5
fxp1: port 0xdc00-0xdc1f mem 0xed100000-0xed1fffff,0xed203000-0xed203fff irq 5 at device 19.0 on pci0
fxp1: Ethernet address 00:90:27:1d:1f:0b
ahc1: port 0xe000-0xe0ff mem 0xed202000-0xed202fff irq 5 at device 20.0 on pci0
aic7890/91: Wide Channel A, SCSI Id=7, 32/255 SCBs
fdc0: at port 0x3f0-0x3f5,0x3f7 irq 6 drq 2 on isa0
fdc0: FIFO enabled, 8 bytes threshold
fd0: <1440-KB 3.5" drive> on fdc0 drive 0
atkbdc0: at port 0x60,0x64 on isa0
atkbd0: flags 0x1 irq 1 on atkbdc0
kbd0 at atkbd0
vga0: at port 0x3c0-0x3df iomem 0xa0000-0xbffff on isa0
sc0: at flags 0x100 on isa0
sc0: VGA <16 virtual consoles, flags=0x300>
sio0 at port 0x3f8-0x3ff irq 4 flags 0x10 on isa0
sio0: type 16550A
sio1 at port 0x2f8-0x2ff irq 3 on isa0
sio1: type 16550A
ppc0: at port 0x378-0x37f irq 7 on isa0
ppc0: SMC-like chipset (ECP/EPP/PS2/NIBBLE) in COMPATIBLE mode
ppc0: FIFO with 16/16/16 bytes threshold
plip0: on ppbus0
lpt0: on ppbus0
lpt0: Interrupt-driven port
ppi0: on ppbus0
IP packet filtering initialized, divert disabled, rule-based forwarding enabled, default to accept, unlimited logging
acd0: CDROM at ata0-master using PIO4
Waiting 5 seconds for SCSI devices to settle
Mounting root from ufs:/dev/da0s1a
da0 at ahc0 bus 0 target 0 lun 0
da0: Fixed Direct Access SCSI-2 device
da0: 80.000MB/s transfers (40.000MHz, offset 31, 16bit), Tagged Queueing Enabled
da0: 8683MB (17783250 512 byte sectors: 255H 63S/T 1106C)
da3 at ahc1 bus 0 target 3 lun 0
da3: Fixed Direct Access SCSI-3 device
da3: 80.000MB/s transfers (40.000MHz, offset 31, 16bit), Tagged Queueing Enabled
da3: 17501MB (35843670 512 byte sectors: 255H 63S/T 2231C)
da1 at ahc0 bus 0 target 1 lun 0
da1: Fixed Direct Access SCSI-3 device
da1: 80.000MB/s transfers (40.000MHz, offset 31, 16bit), Tagged Queueing Enabled
da1: 17501MB (35843670 512 byte sectors: 255H 63S/T 2231C)
da4 at ahc1 bus 0 target 4 lun 0
da4: Fixed Direct Access SCSI-3 device
da4: 80.000MB/s transfers (40.000MHz, offset 31, 16bit), Tagged Queueing Enabled
da4: 17501MB (35843670 512 byte sectors: 255H 63S/T 2231C)
da2 at ahc0 bus 0 target 2 lun 0
da2: Fixed Direct Access SCSI-3 device
da2: 80.000MB/s transfers (40.000MHz, offset 31, 16bit), Tagged Queueing Enabled
da2: 17501MB (35843670 512 byte sectors: 255H 63S/T 2231C)

To Unsubscribe: send mail to majordomo@FreeBSD.org
with "unsubscribe freebsd-scsi" in the body of the message
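
The report notes that every corrupted region covers the last few bytes
of a 512-byte block. Here is a minimal Python sketch for checking that
pattern against a known-good copy of a file. The names diff_runs,
ends_on_sector_boundary and SECTOR_SIZE are illustrative assumptions,
not anything from the report, and the demo data is synthetic.

```python
SECTOR_SIZE = 512  # block size mentioned in the report


def diff_runs(good: bytes, bad: bytes):
    """Return (start_offset, length) for each run of differing bytes."""
    runs = []
    start = None
    limit = min(len(good), len(bad))
    for i in range(limit):
        if good[i] != bad[i]:
            if start is None:
                start = i  # a new run of mismatches begins here
        elif start is not None:
            runs.append((start, i - start))
            start = None
    if start is not None:  # mismatch run reaches the end of the data
        runs.append((start, limit - start))
    return runs


def ends_on_sector_boundary(start: int, length: int) -> bool:
    """True if the run stops exactly where a 512-byte sector ends."""
    return (start + length) % SECTOR_SIZE == 0


if __name__ == "__main__":
    # Synthetic stand-in data (an assumption for the demo): two sectors of
    # zeros, with the final 18 bytes of the first sector overwritten.
    good = bytes(2 * SECTOR_SIZE)
    bad = bytearray(good)
    bad[494:512] = b"\xff" * 18
    for start, length in diff_runs(good, bytes(bad)):
        print(hex(start), length, start % SECTOR_SIZE,
              ends_on_sector_boundary(start, length))
```

On real files, the two byte strings would come from reading the good
and corrupted copies in binary mode, for example
open(path, "rb").read(). A run that begins at offset 0x1ee or 0x1f0
within its sector and ends on the boundary would match the offsets
listed in the report.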