Date:      Wed, 31 Jan 2001 22:53:10 +0000
From:      Ian Dowse <iedowse@maths.tcd.ie>
To:        scsi@freebsd.org
Cc:        iedowse@maths.tcd.ie
Subject:   Corruption on ahc reads - seems PCI latency related
Message-ID:   <200101312253.aa86550@salmon.maths.tcd.ie>

We have a heavily loaded 4.2-STABLE NFS fileserver machine that
has recently developed a file corruption problem. The corruption
seems to be occurring during reads from one SCSI disk (da0). It
appears that small regions (usually 18 bytes) of a read are 'missed',
so the buffer cache ends up with mostly the new data, but some
bytes are from whatever happened to be in the buffer cache before
the read.

Here's an example of the corruption on an executable:

--- good Fri Jan 19 21:46:18 2001
+++ corrupted Fri Jan 19 21:46:18 2001
@@ -5468,4 +5468,4 @@
 000155d0  25 73 00 43 6f 75 6c 64  20 6e 6f 74 20 6f 70 65  |%s.Could not ope|
-000155e0  6e 20 68 6f 73 74 00 61  50 3a 75 3a 70 3a 4a 3a  |n host.aP:u:p:J:|
-000155f0  72 64 3a 67 3a 00 4d 61  6c 66 6f 72 6d 65 64 20  |rd:g:.Malformed |
+000155e0  6e 20 68 6f 73 74 00 61  50 3a 75 3a 70 3a 6d 40  |n host.aP:u:p:m@|
+000155f0  2f 36 df 7e 01 d9 6d 40  e0 ef 12 20 fd ce 6d 40  |/6.~..m@... ..m@|
 00015600  55 52 4c 3a 20 25 73 0a  00 00 00 00 00 00 00 00  |URL: %s.........|

All the examples I have seen involve the last few bytes of a 512-byte
block. Sample offsets are 0x1dee, 0x15ee, 0x1df0, 0x15f0, 0x195ee.
In the above example, the junk in place of the real data happens
to be from a Matlab data file that was written from an NFS client
to a different local disk (da2). No corruption was seen in the
Matlab data file.
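
As a rough check of the "last few bytes of a 512-byte block" pattern,
the Python sketch below reduces each quoted offset modulo the sector
size. The list also includes 0x155ee, which is where the hexdump above
first differs; the name tail_distance is made up for this illustration.

```python
# Offsets quoted in the message, plus the first differing byte in the
# hexdump above (0x155ee). 512 is the sector size reported for da0.
SECTOR = 512
OFFSETS = [0x1dee, 0x15ee, 0x1df0, 0x15f0, 0x195ee, 0x155ee]


def tail_distance(offset, sector=SECTOR):
    """Return how many bytes lie between offset and the end of its
    sector, counting the byte at offset itself."""
    return sector - (offset % sector)


for off in OFFSETS:
    print(f"{off:#x}: byte {off % SECTOR} of its sector, "
          f"{tail_distance(off)} bytes before the sector ends")
```

Every offset in the list lands 16 or 18 bytes before the end of a
sector, which matches the 18-byte damaged region in the example.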

I am able to repeat this corruption by doing the following:

	# clear out buffer cache
	perl -e '$_ = "x" x 12800000'

	# start a continuous write from an NFS client to da2
	rsh client "cat hugefile > /server_da2/file"

	# /usr/local is on da0
	md5 /usr/local/bin/* | diff /tmp/good_md5.out -

	# examine resulting differences
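
One way to examine the differences, sketched in Python: compare a known
good copy of a file with the copy read back from da0 and print each run
of differing bytes with its position inside a 512-byte sector. The
function names and any file paths passed in are hypothetical; nothing
here comes from the original test setup.

```python
def diff_runs(good: bytes, bad: bytes):
    """Yield (start, length) for each maximal run of differing bytes.

    Only the common prefix length of the two inputs is compared.
    """
    n = min(len(good), len(bad))
    start = None
    for i in range(n):
        if good[i] != bad[i]:
            if start is None:
                start = i          # a new run of differences begins
        elif start is not None:
            yield start, i - start  # the run ended just before i
            start = None
    if start is not None:
        yield start, n - start      # run extends to the end of the data


def report(good_path, bad_path, sector=512):
    """Print every differing run along with its offset within a sector."""
    with open(good_path, "rb") as g, open(bad_path, "rb") as b:
        good, bad = g.read(), b.read()
    for start, length in diff_runs(good, bad):
        end = start + length - 1
        print(f"{start:#x}: {length} bytes differ "
              f"(sector offsets {start % sector}..{end % sector})")
```

A stale byte that happens to equal the correct byte splits a run in
two, so the reported lengths are a lower bound on the size of each
missed region.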

The odd thing is that we can only reproduce the corruption when
reading from da0 (Quantum 9GB), while writing over NFS to another
disk (I have only tried da2). Swapping out da0 with another similar
disk did not help.

Anyway, today I tried fiddling with the PCI latency timer settings,
and it seems that reducing the value of the ahc PCI latency timer
makes the corruption go away. On this motherboard (Supermicro with
onboard SCSI) the default PCI latency timer value on all devices
is 0x40. If I reduce this to 0x20 on ahc0, ahc1, fxp0, fxp1 and pcib1,
then I can't repeat the corruption. When I put it back to 0x40 on
ahc0 and ahc1 the corruption returns.
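
For scale, the latency timer is counted in PCI clocks, so the two
settings can be converted to time and to an upper bound on data moved
in one guaranteed bus tenure. The Python sketch below assumes a
conventional 33 MHz, 32-bit PCI bus, which the message does not state,
and ignores address phases and wait states. It only illustrates what
the values mean and is not a diagnosis.

```python
PCI_CLOCK_HZ = 33_333_333  # assumption: conventional 33 MHz PCI
BUS_WIDTH_BYTES = 4        # assumption: 32-bit bus, 4 bytes per clock


def latency_budget(timer_value):
    """Return (clocks, microseconds, max_bytes) for a latency timer value.

    max_bytes is an upper bound: it assumes one data phase on every
    clock of the guaranteed window.
    """
    clocks = timer_value
    microseconds = clocks / PCI_CLOCK_HZ * 1e6
    max_bytes = clocks * BUS_WIDTH_BYTES
    return clocks, microseconds, max_bytes


for value in (0x40, 0x20):
    clocks, us, max_bytes = latency_budget(value)
    print(f"{value:#04x}: {clocks} clocks, about {us:.2f} us, "
          f"at most {max_bytes} bytes per tenure")
```

Under those assumptions, going from 0x40 to 0x20 halves the guaranteed
window from 64 clocks to 32.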

Has anyone any ideas on what this might mean? If a FIFO somewhere
is filling or a DMA is failing, shouldn't an error get back to the
driver or OS somehow? Or is this just a sign of dying hardware?

There were no hardware or software changes to the machine around
the time that the corruption first appeared, but there could have
been an increase in NFS load. 

Ian

Copyright (c) 1992-2000 The FreeBSD Project.
Copyright (c) 1979, 1980, 1983, 1986, 1988, 1989, 1991, 1992, 1993, 1994
	The Regents of the University of California. All rights reserved.
FreeBSD 4.2-STABLE #0: Wed Jan 31 21:04:21 GMT 2001
    iedowse@gosset.maths.tcd.ie:/mnt/obj/usr/src/sys/MACCULLAGH
Timecounter "i8254"  frequency 1193182 Hz
Timecounter "TSC"  frequency 451024893 Hz
CPU: Pentium III/Pentium III Xeon/Celeron (451.02-MHz 686-class CPU)
  Origin = "GenuineIntel"  Id = 0x672  Stepping = 2
  Features=0x387fbff<FPU,VME,DE,PSE,TSC,MSR,PAE,MCE,CX8,APIC,SEP,MTRR,PGE,MCA,CMOV,PAT,PSE36,PN,MMX,FXSR,SSE>
real memory  = 268369920 (262080K bytes)
avail memory = 257765376 (251724K bytes)
Preloaded elf kernel "kernel" at 0xc0352000.
Preloaded elf module "green_saver.ko" at 0xc035209c.
ccd0-3: Concatenated disk drivers
Pentium Pro MTRR support enabled
md0: Malloc disk
npx0: <math processor> on motherboard
npx0: INT 16 interface
pcib0: <Intel 82443BX (440 BX) host to PCI bridge> on motherboard
pci0: <PCI bus> on pcib0
pcib1: <Intel 82443BX (440 BX) PCI-PCI (AGP) bridge> at device 1.0 on pci0
pci1: <PCI bus> on pcib1
isab0: <Intel 82371AB PCI to ISA bridge> at device 7.0 on pci0
isa0: <ISA bus> on isab0
atapci0: <Intel PIIX4 ATA33 controller> port 0xf000-0xf00f at device 7.1 on pci0
ata0: at 0x1f0 irq 14 on atapci0
ata1: at 0x170 irq 15 on atapci0
pci0: <Intel 82371AB/EB (PIIX4) USB controller> at 7.2
chip1: <Intel 82371AB Power management controller> port 0x5000-0x500f at device 7.3 on pci0
pci0: <S3 ViRGE DX/GX graphics accelerator> at 15.0 irq 11
ahc0: <Adaptec 2940 Ultra2 SCSI adapter> port 0xd400-0xd4ff mem 0xed200000-0xed200fff irq 10 at device 16.0 on pci0
aic7890/91: Wide Channel A, SCSI Id=7, 32/255 SCBs
fxp0: <Intel Pro 10/100B/100+ Ethernet> port 0xd800-0xd81f mem 0xed000000-0xed0fffff,0xed201000-0xed201fff irq 12 at device 18.0 on pci0
fxp0: Ethernet address 00:90:27:12:56:c5
fxp1: <Intel Pro 10/100B/100+ Ethernet> port 0xdc00-0xdc1f mem 0xed100000-0xed1fffff,0xed203000-0xed203fff irq 5 at device 19.0 on pci0
fxp1: Ethernet address 00:90:27:1d:1f:0b
ahc1: <Adaptec aic7890/91 Ultra2 SCSI adapter> port 0xe000-0xe0ff mem 0xed202000-0xed202fff irq 5 at device 20.0 on pci0
aic7890/91: Wide Channel A, SCSI Id=7, 32/255 SCBs
fdc0: <NEC 72065B or clone> at port 0x3f0-0x3f5,0x3f7 irq 6 drq 2 on isa0
fdc0: FIFO enabled, 8 bytes threshold
fd0: <1440-KB 3.5" drive> on fdc0 drive 0
atkbdc0: <Keyboard controller (i8042)> at port 0x60,0x64 on isa0
atkbd0: <AT Keyboard> flags 0x1 irq 1 on atkbdc0
kbd0 at atkbd0
vga0: <Generic ISA VGA> at port 0x3c0-0x3df iomem 0xa0000-0xbffff on isa0
sc0: <System console> at flags 0x100 on isa0
sc0: VGA <16 virtual consoles, flags=0x300>
sio0 at port 0x3f8-0x3ff irq 4 flags 0x10 on isa0
sio0: type 16550A
sio1 at port 0x2f8-0x2ff irq 3 on isa0
sio1: type 16550A
ppc0: <Parallel port> at port 0x378-0x37f irq 7 on isa0
ppc0: SMC-like chipset (ECP/EPP/PS2/NIBBLE) in COMPATIBLE mode
ppc0: FIFO with 16/16/16 bytes threshold
plip0: <PLIP network interface> on ppbus0
lpt0: <Printer> on ppbus0
lpt0: Interrupt-driven port
ppi0: <Parallel I/O> on ppbus0
IP packet filtering initialized, divert disabled, rule-based forwarding enabled, default to accept, unlimited logging
acd0: CDROM <BCD-40XH CD-ROM> at ata0-master using PIO4
Waiting 5 seconds for SCSI devices to settle
Mounting root from ufs:/dev/da0s1a
da0 at ahc0 bus 0 target 0 lun 0
da0: <QUANTUM QM39100TD-SW N491> Fixed Direct Access SCSI-2 device 
da0: 80.000MB/s transfers (40.000MHz, offset 31, 16bit), Tagged Queueing Enabled
da0: 8683MB (17783250 512 byte sectors: 255H 63S/T 1106C)
da3 at ahc1 bus 0 target 3 lun 0
da3: <IBM DMVS18V 0250> Fixed Direct Access SCSI-3 device 
da3: 80.000MB/s transfers (40.000MHz, offset 31, 16bit), Tagged Queueing Enabled
da3: 17501MB (35843670 512 byte sectors: 255H 63S/T 2231C)
da1 at ahc0 bus 0 target 1 lun 0
da1: <IBM DMVS18V 0250> Fixed Direct Access SCSI-3 device 
da1: 80.000MB/s transfers (40.000MHz, offset 31, 16bit), Tagged Queueing Enabled
da1: 17501MB (35843670 512 byte sectors: 255H 63S/T 2231C)
da4 at ahc1 bus 0 target 4 lun 0
da4: <IBM DMVS18V 0250> Fixed Direct Access SCSI-3 device 
da4: 80.000MB/s transfers (40.000MHz, offset 31, 16bit), Tagged Queueing Enabled
da4: 17501MB (35843670 512 byte sectors: 255H 63S/T 2231C)
da2 at ahc0 bus 0 target 2 lun 0
da2: <IBM DMVS18V 0250> Fixed Direct Access SCSI-3 device 
da2: 80.000MB/s transfers (40.000MHz, offset 31, 16bit), Tagged Queueing Enabled
da2: 17501MB (35843670 512 byte sectors: 255H 63S/T 2231C)

