From: Andrew Heybey
Date: 31 Jan 2001 19:21:07 -0500
To: freebsd-scsi@freebsd.org
Cc: Ian Dowse
Subject: Re: Corruption on ahc reads - seems PCI latency related
Message-ID: <85r91jqmmj.fsf@stiegl.niksun.com>
References: <200101312253.aa86550@salmon.maths.tcd.ie>
In-Reply-To: Ian Dowse's message of "Wed, 31 Jan 2001 22:53:10 +0000"

Ian Dowse writes:

> We have a heavily loaded 4.2-STABLE NFS fileserver machine that
> has recently developed a file corruption problem. The corruption
> seems to be occurring during reads from one SCSI disk (da0). It
> appears that small regions (usually 18 bytes) of a read are 'missed',
> so the buffer cache ends up with mostly the new data, but some
> bytes are from whatever happened to be in the buffer cache before
> the read.
> [...]
> The odd thing is that we can only reproduce the corruption when
> reading from da0 (Quantum 9Gb), while writing over NFS to another
> disk (I have only tried da2). Swapping out da0 with another similar
> disk did not help.
>
> Anyway, today I tried fiddling with the PCI latency timer settings,
> and it seems that reducing the value of the ahc PCI latency timer
> makes the corruption go away. On this motherboard (Supermicro with
> onboard SCSI) the default PCI latency timer value on all devices
> is 0x40.
> If I reduce this to 0x20 on ahc0,ahc1,fxp0,fxp1,pcib1,
> then I can't repeat the corruption. When I put it back to 0x40 on
> ahc0 and ahc1 the corruption returns.
>
> Has anyone any ideas on what this might mean? If a FIFO somewhere
> is filling or a DMA is failing, shouldn't an error get back to the
> driver or OS somehow? Or is this just a sign of dying hardware?

This sounds almost exactly like a problem I had with 3.1 in 1999.
Under heavy disk and network load I would see exactly this problem.
Fiddling with the PCI latency registers seemed to fix the problem at
first, but then it came back. See kern/10243.

However (as noted at the end of the PR) my problem went away with
sys/dev/aic7xxx/aic7xxx.seq revision 1.91. Looking at the diffs from
1.90 to 1.91, the fix for the bug is:

+ultra2_dmafifoflush:
         or      DFCNTRL, FIFOFLUSH;
-        test    DFSTATUS, FIFOEMP jz . - 1;
+        /*
+         * The FIFOEMP status bit on the Ultra2 class
+         * of controllers seems to be a bit flaky.
+         * It appears that if the FIFO is full and the
+         * transfer ends with some data in the REQ/ACK
+         * FIFO, FIFOEMP will fall temporarily
+         * as the data is transferred to the PCI bus.
+         * This glitch lasts for fewer than 5 clock cycles,
+         * so we work around the problem by ensuring the
+         * status bit stays false through a full glitch
+         * window.
+         */
+        test    DFSTATUS, FIFOEMP jz ultra2_dmafifoflush;
+        test    DFSTATUS, FIFOEMP jz ultra2_dmafifoflush;
+        test    DFSTATUS, FIFOEMP jz ultra2_dmafifoflush;
+        test    DFSTATUS, FIFOEMP jz ultra2_dmafifoflush;
+        test    DFSTATUS, FIFOEMP jz ultra2_dmafifoflush;
+
+ultra2_dmafifoempty:
+        /* Don't clobber an inprogress host data transfer */
+        test    DFSTATUS, MREQPEND jnz ultra2_dmafifoempty;
+

In -stable, the corresponding code seems to be (rev 1.94.2.8):

ultra2_dmafifoflush:
        if ((ahc->bugs & AHC_AUTOFLUSH_BUG) != 0) {
                /*
                 * On Rev A of the aic7890, the autoflush
                 * features doesn't function correctly.
                 * Perform an explicit manual flush.  During
                 * a manual flush, the FIFOEMP bit becomes
                 * true every time the PCI FIFO empties
                 * regardless of the state of the SCSI FIFO.
                 * It can take up to 4 clock cycles for the
                 * SCSI FIFO to get data into the PCI FIFO
                 * and for FIFOEMP to de-assert.  Here we
                 * guard against this condition by making
                 * sure the FIFOEMP bit stays on for 5 full
                 * clock cycles.
                 */
                or      DFCNTRL, FIFOFLUSH;
                test    DFSTATUS, FIFOEMP jz ultra2_dmafifoflush;
                test    DFSTATUS, FIFOEMP jz ultra2_dmafifoflush;
                test    DFSTATUS, FIFOEMP jz ultra2_dmafifoflush;
                test    DFSTATUS, FIFOEMP jz ultra2_dmafifoflush;
        }
        test    DFSTATUS, FIFOEMP jz ultra2_dmafifoflush;

Maybe AHC_AUTOFLUSH_BUG does not get set for all the chips that
actually have the bug? That is a WAG, since I am by no means an ahc
expert.

andrew
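
For readers who do not know the sequencer syntax, here is a rough
Python model of the guard that both versions of the code above share.
Five back-to-back "test DFSTATUS, FIFOEMP jz ultra2_dmafifoflush"
instructions mean FIFOEMP has to read as set five times in a row, and
any miss jumps back to the label and starts over. This is only a
sketch of that idea, not driver code: wait_stable, read_bit, required
and max_polls are made-up names, and the timeout is an addition for
the sake of the example (the sequencer simply keeps looping).

```python
def wait_stable(read_bit, required=5, max_polls=10_000):
    """Poll read_bit() until it returns True `required` times in a row.

    Returns the number of polls that were needed. Raises TimeoutError
    if the bit never stays set long enough within max_polls polls.
    All names here are hypothetical; this only models the idea.
    """
    streak = 0
    for polls in range(1, max_polls + 1):
        if read_bit():
            streak += 1
            if streak == required:
                return polls
        else:
            # Equivalent of "jz ultra2_dmafifoflush": a single miss
            # throws away the streak and the check starts over.
            streak = 0
    raise TimeoutError("status bit never stayed set for %d polls" % required)


if __name__ == "__main__":
    # A bit that looks set, glitches low once, then settles.
    samples = iter([True, True, False, True, True, True, True, True])
    # A single test would have accepted the first sample; the
    # five-in-a-row check only passes on the eighth poll.
    print(wait_stable(lambda: next(samples)))
```

The point of the repeated test is that a one-shot check (as in rev
1.90) can pass on a sample taken inside the glitch window, while the
repeated check only passes once the bit has outlasted it.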