Date: 31 Jan 2001 19:21:07 -0500
From: Andrew Heybey <ath@niksun.com>
To: freebsd-scsi@freebsd.org
Cc: Ian Dowse <iedowse@maths.tcd.ie>
Subject: Re: Corruption on ahc reads - seems PCI latency related
Message-ID: <85r91jqmmj.fsf@stiegl.niksun.com>
In-Reply-To: Ian Dowse's message of "Wed, 31 Jan 2001 22:53:10 +0000"
References: <200101312253.aa86550@salmon.maths.tcd.ie>

Ian Dowse <iedowse@maths.tcd.ie> writes:
> We have a heavily loaded 4.2-STABLE NFS fileserver machine that
> has recently developed a file corruption problem. The corruption
> seems to be occurring during reads from one SCSI disk (da0). It
> appears that small regions (usually 18 bytes) of a read are 'missed',
> so the buffer cache ends up with mostly the new data, but some
> bytes are from whatever happened to be in the buffer cache before
> the read.
>
[...]
> The odd thing is that we can only reproduce the corruption when
> reading from da0 (Quantum 9Gb), while writing over NFS to another
> disk (I have only tried da2). Swapping out da0 with another similar
> disk did not help.
>
> Anyway, today I tried fiddling with the PCI latency timer settings,
> and it seems that reducing the value of the ahc PCI latency timer
> makes the corruption go away. On this motherboard (Supermicro with
> onboard SCSI) the default PCI latency timer value on all devices
> is 0x40. If I reduce this to 0x20 on ahc0,ahc1,fxp0,fxp1,pcib1,
> then I can't repeat the corruption. When I put it back to 0x40 on
> ahc0 and ahc1 the corruption returns.
>
> Has anyone any ideas on what this might mean? If a FIFO somewhere
> is filling or a DMA is failing, shouldn't an error get back to the
> driver or OS somehow? Or is this just a sign of dying hardware?
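
For scale, the PCI latency timer counts PCI bus clocks, so the two settings mentioned above can be converted to wall-clock time. The sketch below assumes a conventional 33 MHz PCI clock, which the message does not state; the function name is made up for illustration.

```c
/* Assumption: 33 MHz PCI clock; the thread does not give the bus speed. */
#define PCI_CLOCK_MHZ 33u

/*
 * Convert a latency timer value (a count of PCI clocks) into
 * nanoseconds, truncating toward zero.
 */
unsigned latency_ns(unsigned timer_clocks)
{
    return timer_clocks * 1000u / PCI_CLOCK_MHZ;
}
```

Under that assumption 0x40 is a little under 2 microseconds of guaranteed bus tenure per burst and 0x20 is about half that, so lowering the value shortens how long any one bus master can hold the bus once another is waiting.
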

This sounds almost exactly like a problem I had with 3.1 in 1999.
Under heavy disk and network load I would see exactly this problem.
Fiddling with the PCI latency registers seemed to fix the problem at
first but then it came back. See kern/10243. However (as noted at
the end of the PR) my problem went away with
sys/dev/aic7xxx/aic7xxx.seq revision 1.91.

Looking at the diffs from 1.90 to 1.91, the fix for the bug is:

+ultra2_dmafifoflush:
or DFCNTRL, FIFOFLUSH;
- test DFSTATUS, FIFOEMP jz . - 1;
+ /*
+ * The FIFOEMP status bit on the Ultra2 class
+ * of controllers seems to be a bit flaky.
+ * It appears that if the FIFO is full and the
+ * transfer ends with some data in the REQ/ACK
+ * FIFO, FIFOEMP will fall temporarily
+ * as the data is transferred to the PCI bus.
+ * This glitch lasts for fewer than 5 clock cycles,
+ * so we work around the problem by ensuring the
+ * status bit stays false through a full glitch
+ * window.
+ */
+ test DFSTATUS, FIFOEMP jz ultra2_dmafifoflush;
+ test DFSTATUS, FIFOEMP jz ultra2_dmafifoflush;
+ test DFSTATUS, FIFOEMP jz ultra2_dmafifoflush;
+ test DFSTATUS, FIFOEMP jz ultra2_dmafifoflush;
+ test DFSTATUS, FIFOEMP jz ultra2_dmafifoflush;
+
+ultra2_dmafifoempty:
+ /* Don't clobber an inprogress host data transfer */
+ test DFSTATUS, MREQPEND jnz ultra2_dmafifoempty;
+
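
The effect of the five back-to-back tests can be modelled in C as a debounce over a recorded trace of status reads. This is only a sketch of the idea, not driver code: the bit value chosen for FIFOEMP and both function names are made up for illustration.

```c
#include <stddef.h>
#include <stdint.h>

#define FIFOEMP       0x01u  /* hypothetical bit position, illustration only */
#define GLITCH_WINDOW 5      /* consecutive tests in the 1.91 diff */

/*
 * Old behaviour (rev 1.90): fall through on the first read that shows
 * FIFOEMP set. Returns the index of that read, or -1 if none.
 */
long single_test_index(const uint8_t *trace, size_t n)
{
    for (size_t i = 0; i < n; i++)
        if (trace[i] & FIFOEMP)
            return (long)i;
    return -1;
}

/*
 * New behaviour (rev 1.91): fall through only after GLITCH_WINDOW
 * consecutive reads show FIFOEMP set. Any read with the bit clear
 * restarts the count, mirroring "jz ultra2_dmafifoflush".
 * Returns the index of the final read in the run, or -1.
 */
long debounced_index(const uint8_t *trace, size_t n)
{
    int run = 0;
    for (size_t i = 0; i < n; i++) {
        if (trace[i] & FIFOEMP) {
            if (++run == GLITCH_WINDOW)
                return (long)i;
        } else {
            run = 0;
        }
    }
    return -1;
}
```

With a trace such as 0,1,1,0,1,1,1,1,1,1 the single test exits at the second read, during the glitch, while the debounced version waits until five set reads in a row have been seen.
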

In -stable, the corresponding code seems to be (rev 1.94.2.8):

ultra2_dmafifoflush:
if ((ahc->bugs & AHC_AUTOFLUSH_BUG) != 0) {
/*
* On Rev A of the aic7890, the autoflush
* features doesn't function correctly.
* Perform an explicit manual flush. During
* a manual flush, the FIFOEMP bit becomes
* true every time the PCI FIFO empties
* regardless of the state of the SCSI FIFO.
* It can take up to 4 clock cycles for the
* SCSI FIFO to get data into the PCI FIFO
* and for FIFOEMP to de-assert. Here we
* guard against this condition by making
* sure the FIFOEMP bit stays on for 5 full
* clock cycles.
*/
or DFCNTRL, FIFOFLUSH;
test DFSTATUS, FIFOEMP jz ultra2_dmafifoflush;
test DFSTATUS, FIFOEMP jz ultra2_dmafifoflush;
test DFSTATUS, FIFOEMP jz ultra2_dmafifoflush;
test DFSTATUS, FIFOEMP jz ultra2_dmafifoflush;
}
test DFSTATUS, FIFOEMP jz ultra2_dmafifoflush;

Maybe AHC_AUTOFLUSH_BUG does not get set for all the chips that
actually have the bug? That is a WAG, since I am by no means an ahc
expert.
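
To spell out that guess: as I read the -stable excerpt, the glitch guard is only as wide as the bug flag allows. The sketch below is an illustration in C, not the driver's code. The flag value is a placeholder (the real constant is defined in the driver headers) and the function name is made up.

```c
#include <stdint.h>

/* Placeholder value for illustration; not taken from the driver. */
#define AHC_AUTOFLUSH_BUG 0x1u

/*
 * Number of consecutive FIFOEMP-set reads needed before the flush
 * loop exits, per the rev 1.94.2.8 excerpt: four tests inside the
 * conditional block plus the one unconditional test after it.
 */
unsigned reads_required(uint32_t bugs)
{
    return (bugs & AHC_AUTOFLUSH_BUG) ? 5u : 1u;
}
```

If a chip with the flaky FIFOEMP bit is not flagged, it gets the single test, which is the same exposure as revision 1.90 had.
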
andrew
