Date:      Mon, 20 Sep 2004 13:25:39 -0700
From:      "Kevin Oberman" <oberman@es.net>
To:        current@freebsd.org
Subject:   Re: ad0: TIMEOUT - READ_DMA retrying (2 retries left) LBA=207594611
Message-ID:  <20040920202539.10F925D0A@ptavv.es.net>

Well, I spent the weekend building systems and kernels and I can now be
pretty sure that this is a timing-related issue.

I had previously reported that I could create the problem by starting my
xl Ethernet card. I have since learned that the two issues are not
closely coupled; rather, the problem with the Ethernet is triggering the
problem with the disks.

First, the problem I am seeing with the xl is long-standing. It appeared
on about June 30 or July 1. (It will take a few more kernels to track it
down further, and I'm on the road for a few days and can't play with the
system.) But even prior to that date, if I disable ACPI, the same
behavior shows up. (Dead Ethernet and continual "xl0: watchdog timeout"
messages.) I have no idea when the problem started when ACPI is disabled.

The xl0 problem causes the system to pause and, after some changes to
the kernel in late July or early August, the problems with ATA joined
the xl0 problem. If I turn off xl0 (and, probably, if the xl0 problem
were fixed), the disk errors go away. Because of this, I suspect that the
added delays caused by the xl0 timeouts are actually triggering the ATA
timeouts. Since others are seeing the same error under heavy load, I
imagine that other things can trigger the same DMA timeouts on ATA.

When I get home, I'll try to figure out exactly which patch is causing
the problem with the xl and then go after the patch that caused the ATA
error. I can say that it shows up earlier (with the xl to trigger it)
than others have reported. I think it started in early August, but I am
sure it was present in RELENG_5 by August 15 and was probably present
when RELENG_5 was branched.

Sorry that I ran out of time before I could track this down better, but
I hope this helps and I'll continue tracking the exact failures when I
get home.
-- 
R. Kevin Oberman, Network Engineer
Energy Sciences Network (ESnet)
Ernest O. Lawrence Berkeley National Laboratory (Berkeley Lab)
E-mail: oberman@es.net			Phone: +1 510 486-8634


