Skip site navigation (1)Skip section navigation (2)
Date:      Wed, 18 Jun 2003 00:53:18 +1000
From:      Stephen McKay <smckay@internode.on.net>
To:        freebsd-hardware@freebsd.org
Cc:        Stephen McKay <smckay@internode.on.net>
Subject:   ATA READ command timeout (and worse)
Message-ID:  <200306171453.h5HErICg012759@dungeon.home>

next in thread | raw e-mail | index | archive | help
I've been plagued by ATA READ command timeouts on my home NFS server.
Tonight I decided to get to the bottom of it, so I started torturing
it. :-)

The machine is a Celeron 400 with 256MB ram in a Soyo 7IZB motherboard.
There are 4 disks (1 on the M/B, and 3 on a SiI 0680).  There are also
2 NICs and a Trio64 in there.  It's pretty cramped, but I know it's not
overheating because I've got a thermometer in there too.  (And it's
winter here.)  It can serve up 15MB/s of NFS traffic (though 2 NICs),
which isn't bad for an old CPU.

Here's a selection from dmesg:

Jun 17 23:54:11 peon /kernel: FreeBSD 4.7-RELEASE-p10 #15: Mon Mar 31 23:21:34 E
ST 2003
Jun 17 23:54:11 peon /kernel: atapci0: <Intel PIIX4 ATA33 controller> port 0xf00
0-0xf00f at device 7.1 on pci0
Jun 17 23:54:11 peon /kernel: ata0: at 0x1f0 irq 14 on atapci0
Jun 17 23:54:11 peon /kernel: ata1: at 0x170 irq 15 on atapci0
Jun 17 23:54:11 peon /kernel: atapci1: <Sil 0680 ATA133 controller> port 0xe400-
0xe40f,0xe000-0xe003,0xdc00-0xdc07,0xd800-0xd803,0xd400-0xd407 mem 0xec001000-0x
ec0010ff irq 11 at device 15.0 on pci0
Jun 17 23:54:11 peon /kernel: ata2: at 0xd400 on atapci1
Jun 17 23:54:11 peon /kernel: ata3: at 0xdc00 on atapci1
Jun 17 23:54:11 peon /kernel: ad0: 19470MB <IBM-DJNA-352030> [39560/16/63] at at
a0-master UDMA33
Jun 17 23:54:11 peon /kernel: ad4: 78533MB <IC35L080AVVA07-0> [159560/16/63] at 
ata2-master UDMA100
Jun 17 23:54:11 peon /kernel: ad5: 39266MB <IBM-DTLA-305040> [79780/16/63] at at
a2-slave UDMA100
Jun 17 23:54:11 peon /kernel: ad6: 117246MB <Maxtor 6Y120L0> [238216/16/63] at a
ta3-master UDMA133

My problem is I get this sort of thing frequently:

Jun 17 22:28:02 peon /kernel: ad6: READ command timeout tag=0 serv=0 - resetting
Jun 17 22:28:02 peon /kernel: ata3: resetting devices .. done
...
Jun 17 23:05:03 peon /kernel: ad4: READ command timeout tag=0 serv=0 - resetting
Jun 17 23:05:03 peon /kernel: ata2: resetting devices .. ad4: DMA limited to UDMA33, non-ATA66 cable or device
Jun 17 23:05:03 peon /kernel: done

I also get this sometimes:

Jun 17 23:47:53 peon /kernel: ad4: no status, reselecting device
Jun 17 23:48:03 peon /kernel: ad4: READ command timeout tag=0 serv=0 - resetting
Jun 17 23:48:03 peon /kernel: ata2: resetting devices .. ad4: DMA limited to UDMA33, non-ATA66 cable or device
Jun 17 23:48:03 peon /kernel: done

Note the cable most certainly IS ATA66 (and ATA100) capable, and correctly
connected.  Note that at boot time ad4 is NOT limited to UDMA33.

I have changed the cables and this made no difference.  I've tried manually
limiting the speed (with atacontrol) to UDMA33 and that made no difference.

During the time the read commmand is timing out there is no data flow
(as shown by systat) for 10 seconds, and this matches the timeout in
the ata code.  My first theory was that the driver wasn't giving the read
command a decent chance.  The evidence is against that theory. :-)

I tried accessing just a single disk at one time and was NOT able to make
any individual disk fail.  When I made two disks busy (by running diff over
6GB of fairly large files) there was at least one failure, and sometimes
as many as 6 failures per test run.  One test run caused a spectacular
failure:

Jun 17 23:49:48 peon /kernel: ad4: no status, reselecting device
Jun 17 23:49:48 peon /kernel: ad4: timeout sending command=c8 s=ff e=00
Jun 17 23:49:48 peon /kernel: ad4: error executing command - resetting
Jun 17 23:49:48 peon /kernel: ata2: resetting devices .. 
Jun 17 23:49:48 peon /kernel: ad4: removed from configuration
Jun 17 23:49:48 peon /kernel: ad5: removed from configuration
Jun 17 23:49:48 peon /kernel: done

At this point, the machine spontaneously rebooted!

So, what should I conclude?  I can think of only two reasonable
possibilities: there is a race condition in the ata driver causing
command completions to occasionally get lost, or my power supply
just can't keep up with two disks reading simultaneously.

The power supply has 230W written on it, and I'm assuming that (like
nearly all power supplies) it's an optimistic value.  I really don't
want to take the whole thing apart so I'm hoping someone finds an ata
driver bug!  Note that xmbmon shows the supply voltages don't particularly
change between idle and active states.  I don't know if this is much proof
of anything.

If anyone is successfully using two disks simultaneously on a SiI 0680
I'd like to hear about it.  Actually I'd like to hear about anybody at all
using a SiI 0680 (and whether or not they are having a good time) as they
are less than half the price of the Promise gear!

By the way, each "resetting devices" event caused data loss.  That is, when
running diff over two disks, "binary file blahblahblah differs" was printed
for each reset.  I didn't expect this.

Also, no amount of dodgy power supplies can explain the way the driver
forgets there really is a UDMA100 cable on ad4.

And I think the "ad4: removed from configuration" and reboot bit is a bug
too.  It should wait a bit and have another go before deleting my disks!

Cheers,

Stephen.



Want to link to this message? Use this URL: <https://mail-archive.FreeBSD.org/cgi/mid.cgi?200306171453.h5HErICg012759>