Skip site navigation (1)Skip section navigation (2)
Date:      Tue, 17 Jun 2003 23:54:02 +0800 (CST)
From:      Jaw-Shiang Joshua Huang <joshuah@synology.com>
To:        Stephen McKay <smckay@internode.on.net>, freebsd-hardware@freebsd.org
Subject:   Re: ATA READ command timeout (and worse)
Message-ID:  <200306171554.h5HFs2DQ041575@mail.synology.com>

next in thread | raw e-mail | index | archive | help
Hello, Stephen, 

I am interested in and have some questions about your case.
Please do not mind that I rise some questions below.

Because your machine will reboot automatically when the disk driver operation
is abnormal, it makes me want to know more.

Is your kernel compiled with DDB?  If not, it will reboot after 15 seconds
while hitting panic.  If it's reproducable, would you mind to compile a new 
kernel and try to find out where it panic or page fault?  I just want to know
this bug will make FreeBSD kernel reboot or just hit panic or page fault.

Do you think it's possible that sometimes the disk is not working properly?
In my memory, the disk DMA limitation checks the ATA identify command report
from the HDD or the register of IDE controller.  I'll doubt the HDD first. :)

Also, I guess that 230W is not enough to provide the current for 4 disks
concurrently.  Can you try your experiment with only 3 disks connected power
line again?  If possible, two will be better.

Sorry that I did not provide you any suggest but only questions.  I have not had a SiI IDE controller, either.

Joshua, Taiwan.

Stephen McKay <smckay@internode.on.net> said: 
> I've been plagued by ATA READ command timeouts on my home NFS server.
> Tonight I decided to get to the bottom of it, so I started torturing
> it. :-)
> 
> The machine is a Celeron 400 with 256MB ram in a Soyo 7IZB motherboard.
> There are 4 disks (1 on the M/B, and 3 on a SiI 0680).  There are also
> 2 NICs and a Trio64 in there.  It's pretty cramped, but I know it's not
> overheating because I've got a thermometer in there too.  (And it's
> winter here.)  It can serve up 15MB/s of NFS traffic (though 2 NICs),
> which isn't bad for an old CPU.
> 
> Here's a selection from dmesg:
> 
> Jun 17 23:54:11 peon /kernel: FreeBSD 4.7-RELEASE-p10 #15: Mon Mar 31 23:21:34 E
> ST 2003
> Jun 17 23:54:11 peon /kernel: atapci0: <Intel PIIX4 ATA33 controller> port 0xf00
> 0-0xf00f at device 7.1 on pci0
> Jun 17 23:54:11 peon /kernel: ata0: at 0x1f0 irq 14 on atapci0
> Jun 17 23:54:11 peon /kernel: ata1: at 0x170 irq 15 on atapci0
> Jun 17 23:54:11 peon /kernel: atapci1: <Sil 0680 ATA133 controller> port 0xe400-
> 0xe40f,0xe000-0xe003,0xdc00-0xdc07,0xd800-0xd803,0xd400-0xd407 mem 0xec001000-0x
> ec0010ff irq 11 at device 15.0 on pci0
> Jun 17 23:54:11 peon /kernel: ata2: at 0xd400 on atapci1
> Jun 17 23:54:11 peon /kernel: ata3: at 0xdc00 on atapci1
> Jun 17 23:54:11 peon /kernel: ad0: 19470MB <IBM-DJNA-352030> [39560/16/63] at at
> a0-master UDMA33
> Jun 17 23:54:11 peon /kernel: ad4: 78533MB <IC35L080AVVA07-0> [159560/16/63] at 
> ata2-master UDMA100
> Jun 17 23:54:11 peon /kernel: ad5: 39266MB <IBM-DTLA-305040> [79780/16/63] at at
> a2-slave UDMA100
> Jun 17 23:54:11 peon /kernel: ad6: 117246MB <Maxtor 6Y120L0> [238216/16/63] at a
> ta3-master UDMA133
> 
> My problem is I get this sort of thing frequently:
> 
> Jun 17 22:28:02 peon /kernel: ad6: READ command timeout tag=0 serv=0 - resetting
> Jun 17 22:28:02 peon /kernel: ata3: resetting devices .. done
> ...
> Jun 17 23:05:03 peon /kernel: ad4: READ command timeout tag=0 serv=0 - resetting
> Jun 17 23:05:03 peon /kernel: ata2: resetting devices .. ad4: DMA limited to UDMA33, non-ATA66 cable or device
> Jun 17 23:05:03 peon /kernel: done
> 
> I also get this sometimes:
> 
> Jun 17 23:47:53 peon /kernel: ad4: no status, reselecting device
> Jun 17 23:48:03 peon /kernel: ad4: READ command timeout tag=0 serv=0 - resetting
> Jun 17 23:48:03 peon /kernel: ata2: resetting devices .. ad4: DMA limited to UDMA33, non-ATA66 cable or device
> Jun 17 23:48:03 peon /kernel: done
> 
> Note the cable most certainly IS ATA66 (and ATA100) capable, and correctly
> connected.  Note that at boot time ad4 is NOT limited to UDMA33.
> 
> I have changed the cables and this made no difference.  I've tried manually
> limiting the speed (with atacontrol) to UDMA33 and that made no difference.
> 
> During the time the read commmand is timing out there is no data flow
> (as shown by systat) for 10 seconds, and this matches the timeout in
> the ata code.  My first theory was that the driver wasn't giving the read
> command a decent chance.  The evidence is against that theory. :-)
> 
> I tried accessing just a single disk at one time and was NOT able to make
> any individual disk fail.  When I made two disks busy (by running diff over
> 6GB of fairly large files) there was at least one failure, and sometimes
> as many as 6 failures per test run.  One test run caused a spectacular
> failure:
> 
> Jun 17 23:49:48 peon /kernel: ad4: no status, reselecting device
> Jun 17 23:49:48 peon /kernel: ad4: timeout sending command?sÿ e
> Jun 17 23:49:48 peon /kernel: ad4: error executing command - resetting
> Jun 17 23:49:48 peon /kernel: ata2: resetting devices .. 
> Jun 17 23:49:48 peon /kernel: ad4: removed from configuration
> Jun 17 23:49:48 peon /kernel: ad5: removed from configuration
> Jun 17 23:49:48 peon /kernel: done
> 
> At this point, the machine spontaneously rebooted!
> 
> So, what should I conclude?  I can think of only two reasonable
> possibilities: there is a race condition in the ata driver causing
> command completions to occasionally get lost, or my power supply
> just can't keep up with two disks reading simultaneously.
> 
> The power supply has 230W written on it, and I'm assuming that (like
> nearly all power supplies) it's an optimistic value.  I really don't
> want to take the whole thing apart so I'm hoping someone finds an ata
> driver bug!  Note that xmbmon shows the supply voltages don't particularly
> change between idle and active states.  I don't know if this is much proof
> of anything.
> 
> If anyone is successfully using two disks simultaneously on a SiI 0680
> I'd like to hear about it.  Actually I'd like to hear about anybody at all
> using a SiI 0680 (and whether or not they are having a good time) as they
> are less than half the price of the Promise gear!
> 
> By the way, each "resetting devices" event caused data loss.  That is, when
> running diff over two disks, "binary file blahblahblah differs" was printed
> for each reset.  I didn't expect this.
> 
> Also, no amount of dodgy power supplies can explain the way the driver
> forgets there really is a UDMA100 cable on ad4.
> 
> And I think the "ad4: removed from configuration" and reboot bit is a bug
> too.  It should wait a bit and have another go before deleting my disks!
> 
> Cheers,
> 
> Stephen.
> _______________________________________________
> freebsd-hardware@freebsd.org mailing list
> http://lists.freebsd.org/mailman/listinfo/freebsd-hardware
> To unsubscribe, send any mail to "freebsd-hardware-unsubscribe@freebsd.org"
> 



--
Regards, 

Joshua Huang  --  joshuah@synology.com
Tel: 886-2-25521814 ext. 823
Synology Inc.





Want to link to this message? Use this URL: <https://mail-archive.FreeBSD.org/cgi/mid.cgi?200306171554.h5HFs2DQ041575>