Date: Tue, 15 Sep 2009 19:21:02 +0300
From: Alexander Motin <mav@FreeBSD.org>
To: Pawel Jakub Dawidek <pjd@FreeBSD.org>
Cc: Kris Kennaway <kris@FreeBSD.org>, FreeBSD Current <current@freebsd.org>
Subject: Re: ata timeouts under load
Message-ID: <4AAFBEEE.8090101@FreeBSD.org>
In-Reply-To: <20090915155436.GB2199@garage.freebsd.pl>
References: <4AAD4E51.5060908@FreeBSD.org> <4AAD5365.5000902@FreeBSD.org> <20090915155436.GB2199@garage.freebsd.pl>
Pawel Jakub Dawidek wrote:
> On Sun, Sep 13, 2009 at 11:17:41PM +0300, Alexander Motin wrote:
>> Kris Kennaway wrote:
>>> I am getting timeouts on 8.0b4/HEAD when I do a lot of ZFS I/O to a pool
>>> on ad4:
>>>
>>> atapci0: <VIA 6420 SATA150 controller> port
>>> 0xc800-0xc807,0xc400-0xc403,0xc000-0xc007,0xb800-0xb803,0xb400-0xb40f,0xb000-0xb0ff
>>> irq 20 at device 15.0 on pci0
>>> ata2: <ATA channel 0> on atapci0
>>> ata3: <ATA channel 1> on atapci0
>>> ata0: <ATA channel 0> on atapci1
>>> ata1: <ATA channel 1> on atapci1
>>>
>>> ad4: 476940MB <WDC WD5000AAKS-00TMA0 12.01C01> at ata2-master SATA150
>>> ad4: WARNING - SETFEATURES SET TRANSFER MODE taskqueue timeout -
>>> completing request directly
>>> ad4: WARNING - SETFEATURES SET TRANSFER MODE taskqueue timeout -
>>> completing request directly
>>> ad4: WARNING - SETFEATURES ENABLE RCACHE taskqueue timeout - completing
>>> request directly
>>> ad4: WARNING - SETFEATURES ENABLE WCACHE taskqueue timeout - completing
>>> request directly
>>> ad4: WARNING - SET_MULTI taskqueue timeout - completing request directly
>>> ad4: TIMEOUT - WRITE_DMA48 retrying (1 retry left) LBA=344052040
>>> ad4: WARNING - SETFEATURES SET TRANSFER MODE taskqueue timeout -
>>> completing request directly
>>> ad4: WARNING - SETFEATURES SET TRANSFER MODE taskqueue timeout -
>>> completing request directly
>>>
>>> It becomes stuck in a loop displaying the above and is unable to
>>> complete further I/O operations. I wonder if it is just batching up a
>>> lot of I/O and then timing out because it is busy, and then not
>>> recovering from this state?
>>>
>>> Any ideas what could be wrong?
>> There are two different kinds of timeouts we can see:
>> - the first one, "ad4: WARNING - ...", is just a queue waiting timeout.
>> It is not the cause but a consequence of the problem, and I doubt
>> that doing this is reasonable.
>> - the second one, "TIMEOUT - WRITE_DMA48 ...", is a real command
>> execution timeout. I don't know whether this is the result of some
>> improper error recovery, or your drive really lost the required servo
>> information near LBA=344052040 and spends too long trying to find it.
>> You can try to read that sector and nearby ones with dd.
>
> Could this be related to BIO_FLUSH requests?

BIO_FLUSH is implemented via the FLUSHCACHE48 command, not WRITE_DMA48,
so the message doesn't fit. But theoretically, if the drive has write
caching enabled, FLUSHCACHE48 could take longer to execute than other
commands, especially given that ata(4) has very strict timeouts.

-- 
Alexander Motin
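The dd check suggested above boils down to reading each sector near the reported LBA on its own and noting which reads fail; for a single sector that is roughly what "dd if=/dev/ad4 of=/dev/null bs=512 skip=344052040 count=1" does. Below is a minimal Python sketch of the same idea. It is not a tool from this thread: it assumes 512-byte logical sectors and a readable device node such as /dev/ad4 (root access needed), and probe_sectors is a hypothetical helper name.

```python
import os

SECTOR_SIZE = 512  # assumption: the drive reports 512-byte logical sectors


def probe_sectors(path, lba, radius=16, sector_size=SECTOR_SIZE):
    """Read every sector in [lba - radius, lba + radius] one at a time.

    Returns a dict mapping each LBA to "ok", "short read" (e.g. past the
    end of the device), or the OS error message reported for that sector.
    """
    results = {}
    fd = os.open(path, os.O_RDONLY)
    try:
        for n in range(max(0, lba - radius), lba + radius + 1):
            try:
                # Sector-aligned offset and length; disk devices generally
                # reject reads that are not multiples of the sector size.
                data = os.pread(fd, sector_size, n * sector_size)
            except OSError as exc:  # e.g. EIO when the drive cannot read it
                results[n] = exc.strerror or str(exc)
            else:
                results[n] = "ok" if len(data) == sector_size else "short read"
    finally:
        os.close(fd)
    return results


# Hypothetical usage against the drive from the log (run as root):
#   for n, status in probe_sectors("/dev/ad4", 344052040).items():
#       print(n, status)
```

Reading one sector per call, rather than one large read over the whole range, means a single unreadable LBA shows up as its own entry instead of failing the entire request.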