Date: Tue, 15 Sep 2009 23:31:28 -0500 (CDT)
From: Wes Morgan
To: Kris Kennaway
Cc: Alexander Motin, FreeBSD Current
In-Reply-To: <4AAD5DD2.4030104@FreeBSD.org>
Subject: Re: ata timeouts under load

On Sun, 13 Sep 2009, Kris Kennaway wrote:

> Alexander Motin wrote:
>> Kris Kennaway wrote:
>>> I am getting timeouts on 8.0b4/HEAD when I do a lot of ZFS I/O to a pool
>>> on ad4:
>>>
>>> atapci0: port
>>> 0xc800-0xc807,0xc400-0xc403,0xc000-0xc007,0xb800-0xb803,0xb400-0xb40f,0xb000-0xb0ff
>>> irq 20 at device 15.0 on pci0
>>> ata2: on atapci0
>>> ata3: on atapci0
>>> ata0: on atapci1
>>> ata1: on atapci1
>>>
>>> ad4: 476940MB at ata2-master SATA150
>>> ad4: WARNING - SETFEATURES SET TRANSFER MODE taskqueue timeout -
>>> completing request directly
>>> ad4: WARNING - SETFEATURES SET TRANSFER MODE taskqueue timeout -
>>> completing request directly
>>> ad4: WARNING - SETFEATURES ENABLE RCACHE taskqueue timeout - completing
>>> request directly
>>> ad4: WARNING - SETFEATURES ENABLE WCACHE taskqueue timeout - completing
>>> request directly
>>> ad4: WARNING - SET_MULTI taskqueue timeout - completing request directly
>>> ad4: TIMEOUT - WRITE_DMA48 retrying (1 retry left) LBA=344052040
>>> ad4: WARNING - SETFEATURES SET TRANSFER MODE taskqueue timeout -
>>> completing request directly
>>> ad4: WARNING - SETFEATURES SET TRANSFER MODE taskqueue timeout -
>>> completing request directly
>>>
>>> It becomes stuck in a loop displaying the above and is unable to
>>> complete further I/O operations. I wonder if it is just batching up a
>>> lot of I/O and then timing out because it is busy, and then not
>>> recovering from this state?
>>>
>>> Any ideas what could be wrong?
>>
>> There are two different kinds of timeouts we can see:
>> - first one, "ad4: WARNING - ..." is just a queue waiting timeout. It
>> is not the reason, but consequence of the problem. And I have doubts
>> that it is reasonable to do it.
>> - second one, "TIMEOUT - WRITE_DMA48 ..." is a real command execution
>> timeout. I don't know whether this is result of some improper error
>> recovery, or you drive indeed lost required servo information near
>> LBA=344052040 and tries to find it too long. You can try to read that
>> sector and nearby ones with dd.
>>
>
> It's always that sequence (with setfeatures timing out first, then the dma
> later)...and the block number varies widely, also whether it's read/write.
> The disk itself & the data it contains appears to be OK as far as I have been
> able to determine so far.

This may not be meaningful, but I used to have a lot of very similar
problems (the messages, the loop, etc. were exactly the same) with VIA
chipsets and an AMD CPU. They seemed to be triggered by a certain drive,
but I never could figure it out completely. I moved to an Intel board/CPU
and have never seen it since. That looks like an older SATA1 chipset, so
perhaps it could be the same problem. The problem was not related to ZFS.
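
For anyone wanting to script the check Alexander suggests in the quoted text (reading the sector at the reported LBA and its neighbours), here is a rough sketch in Python rather than dd. It is not taken from the thread: the function name read_sectors, the 512-byte sector size, and the window of 64 sectors on either side are all assumptions made for illustration.

```python
import os

SECTOR_SIZE = 512  # assumption: 512-byte logical sectors


def read_sectors(path, lba, before=64, after=64, sector_size=SECTOR_SIZE):
    """Read every sector in [lba - before, lba + after] from `path`.

    Returns the list of sector numbers that could not be read in full,
    either because the read raised an I/O error or because it came back
    short (for example, past the end of the device).
    """
    failed = []
    first = max(0, lba - before)
    fd = os.open(path, os.O_RDONLY)
    try:
        for n in range(first, lba + after + 1):
            try:
                # pread reads at an absolute offset without moving the
                # file position, so each sector is read independently.
                data = os.pread(fd, sector_size, n * sector_size)
            except OSError:
                failed.append(n)
                continue
            if len(data) != sector_size:
                failed.append(n)
    finally:
        os.close(fd)
    return failed
```

Run as root against the raw device, something like read_sectors("/dev/ad4", 344052040) would return an empty list if every sector in the window reads back. Since the failing block number is reported to vary widely, a clean result here would be consistent with a controller or driver problem rather than a bad spot on the disk, though it would not prove it.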