From: Nathaniel W Filardo
To: freebsd-current@freebsd.org
Date: Thu, 29 Oct 2009 17:39:29 -0400
Message-ID: <20091029213929.GD19125@gradx.cs.jhu.edu>
Subject: SATA disk error and hang until "atacontrol reinit" ?

I have a FreeBSD/SPARC machine:

    FreeBSD hydra.priv.oc.ietfng.org 9.0-CURRENT FreeBSD 9.0-CURRENT #11: Mon Oct 19 22:08:50 EDT 2009 root@hydra.priv.oc.ietfng.org:/systank/obj/systank/src/sys/NWFKERN sparc64

with an

    atapci1: port 0x300-0x3ff mem 0x600000-0x6fffff,0x800000-0xbfffff at device 1.0 on pci3

and eight SATA2 disks:

    ad0: 305245MB at ata4-master SATA300
    ad1: 305245MB at ata5-master SATA300
    ad2: 305245MB at ata6-master SATA300
    ad3: 305245MB at ata7-master SATA300
    ad4: 715404MB at ata8-master SATA300
    ad5: 715404MB at ata9-master SATA300
    ad6: 715404MB at ata10-master SATA300
    ad7: 715404MB at ata11-master SATA300

The two sets of four disks are each RAIDZ'd together, and the two RAIDZs are in one storage pool. I've been stress-testing the disks by scrubbing and find that after a few days of uptime, I will get

    ad0: FAILURE - READ_DMA status=51 error=0 LBA=103200892

(It's always ad0 that fails) and all I/O directed at this storage pool through ZFS hangs. (I have not yet tested with dd from the raw disks; didn't think to do it, sorry.) During this period, zpool status reports 1 checksum error from ad0, though I don't know whether this occurs before, after, or in synchrony with the ad0 READ_DMA FAILURE.

Previously, I just rebooted, but this time I thought to run "atacontrol reinit ata4" (which is the channel holding ad0).
That caused the kernel to say

    ad0: WARNING - WRITE_DMA48 requeued due to channel reset LBA=625104384
    ad0: FAILURE - already active DMA on this device
    ad0: setting up DMA failed

zpool status now indicates that the scrub is proceeding again, and that ad0 has suffered 3 read, 1 write, and 1 checksum error. I/O directed at the storage tank works again.

Is my disk going bad or is there something more funny here? Even if the disk is going bad, shouldn't the controller time out the request eventually?

Thanks much in advance.
--nwf;
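
For reading the status value in the READ_DMA failure line quoted above, a minimal sketch follows. It assumes the kernel prints the status and error bytes in hexadecimal and that the bits follow the conventional ATA status register layout. The names ATA_STATUS_BITS, decode_ata_status, and parse_failure are illustrative only and are not part of any FreeBSD tool.

```python
import re

# Conventional ATA status register bits (assumed layout; the names are
# generic and may differ from what a given driver prints).
ATA_STATUS_BITS = {
    0x80: "BSY",   # device busy
    0x40: "DRDY",  # device ready
    0x20: "DF",    # device fault
    0x10: "DSC",   # seek complete
    0x08: "DRQ",   # data request
    0x04: "CORR",  # corrected data
    0x02: "IDX",   # index
    0x01: "ERR",   # error; details are in the error register
}

# Hypothetical pattern for the failure line shown in this message.
FAILURE_RE = re.compile(
    r"^(?P<dev>\w+): FAILURE - (?P<cmd>\w+) "
    r"status=(?P<status>[0-9a-fA-F]+) "
    r"error=(?P<error>[0-9a-fA-F]+) "
    r"LBA=(?P<lba>\d+)$"
)


def decode_ata_status(value):
    """Return the names of the bits set in an ATA status byte."""
    return [name for mask, name in ATA_STATUS_BITS.items() if value & mask]


def parse_failure(line):
    """Split a failure line into its fields, assuming hex status/error."""
    match = FAILURE_RE.match(line.strip())
    if match is None:
        return None
    return {
        "dev": match.group("dev"),
        "cmd": match.group("cmd"),
        "status": int(match.group("status"), 16),
        "error": int(match.group("error"), 16),
        "lba": int(match.group("lba")),
    }


if __name__ == "__main__":
    info = parse_failure(
        "ad0: FAILURE - READ_DMA status=51 error=0 LBA=103200892"
    )
    print(info["dev"], info["cmd"], decode_ata_status(info["status"]))
```

Under these assumptions the status byte has the ready, seek-complete, and error bits set while the error register is zero, so the line alone does not say what kind of error the drive reported.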