Date: Thu, 29 Oct 2009 17:39:29 -0400
From: Nathaniel W Filardo <nwf@cs.jhu.edu>
To: freebsd-current@freebsd.org
Subject: SATA disk error and hang until "atacontrol reinit" ?
Message-ID: <20091029213929.GD19125@gradx.cs.jhu.edu>

I have a FreeBSD/SPARC

  FreeBSD hydra.priv.oc.ietfng.org 9.0-CURRENT FreeBSD 9.0-CURRENT #11: Mon Oct 19 22:08:50 EDT 2009 root@hydra.priv.oc.ietfng.org:/systank/obj/systank/src/sys/NWFKERN sparc64

with a

  atapci1: <Marvell 88SX6081 SATA300 controller> port 0x300-0x3ff mem 0x600000-0x6fffff,0x800000-0xbfffff at device 1.0 on pci3

and eight SATA2 disks:

  ad0: 305245MB <Seagate ST3320620AS 3.AAJ> at ata4-master SATA300
  ad1: 305245MB <Seagate ST3320620AS 3.AAE> at ata5-master SATA300
  ad2: 305245MB <Seagate ST3320620AS 3.AAE> at ata6-master SATA300
  ad3: 305245MB <Seagate ST3320620AS 3.AAJ> at ata7-master SATA300
  ad4: 715404MB <WDC WD7500AADS-00L5B1 01.01A01> at ata8-master SATA300
  ad5: 715404MB <WDC WD7500AADS-00L5B1 01.01A01> at ata9-master SATA300
  ad6: 715404MB <WDC WD7500AADS-00L5B1 01.01A01> at ata10-master SATA300
  ad7: 715404MB <WDC WD7500AADS-00L5B1 01.01A01> at ata11-master SATA300

The two sets of four disks are each RAIDZ'd together, and the two RAIDZs are
in one storage pool.

I've been stress-testing the disks by scrubbing and find that after a few
days of uptime, I will get

  ad0: FAILURE - READ_DMA status=51<READY,DSC,ERROR> error=0 LBA=103200892

(It's always ad0 that fails) and all I/O directed at this storage pool through
ZFS hangs. (I have not yet tested with dd from the raw disks; didn't think to
do it, sorry.) During this period, zpool status reports 1 checksum error
from ad0, though I don't know if this occurs before, after, or in synchrony
with the ad0 READ_DMA FAILURE.

Previously, I just rebooted, but this time I thought to run "atacontrol
reinit ata4" (which is the channel holding ad0).
That caused the kernel to say

  ad0: WARNING - WRITE_DMA48 requeued due to channel reset LBA=625104384
  ad0: FAILURE - already active DMA on this device
  ad0: setting up DMA failed

zpool status now indicates that the scrub is proceeding again, and that ad0
has suffered 3 read, 1 write, and 1 checksum error. I/O directed at the
storage tank works again.

Is my disk going bad or is there something more funny here? Even if the
disk is going bad, shouldn't the controller time out the request eventually?

Thanks much in advance.
--nwf;
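
For anyone comparing similar messages, here is a minimal Python sketch of how the READ_DMA FAILURE line quoted above can be picked apart. It is not part of the ata(4) driver or any FreeBSD tool. The names parse_failure, decode_status, FAILURE_RE, and STATUS_BITS are hypothetical, and the bit labels are assumed to match what the driver prints (they follow the standard ATA status register layout). The regular expression is fitted only to the one line shown in this message.

```python
import re

# ATA status register bits, highest first. The labels are an assumption:
# they mirror the names seen in the kernel message (READY, DSC, ERROR).
STATUS_BITS = [
    (0x80, "BUSY"),
    (0x40, "READY"),
    (0x20, "DMA_READY"),
    (0x10, "DSC"),
    (0x08, "DRQ"),
    (0x04, "CORRECTABLE"),
    (0x02, "INDEX"),
    (0x01, "ERROR"),
]

# Pattern fitted to the single FAILURE line quoted in this message.
FAILURE_RE = re.compile(
    r"^(?P<dev>ad\d+): FAILURE - (?P<cmd>\S+) "
    r"status=(?P<status>[0-9a-fA-F]+)(?:<[^>]*>)? "
    r"error=(?P<error>[0-9a-fA-F]+)(?:<[^>]*>)? "
    r"LBA=(?P<lba>\d+)$"
)


def decode_status(value):
    """Return the names of the status bits set in value."""
    return [name for bit, name in STATUS_BITS if value & bit]


def parse_failure(line):
    """Parse one FAILURE line; return a dict, or None if it does not match."""
    m = FAILURE_RE.match(line.strip())
    if m is None:
        return None
    status = int(m.group("status"), 16)  # status is printed in hex
    return {
        "device": m.group("dev"),
        "command": m.group("cmd"),
        "status": status,
        "status_flags": decode_status(status),
        "error": int(m.group("error"), 16),
        "lba": int(m.group("lba")),
    }


if __name__ == "__main__":
    msg = ("ad0: FAILURE - READ_DMA status=51<READY,DSC,ERROR> "
           "error=0 LBA=103200892")
    print(parse_failure(msg))
```

Status 0x51 is 0x40 + 0x10 + 0x01, which is why the kernel labels it READY, DSC, ERROR. The error register reads 0, so the drive flagged an error without recording a cause in that register.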