Date: Fri, 7 Nov 2003 10:10:07 -0800
From: Kris Kennaway
To: sos@FreeBSD.org, re@FreeBSD.org
cc: current@FreeBSD.org
Subject: Too many uncorrectable read errors with atang
Message-ID: <20031107181007.GA19911@rot13.obsecurity.org>
List-Id: Discussions about the use of FreeBSD-current

Since upgrading the bento package machines to -current I am getting a
lot of the following errors:

ad0: FAILURE - READ_DMA status=51 error=40

For example:

ad0: FAILURE - READ_DMA status=51 error=40
ad0: FAILURE - READ_DMA status=51 error=40
ad0: FAILURE - READ_DMA status=51 error=40
ad0: FAILURE - READ_DMA status=51 error=40
ad0: FAILURE - READ_DMA status=51 error=40
ad0: FAILURE - READ_DMA status=51 error=40
ad0: FAILURE - READ_DMA status=51 error=40
ad0: FAILURE - READ_DMA status=51 error=40
ad0: FAILURE - READ_DMA status=51 error=40
ad0: FAILURE - READ_DMA status=51 error=40
ad0: FAILURE - READ_DMA status=51 error=40
ad0: FAILURE - READ_DMA status=51 error=40
ad0: TIMEOUT - READ_DMA retrying (2 retries left)
ata0: resetting devices ..
ad0: FAILURE - already active DMA on this device
ad0: setting up DMA failed
panic: initiate_write_inodeblock_ufs2: already started
Debugger("panic")
Stopped at Debugger+0x54: xchgl %ebx,in_Debugger.0
db> trace
Debugger(c0739e72,c07ac4a0,c074d9d0,d897b7a4,100) at Debugger+0x54
panic(c074d9d0,c058d793,d897b7cc,c058d72b,c07af7e0) at panic+0xd5
initiate_write_inodeblock_ufs2(c54c8780,cec0f1e8,1,c5a88400,c46f2b40) at initiate_write_inodeblock_ufs2+0x6e6
softdep_disk_io_initiation(cec0f1e8,c073916a,167,1,fcf58000) at softdep_disk_io_initiation+0x8d
spec_xstrategy(c4ed3b68,cec0f1e8,c13e6720,c4e791bc,200200a0) at spec_xstrategy+0x117
spec_specstrategy(d897b8ec,d897b914,c05adbf4,d897b8ec,1) at spec_specstrategy+0x72
spec_vnoperate(d897b8ec,1,c073ff9e,360,0) at spec_vnoperate+0x18
bwrite(cec0f1e8,cec0f1e8,1,8000,0) at bwrite+0x424
ffs_update(c5aab490,1,d897b9b0,c058d72b,c07af880) at ffs_update+0x31b
ffs_truncate(c5aab490,0,0,c00,0) at ffs_truncate+0x8d8
ufs_inactive(d897bbfc,d897bc18,c05c1a13,d897bbfc,0) at ufs_inactive+0x10c
ufs_vnoperate(d897bbfc,0,c074185c,8d3,c07953a0) at ufs_vnoperate+0x18
vput(c5aab490,825d2,0,d897bc38,c074185c) at vput+0x143
handle_workitem_remove(c5b40a20,0,2,c07afa88,c4e63800) at handle_workitem_remove+0x1d1
process_worklist_item(0,0,3faba10a,0,d897bcf0) at process_worklist_item+0x19e
softdep_process_worklist(0,0,c074185c,6e0,0) at softdep_process_worklist+0xe0
sched_sync(0,d897bd48,c0737724,311,aaf2e368) at sched_sync+0x384
fork_exit(c05c0770,0,d897bd48) at fork_exit+0xb4
fork_trampoline() at fork_trampoline+0x8
--- trap 0x1, eip = 0, esp = 0xd897bd7c, ebp = 0 ---
db>

So far this has happened (well, the panic above was new) on 5 separate
machines that were all working on older -current. Now, these are all
IBM DeathStar drives, but previously I was only experiencing ata
errors every month or two, and they were correctable for another month
or two by /dev/zero'ing the drive. To suddenly start receiving errors
on 5 out of 7 drives in the past few weeks is a significant anomaly.

Perhaps one of the following is happening:

1) All my drives have performed mass suicide at once

2) ATAng is detecting errors that the ATAog did not

3) ATAng is not trying as hard as ATAog to recover from the errors
from the crappy drives

4) ATAng has a bug on this hardware.

Furthermore, I'd like to know why the panic occurred above.

Following is an excerpt from boot -v:

atapci0: port 0xffa0-0xffaf at device 31.1 on pci0
ata0: reset tp1 mask=03 ostat0=50 ostat1=00
ata0-master: stat=0x50 err=0x01 lsb=0x00 msb=0x00
ata0-slave: stat=0x00 err=0x01 lsb=0x00 msb=0x00
ata0: reset tp2 mask=03 stat0=50 stat1=00 devices=0x1
ata0: at 0x1f0 irq 14 on atapci0
ata0: [MPSAFE]
ata1: at 0x170 irq 15 on atapci0
ata1: [MPSAFE]
[...]
ata0-master: pio=0x0c wdma=0x22 udma=0x45 cable=80pin
ad0: setting UDMA66 on Intel ICH chip
GEOM: create disk ad0 dp=0xc47a4070
ad0: ATA-5 disk at ata0-master
ad0: 29314MB (60036480 sectors), 59560 C, 16 H, 63 S, 512 B
ad0: 16 secs/int, 1 depth queue, UDMA66
GEOM: new disk ad0
GEOM: Configure ad0b, start 0 length 1073741824 end 1073741823
GEOM: Configure ad0c, start 0 length 30738677760 end 30738677759
GEOM: Configure ad0e, start 1073741824 length 2147483648 end 3221225471
GEOM: Configure ad0f, start 3221225472 length 27517452288 end 30738677759

Kris
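
For readers decoding the "status=51 error=40" values in the FAILURE lines above, here is a minimal sketch. It assumes the two numbers are hexadecimal dumps of the ATA status and error registers and uses the classic task-file bit names. The helper names (decode, decode_line) and the mnemonic spellings are illustrative choices, not taken from the driver.

```python
import re

# Classic ATA status register bits (assumed layout, highest bit first).
STATUS_BITS = {
    0x80: "BSY",    # device busy
    0x40: "DRDY",   # device ready
    0x20: "DF",     # device fault
    0x10: "DSC",    # seek complete
    0x08: "DRQ",    # data request
    0x04: "CORR",   # corrected data
    0x02: "IDX",    # index
    0x01: "ERR",    # error; details are in the error register
}

# Classic ATA error register bits (assumed layout, highest bit first).
ERROR_BITS = {
    0x80: "ICRC/BBK",  # interface CRC error / bad block
    0x40: "UNC",       # uncorrectable data error
    0x20: "MC",        # media changed
    0x10: "IDNF",      # sector ID not found
    0x08: "MCR",       # media change requested
    0x04: "ABRT",      # command aborted
    0x02: "NM/TK0NF",  # no media / track 0 not found
    0x01: "AMNF",      # address mark not found
}

# Matches the "status=51 error=40" part of a log line.
LINE_RE = re.compile(r"status=([0-9a-fA-F]+)\s+error=([0-9a-fA-F]+)")


def decode(value, table):
    """Return the names of all bits of `table` that are set in `value`."""
    return [name for bit, name in table.items() if value & bit]


def decode_line(line):
    """Decode one log line; return None if it has no status/error pair."""
    match = LINE_RE.search(line)
    if match is None:
        return None
    status = int(match.group(1), 16)
    error = int(match.group(2), 16)
    return decode(status, STATUS_BITS), decode(error, ERROR_BITS)


if __name__ == "__main__":
    print(decode_line("ad0: FAILURE - READ_DMA status=51 error=40"))
```

Under these assumptions, status 0x51 reads as DRDY, DSC and ERR, and error 0x40 reads as UNC, which is consistent with the "uncorrectable read errors" in the subject line.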
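
The size and layout figures in the boot -v excerpt can be cross-checked with plain arithmetic. The sketch below only re-derives numbers already printed in the log. The variable and function names are illustrative, and it assumes "MB" in the probe line means 2^20 bytes.

```python
SECTOR_SIZE = 512                     # "512 B" from the ad0 probe line
SECTORS = 60036480                    # "60036480 sectors"
CYLINDERS, HEADS, SPT = 59560, 16, 63 # "59560 C, 16 H, 63 S"

# (start, length, end) in bytes, copied from the "GEOM: Configure" lines.
PARTITIONS = {
    "ad0b": (0, 1073741824, 1073741823),
    "ad0c": (0, 30738677760, 30738677759),
    "ad0e": (1073741824, 2147483648, 3221225471),
    "ad0f": (3221225472, 27517452288, 30738677759),
}


def total_bytes():
    """Disk size in bytes from the reported sector count."""
    return SECTORS * SECTOR_SIZE


def chs_sectors():
    """Sector count implied by the reported C/H/S geometry."""
    return CYLINDERS * HEADS * SPT


def range_ok(start, length, end):
    """A range is self-consistent if end is its last byte offset."""
    return end == start + length - 1


def tiles_disk(names):
    """True if the named partitions, in order, cover the disk without gaps."""
    offset = 0
    for name in names:
        start, length, _end = PARTITIONS[name]
        if start != offset:
            return False
        offset = start + length
    return offset == total_bytes()


if __name__ == "__main__":
    print(total_bytes(), total_bytes() // 2**20, chs_sectors())
    print(all(range_ok(*p) for p in PARTITIONS.values()))
    print(tiles_disk(["ad0b", "ad0e", "ad0f"]))
```

With the logged values, the sector count, the C/H/S geometry, the 29314MB figure and the ad0c length all agree, and ad0b, ad0e and ad0f together cover the disk with no gaps, so the probe output itself looks internally consistent.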