From owner-freebsd-stable@FreeBSD.ORG Sat Jan 26 01:03:17 2008
Message-ID: <479A86E5.5060806@skyrush.com>
Date: Fri, 25 Jan 2008 18:03:33 -0700
From: Joe Peterson
To: Jeremy Chadwick
Cc: freebsd-stable@freebsd.org
Subject: Re: "ad0: TIMEOUT - WRITE_DMA" type errors with 7.0-RC1
References: <479A0731.6020405@skyrush.com> <20080125162940.GA38494@eos.sc1.parodius.com> <479A3764.6050800@skyrush.com> <3803988D-8D18-4E89-92EA-19BF62FD2395@mac.com> <479A4CB0.5080206@skyrush.com> <20080126003845.GA52183@eos.sc1.parodius.com>
In-Reply-To: <20080126003845.GA52183@eos.sc1.parodius.com>
List-Id: Production branch of FreeBSD source code

Jeremy Chadwick wrote:
> Joe, I wanted to send you a note about something that I'm still in the
> process of dealing with.  The timing couldn't be more ironic.
>
> I decided it would be worthwhile to migrate from my two-disk ZFS stripe
> with a non-ZFS disk for nightly backups, to a RAIDZ pool of all 3
> disks combined (since they're all the same size).  I had another
> terminal with gstat -I500ms running in it, so I could see overall I/O.
>
> All was going well until about the 81GB mark of the copy.  gstat started
> showing 0KB in/out on all the drives, and the rsync was stalled.  ^Z did
> nothing, which is usually a bad sign.  :-)  I ssh'd in and did a dmesg
> (summarised):
>
> ad6: WARNING - SETFEATURES SET TRANSFER MODE taskqueue timeout - completing request directly
> ad6: WARNING - SETFEATURES ENABLE WCACHE taskqueue timeout - completing request directly
> ad6: WARNING - SET_MULTI taskqueue timeout - completing request directly
> ad6: TIMEOUT - WRITE_DMA retrying (0 retries left) LBA=13951071
> ad6: TIMEOUT - WRITE_DMA retrying (0 retries left) LBA=13951327
> ad6: FAILURE - WRITE_DMA timed out LBA=13951071
> ad6: FAILURE - WRITE_DMA timed out LBA=13951327
> ad6: TIMEOUT - WRITE_DMA retrying (0 retries left) LBA=13951583
> ad6: TIMEOUT - WRITE_DMA retrying (0 retries left) LBA=13951839
> ad6: FAILURE - WRITE_DMA timed out LBA=13951583
> ad6: FAILURE - WRITE_DMA timed out LBA=13951839
> ad6: TIMEOUT - WRITE_DMA retrying (0 retries left) LBA=13952095
> ad6: TIMEOUT - WRITE_DMA retrying (0 retries left) LBA=13952351
> g_vfs_done():ad6s1d[WRITE(offset=7142916096, length=131072)]error = 5
> g_vfs_done():ad6s1d[WRITE(offset=7143047168, length=131072)]error = 5
> g_vfs_done():ad6s1d[WRITE(offset=7143178240, length=131072)]error = 5
> g_vfs_done():ad6s1d[WRITE(offset=7143309312, length=131072)]error = 5
> g_vfs_done():ad6s1d[WRITE(offset=7143440384, length=131072)]error = 5
>
> It appears my /dev/ad6 (a Seagate -- more irony) must have some bad
> blocks.  Actually, after letting things go for a while, I realised the
> box just locked up.  Probably kernel panic'd due to the I/O problem.
> I'll have to poke at SMART stats later to see what showed up.

Wow, pretty crazy!  Hmm, and yes, those LBAs do look close together.
Well, let me know how the smartctl output looks.  I'd be curious if your
bad sector count rises.  I had noticed that 1

BTW, I tried:

crater# dd if=/dev/ad1s4 of=/dev/null bs=64k
^C1408596+0 records in
1408596+0 records out
92313747456 bytes transferred in 1415.324362 secs (65224446 bytes/sec)

(I let it go for 92GB or so) - no messages about ad1.  So I wonder if
this points at either the cable connector on ad0 or the drive itself.  I
guess I'd rather have a failing drive than motherboard...  I originally
was wondering if somehow something peculiar about ZFS's disk access
pattern was making it happen...

Thanks for the recommendations.  I'll keep an eye on it, and I'll let you
know what a cable change does for me.  Still, I have not had any ad0
messages since this morning (I haven't been using the system today much,
but maybe the cron processes are more likely to trigger it...)

						-Joe
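
For anyone who wants to check the "close together" observation and the dd
figures, here is a rough arithmetic sketch in Python.  The numbers are copied
from the dmesg and dd output in this message.  The 512-byte sector size is an
assumption (it is not stated anywhere in the thread), and the variable and
function names are made up for illustration.

```python
# Assumption: the drive uses 512-byte sectors (not stated in the thread).
SECTOR_BYTES = 512

# Distinct LBAs from the quoted ad6 WRITE_DMA messages, in log order.
lbas = [13951071, 13951327, 13951583, 13951839, 13952095, 13952351]

# Byte offsets and length from the quoted g_vfs_done() errors.
offsets = [7142916096, 7143047168, 7143178240, 7143309312, 7143440384]
write_length = 131072


def gaps(values):
    """Return the differences between consecutive values."""
    return [b - a for a, b in zip(values, values[1:])]


lba_gaps = gaps(lbas)          # spacing between failing LBAs, in sectors
offset_gaps = gaps(offsets)    # spacing between failing writes, in bytes
gap_bytes = {g * SECTOR_BYTES for g in lba_gaps}

print("LBA gaps (sectors):", lba_gaps)
print("LBA gaps (bytes):  ", sorted(gap_bytes))
print("offset gaps (bytes):", offset_gaps)
print("matches write length:", gap_bytes == {write_length})

# dd sanity check: bs=64k means 65536-byte records.
records = 1408596
block_bytes = 64 * 1024
total_bytes = records * block_bytes
seconds = 1415.324362
rate = total_bytes / seconds

print("dd total bytes:", total_bytes)
print("dd rate (bytes/sec): %.0f" % rate)
```

Under that sector-size assumption, the failing LBAs are evenly spaced by one
write length each, so the errors look like one contiguous run of back-to-back
writes rather than failures scattered across the disk.  That by itself does
not say whether the cause is the media, the cable, or the controller.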