From: Joe Peterson
Date: Sat, 26 Jan 2008 11:30:56 -0700
To: freebsd-stable@freebsd.org
Subject: Re: "ad0: TIMEOUT - WRITE_DMA" type errors with 7.0-RC1
Message-ID: <479B7C60.7000800@skyrush.com>
In-Reply-To: <20080126012124.GA53400@eos.sc1.parodius.com>

I performed a ZFS scrub, which finished yesterday, and no new
/var/log/messages errors were reported during that time.
However, the scrub found something interesting:

crater# zpool status -v
  pool: tank
 state: ONLINE
status: One or more devices has experienced an error resulting in data
        corruption.  Applications may be affected.
action: Restore the file in question if possible.  Otherwise restore the
        entire pool from backup.
   see: http://www.sun.com/msg/ZFS-8000-8A
 scrub: scrub completed with 1 errors on Fri Jan 25 12:52:32 2008
config:

        NAME        STATE     READ WRITE CKSUM
        tank        ONLINE       1     3     2
          ad0s1d    ONLINE       1     3     2

errors: Permanent errors have been detected in the following files:

        /home/joe/music/jukebox/christmas/Esquivel/Merry_XMas_from_the_SpaceAge_Bachelor_Pad/07-Snowfall.mp3

Note that I have not touched this file since copying it to this drive.
So it seems one file failed a checksum check during the scrub. As
expected, I now get errors when trying to read this file - probably ZFS
indicating the condition.

When I just logged in tonight, I got two more disk messages in
/var/log/messages about WRITE_DMA48 TIMEOUT/FAILURE - might be a
coincidence (they came just as I was typing my password).

Also, smartctl still shows PASSED. However, this is interesting:

195 Hardware_ECC_Recovered  0x001a   061   046   000    Old_age   Always       -       9070

The number is much *smaller* now!  It was "6" a few minutes before
this... wrap around?

Hmm, I'm really not sure, at this point, what is going on. So I have
started a "SeaTools" (disk scanner from Seagate) "long test" of the
drive. The short test passed already. The results should be
interesting. If it finds nothing wrong, I am going to start to wonder
if I am experiencing ZFS bugs that just happen to look like drive
problems. I already did a long read of the disk contents under Linux
and got no messages about anything wrong.

If I can turn on any debugging info to help determine if this is
software-related, let me know the magic keywords to use. :)

					-Joe