From owner-freebsd-stable@FreeBSD.ORG Wed Feb 27 12:11:29 2008
Date: Wed, 27 Feb 2008 04:11:29 -0800
From: Jeremy Chadwick
To: Stephen Hurd
Cc: freebsd-stable@freebsd.org
Subject: Re: ad0 READ_DMA TIMEOUT errors on install of 7.0-RELEASE
Message-ID: <20080227121129.GA76419@eos.sc1.parodius.com>
References: <47C52948.2070500@sasktel.net>
In-Reply-To: <47C52948.2070500@sasktel.net>
List-Id: Production branch of FreeBSD source code

On Wed, Feb 27, 2008 at 01:11:36AM -0800, Stephen Hurd wrote:
> ... The corrupted sync message scared the heck out of me:
> Waiting (max 60 seconds) for system process `vnlru' to stop...done
> Waiti
> Synncgi n(gm adxi sk6s0, svencoodnedss )r efmoari nsiynsgte.m.
.pr1o0c ess
> `syncer' to stop...8 7 8 3 3 3 1 0 0 0 0 done

http://lists.freebsd.org/pipermail/freebsd-current/2007-October/078145.html
http://lists.freebsd.org/pipermail/freebsd-current/2007-November/079130.html
http://lists.freebsd.org/pipermail/freebsd-current/2007-November/079131.html
http://lists.freebsd.org/pipermail/freebsd-stable/2007-December/038727.html

> And after the reboot, the READ_DMA timeouts were back.

You're not the only one seeing this behaviour. There have been many
posts in the past reporting similar problems. Here's the breakdown:

* Some reporting this problem were told to replace their ATA or SATA
  cables (which were previously known to be working, but cables do go
  bad), and this fixed the problem for a couple of them.

* Some have checked their SMART stats and found their disks to be in
  perfect condition.

* Some have switched to alternate operating systems (usually Linux) for
  a short while and seen no sign of DMA timeouts.

* Some have replaced the storage controller to no avail, and some have
  replaced the entire motherboard to no avail. In some cases (myself
  included), replacing the motherboard did in fact help.

However: in your case, your disk does look to have problems based on the
SMART output you provided. It does not matter how new/old the disk is,
by the way. I'll point out the problematic stats. You need to replace
the disk ASAP.

BTW, any SMART stats you see labelled "Offline" will not be updated
until you perform an offline test (smartctl -t short or smartctl -t
long).

> The only "odd" think I can think of about my system is an unusually high HZ
> value (2386) I'm building a kernel now with 1000 to check if that makes a
> difference.

This is not the cause, rest assured.
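
For anyone who wants to check attribute rows like these mechanically,
here is a rough Python sketch, not a definitive tool. It assumes the
ten-column row layout that smartctl -A prints (ID#, ATTRIBUTE_NAME, FLAG,
VALUE, WORST, THRESH, TYPE, UPDATED, WHEN_FAILED, RAW_VALUE). The sample
rows are copied from the SMART output quoted in this thread. The names
parse_attr_row and review are made up for illustration, and the 40C
cutoff is an assumed value, not a vendor specification.

```python
from typing import NamedTuple


class SmartAttr(NamedTuple):
    attr_id: int
    name: str
    updated: str  # "Always" or "Offline"
    raw: int


def parse_attr_row(line: str) -> SmartAttr:
    """Parse one attribute row in the smartctl -A column layout."""
    # Drop an optional mail quote prefix ("> ") before splitting.
    fields = line.lstrip("> ").split(None, 9)
    if len(fields) != 10:
        raise ValueError(f"unexpected row: {line!r}")
    # Some drives append extra text to RAW_VALUE; keep the first token.
    raw = int(fields[9].split()[0])
    return SmartAttr(int(fields[0]), fields[1], fields[7], raw)


def review(rows, max_temp_c=40):
    """Return notes for suspicious rows (max_temp_c is an assumed cutoff)."""
    notes = []
    for row in rows:
        attr = parse_attr_row(row)
        if attr.name == "Reallocated_Sector_Ct" and attr.raw > 0:
            notes.append(f"{attr.raw} reallocated sectors: disk has bad blocks")
        elif attr.name == "Temperature_Celsius" and attr.raw > max_temp_c:
            notes.append(f"temperature {attr.raw}C exceeds {max_temp_c}C")
        if attr.updated == "Offline":
            notes.append(f"{attr.name} is only refreshed by an offline test")
    return notes


# Rows copied from the SMART output quoted in this thread.
SAMPLE_ROWS = [
    "5 Reallocated_Sector_Ct 0x0033 253 253 063 Pre-fail Always - 4",
    "194 Temperature_Celsius 0x0032 253 253 000 Old_age Always - 48",
    "195 Hardware_ECC_Recovered 0x000a 253 252 000 Old_age Always - 11498",
]

if __name__ == "__main__":
    for note in review(SAMPLE_ROWS):
        print(note)
```

The sketch only looks at the raw values of two attributes and at the
UPDATED column. It does not compare VALUE against THRESH, and it is no
substitute for reading the full smartctl output.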

> SMART Attributes Data Structure revision number: 16
> Vendor Specific SMART Attributes with Thresholds:
> ID# ATTRIBUTE_NAME          FLAG     VALUE WORST THRESH TYPE      UPDATED  WHEN_FAILED RAW_VALUE
>   5 Reallocated_Sector_Ct   0x0033   253   253   063    Pre-fail  Always       -       4

This shows you've had 4 reallocated sectors, meaning your disk does in
fact have bad blocks. In 90% of the cases out there, bad blocks continue
to "grow" over time, for whatever reason (I remember reading an article
explaining it, but I can't for the life of me find the URL).

> 194 Temperature_Celsius     0x0032   253   253   000    Old_age   Always       -       48

This is excessive, and may be contributing to problems. A hard disk
running at 48C is not a good sign. This should really be somewhere
between the high 20s and mid 30s.

> 195 Hardware_ECC_Recovered  0x000a   253   252   000    Old_age   Always       -       11498

This implies a large number of ECC (error correction) events have
occurred, but all were successful.

> Error 2 occurred at disk power-on lifetime: 5171 hours (215 days + 11 hours)
> When the command that caused the error occurred, the device was in an unknown state.

> Error 1 occurred at disk power-on lifetime: 5171 hours (215 days + 11 hours)
> When the command that caused the error occurred, the device was in an unknown state.

These are automated SMART log entries confirming the DMA failures. The
fact that SMART saw them means that the disk is also aware of said
issues. These may have been caused by the reallocated sectors. It's
also interesting that the LBAs are different from the ones FreeBSD
reported issues with.

My advice to you is: replace the disk ASAP. This problem will only get
worse. Try another hard disk brand too (I don't have anything "against"
Maxtor, but usually it's recommended to avoid a brand you have problems
with until the next time you have issues, then switch brands, etc.
etc...). I'm very fond of Western Digital's SE16, RE, and RE2 series
currently.
But avoid Fujitsu and Samsung (both have a long track record of buggy
drive firmware, forcing vendors to make custom workarounds for issues);
stick with Seagate, Western Digital, or Maxtor.

-- 
| Jeremy Chadwick                                jdc at parodius.com |
| Parodius Networking                       http://www.parodius.com/ |
| UNIX Systems Administrator                  Mountain View, CA, USA |
| Making life hard for others since 1977.              PGP: 4BD6C0CB |