From owner-freebsd-stable@FreeBSD.ORG Wed May 19 09:11:07 2010
Return-Path:
Delivered-To: stable@freebsd.org
Received: from mx1.freebsd.org (mx1.freebsd.org [IPv6:2001:4f8:fff6::34])
	by hub.freebsd.org (Postfix) with ESMTP id 8E7121065670
	for ; Wed, 19 May 2010 09:11:07 +0000 (UTC)
	(envelope-from jdc@koitsu.dyndns.org)
Received: from qmta09.westchester.pa.mail.comcast.net
	(qmta09.westchester.pa.mail.comcast.net [76.96.62.96])
	by mx1.freebsd.org (Postfix) with ESMTP id 3CB308FC14
	for ; Wed, 19 May 2010 09:11:06 +0000 (UTC)
Received: from omta23.westchester.pa.mail.comcast.net ([76.96.62.74])
	by qmta09.westchester.pa.mail.comcast.net with comcast
	id KM7y1e0021c6gX859MB76t; Wed, 19 May 2010 09:11:07 +0000
Received: from koitsu.dyndns.org ([98.248.46.159])
	by omta23.westchester.pa.mail.comcast.net with comcast
	id KMB51e0053S48mS3jMB653; Wed, 19 May 2010 09:11:07 +0000
Received: by icarus.home.lan (Postfix, from userid 1000)
	id 0D1EE9B419; Wed, 19 May 2010 02:11:04 -0700 (PDT)
Date: Wed, 19 May 2010 02:11:04 -0700
From: Jeremy Chadwick
To: Damian Gerow
Message-ID: <20100519091103.GA72058@icarus.home.lan>
References: <20100519021402.GI92949@plebeian.afflictions.org>
MIME-Version: 1.0
Content-Type: text/plain; charset=us-ascii
Content-Disposition: inline
In-Reply-To: <20100519021402.GI92949@plebeian.afflictions.org>
User-Agent: Mutt/1.5.20 (2009-06-14)
Cc: stable@freebsd.org
Subject: Re: AHCI timeouts on S3 resume
X-BeenThere: freebsd-stable@freebsd.org
X-Mailman-Version: 2.1.5
Precedence: list
List-Id: Production branch of FreeBSD source code
List-Unsubscribe: ,
List-Archive:
List-Post:
List-Help:
List-Subscribe: ,
X-List-Received-Date: Wed, 19 May 2010 09:11:07 -0000

On Tue, May 18, 2010 at 10:14:03PM -0400, Damian Gerow wrote:
> A few months back, I swapped out my dying hard drive for a WD Scorpio Blue.
> Cheap, seemed reliable, and it was the only drive the local shop had in
> stock.
> However, it seems that AHCI doesn't like this device, and is having
> troubles during an S3 resume.  It appears as though I'm experiencing two
> types of timeouts when resuming: recoverable, and non-recoverable.
>
> My question is: do I have a bad HDD, or is AHCI just not playing nicely?

Your hard disk looks generally OK; it isn't going bad.  The one thing I
can't tell is whether the disk is actually spinning back up on resume;
you'd have to literally listen for it, or look at SMART Attribute #4
before and after a suspend/resume.  I'll discuss analysis of SMART
statistics further down.

The error messages you see coming from the AHCI driver indicate, to me,
one of three things: 1) the ICH9 controller being stuck (possibly resume
does something incorrectly to the controller), 2) FreeBSD not doing
something quite right when coming out of suspend mode, or 3) the disk
never waking up.

If I had to take a guess, I'd say #2.  mav@ might be able to help
determine if something is being done incorrectly in the AHCI driver
after resume.  If the driver is doing the Right Thing(tm), then the next
thing to do would be to discuss the problem on freebsd-acpi@.  I can't
help with these things.

I will point out, however, that you've set this value in loader.conf:

> hw.pci.do_power_nodriver="2"

I've read the sysctl -d description for it, but I am not familiar with
sleep/power states, so I don't know the implications.  I worry that this
value may be causing problems with your ICH9 controller.  If you could
comment this out and re-try suspend/resume to see if AHCI still times
out, you might determine whether it's responsible for the problem.

> The HDD is a WD Scorpio blue, model WD5000BEVT-22A0RT0, and isn't exactly
> the fastest drive on the planet.
> SMART seems to be relatively clean, with
> some mild questions surrounding attributes 191, 9/193, and 194:
>
> -----
> ID# ATTRIBUTE_NAME       FLAG    VALUE WORST THRESH TYPE     UPDATED WHEN_FAILED RAW_VALUE
>   3 Spin_Up_Time         0x0027  186   185   021    Pre-fail Always  -           1675
>   4 Start_Stop_Count     0x0032  055   055   000    Old_age  Always  -           45174
>   9 Power_On_Hours       0x0032  100   100   000    Old_age  Always  -           723
> 191 G-Sense_Error_Rate   0x0032  072   072   000    Old_age  Always  -           28
> 193 Load_Cycle_Count     0x0032  162   162   000    Old_age  Always  -           115712
> 194 Temperature_Celsius  0x0022  112   106   000    Old_age  Always  -           35
> -----

Attribute #3 indicates the total amount of time it takes for the drive
to spin up (usually in milliseconds).  I'll point out that there are
drives out there (such as the WD Caviar Black) which report ~8s spin-up
times when powered on; this is normal.  The drive is actually able to
function during the spin-up, which is why those systems don't take a
full 8 seconds before they're able to read from the HD.  I wanted to
point out this attribute because you've brought up concerns over AHCI
15 second timeouts being hit.

Attribute #4 indicates the number of times the disk has been told by the
controller to spin up or spin down.  This counter should increase when
your laptop goes in/out of suspend/resume.  I wanted to point out this
attribute because of what I said in my first paragraph.

Attribute #9 indicates the total amount of time the hard disk has been
powered on (read: not asleep) during its lifetime.  I can't tell you
whether or not this value is correct; only you would be able to
determine that, given your usage patterns.  I *have* seen desktop drives
which have reported this value incorrectly (meaning, servers I know have
been on for thousands of hours that show "4" for this RAW_VALUE;
probably a firmware bug).

Attribute #191 indicates a *rate* of G-shock events.  The drive has a
G-shock sensor inside of it.
This value being non-zero is perfectly fine for laptops; people have a
tendency to walk around with their systems on, tilt them sideways, place
them on the desk firmly, etc.  The sensor is sensitive, and it isn't
intended to detect "severity" of shock (e.g. throwing your laptop across
the room); it's intended to measure a rate.  The RAW_VALUE doesn't mean
anything to me; 28 what?  We don't know.  Only WD knows if that's a safe
value or not.  So what do we do in this case?  We look at the adjusted
value VALUE and compare it to WORST and THRESH.  SMART disk failure
won't get triggered until VALUE reaches THRESH (000 here), so 072 is
pretty good.  I'd say don't worry about it.  (I'll use this opportunity
to point out to readers that this is why looking at RAW_VALUE explicitly
is not always the correct way to read SMART.)

Attribute #193 indicates the number of times the actuator arm (thus
heads) has been parked or come out of being parked.  There is a known
problem with some models of WD "Green Power" (GP) drives where the drive
spends an excessive amount of time parking, and this counter increases
rapidly.  One FreeBSD user who reported this problem to Western Digital
received a replacement firmware which addressed the problem.  The WD
Scorpio Blue drives (or some of them) may have this same problem --
HOWEVER, this model of hard disk (2.5" FF) is *specifically* intended
for laptops and low-power environments, so the behaviour seen in this
case could be 100% normal.  WD would hopefully know.

Attribute #194 indicates the temperature of the disk in Celsius.  35C is
nothing to worry about.  If this system is a laptop, then that's an
excellent temperature given airflow constraints.  If this system is a
PC, that temperature is also perfectly fine.  You should be worried if
this temperature reaches 45C or higher.
I can tell from the adjusted value WORST that the drive has seen higher
temperatures during its lifetime (how high I do not know; WD drives
don't store that, while some other vendors' drives do).

Hope this helps.

-- 
| Jeremy Chadwick                                   jdc@parodius.com |
| Parodius Networking                       http://www.parodius.com/ |
| UNIX Systems Administrator                  Mountain View, CA, USA |
| Making life hard for others since 1977.              PGP: 4BD6C0CB |
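
The two SMART checks suggested in this message (comparing the adjusted
VALUE against THRESH rather than reading RAW_VALUE, and watching
Attribute #4 before and after a suspend/resume) could be sketched
roughly as below.  This is only an illustration, not part of the
original message: the names SmartAttr, parse_attributes, margin and
raw_delta are hypothetical, the parser assumes rows shaped like the
table quoted above with a plain integer RAW_VALUE, and the "after"
snapshot is invented sample data, not a real reading from this drive.

```python
from typing import Dict, NamedTuple


class SmartAttr(NamedTuple):
    attr_id: int
    name: str
    value: int   # adjusted (normalized) VALUE column
    worst: int   # WORST column
    thresh: int  # THRESH column
    raw: int     # RAW_VALUE column (assumed to be a plain integer)


def parse_attributes(text: str) -> Dict[int, SmartAttr]:
    """Parse rows shaped like the attribute table quoted in this thread."""
    attrs: Dict[int, SmartAttr] = {}
    for line in text.splitlines():
        fields = line.split()
        # Data rows have 10 columns and start with a numeric attribute ID;
        # the header row ("ID# ...") and separators are skipped.
        if len(fields) < 10 or not fields[0].isdigit():
            continue
        attrs[int(fields[0])] = SmartAttr(
            attr_id=int(fields[0]),
            name=fields[1],
            value=int(fields[3]),
            worst=int(fields[4]),
            thresh=int(fields[5]),
            raw=int(fields[9]),
        )
    return attrs


def margin(attr: SmartAttr) -> int:
    """How far the adjusted VALUE currently sits above THRESH."""
    return attr.value - attr.thresh


def raw_delta(before: Dict[int, SmartAttr],
              after: Dict[int, SmartAttr],
              attr_id: int) -> int:
    """Change in RAW_VALUE for one attribute between two snapshots."""
    return after[attr_id].raw - before[attr_id].raw


# Snapshot taken from the table quoted earlier in this message.
BEFORE = """\
ID# ATTRIBUTE_NAME FLAG VALUE WORST THRESH TYPE UPDATED WHEN_FAILED RAW_VALUE
  3 Spin_Up_Time        0x0027 186 185 021 Pre-fail Always - 1675
  4 Start_Stop_Count    0x0032 055 055 000 Old_age  Always - 45174
  9 Power_On_Hours      0x0032 100 100 000 Old_age  Always - 723
191 G-Sense_Error_Rate  0x0032 072 072 000 Old_age  Always - 28
193 Load_Cycle_Count    0x0032 162 162 000 Old_age  Always - 115712
194 Temperature_Celsius 0x0022 112 106 000 Old_age  Always - 35
"""

# HYPOTHETICAL snapshot after one suspend/resume cycle: only the
# Start_Stop_Count raw value is changed, purely for illustration.
AFTER = BEFORE.replace("45174", "45175")

if __name__ == "__main__":
    before = parse_attributes(BEFORE)
    after = parse_attributes(AFTER)
    for attr in before.values():
        print(f"{attr.attr_id:>3} {attr.name:<20} margin={margin(attr)}")
    print("Start_Stop_Count delta:", raw_delta(before, after, 4))
```

If the Attribute #4 delta stays at zero across a real suspend/resume,
that would be consistent with the disk never being told to spin back
up; a non-zero delta only shows the counter moved, not that the resume
path is otherwise healthy.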