From owner-freebsd-stable@FreeBSD.ORG  Wed May 19 09:11:07 2010
Date: Wed, 19 May 2010 02:11:04 -0700
From: Jeremy Chadwick <freebsd@jdc.parodius.com>
To: Damian Gerow <dgerow@afflictions.org>
Message-ID: <20100519091103.GA72058@icarus.home.lan>
References: <20100519021402.GI92949@plebeian.afflictions.org>
MIME-Version: 1.0
Content-Type: text/plain; charset=us-ascii
Content-Disposition: inline
In-Reply-To: <20100519021402.GI92949@plebeian.afflictions.org>
User-Agent: Mutt/1.5.20 (2009-06-14)
Cc: stable@freebsd.org
Subject: Re: AHCI timeouts on S3 resume
List-Id: Production branch of FreeBSD source code <freebsd-stable.freebsd.org>

On Tue, May 18, 2010 at 10:14:03PM -0400, Damian Gerow wrote:
> A few months back, I swapped out my dying hard drive for a WD Scorpio Blue.
> Cheap, seemed reliable, and it was the only drive the local shop had in
> stock.  However, it seems that AHCI doesn't like this device, and is having
> troubles during an S3 resume.  It appears as though I'm experiencing two
> types of timeouts when resuming: recoverable, and non-recoverable.
> 
> My question is: do I have a bad HDD, or is AHCI just not playing nicely?

Your hard disk looks generally OK; it isn't going bad.  The one thing I
can't tell is whether the disk is actually spinning back up on resume;
you'd have to literally listen for it, or look at SMART Attribute #4
before and after a suspend/resume.  I'll discuss analysis of SMART
statistics further down.

The error messages you see coming from the AHCI driver indicate, to me,
one of three things: 1) The ICH9 controller being stuck (possibly resume
does something incorrectly to the controller), 2) FreeBSD not doing
something quite right when coming out of suspend mode, or 3) the disk
never waking up.  If I had to take a guess, I'd say #2.

mav@ might be able to help determine if something is being done
incorrectly in the AHCI driver after resume.  If the driver is doing the
Right Thing(tm), then the next thing to do would be to discuss the
problem on freebsd-acpi@.  I can't help with these things.

I will point out, however, that you've set this value in loader.conf:

> hw.pci.do_power_nodriver="2"

I've read the sysctl -d description for it, but I am not familiar with
sleep/power states so I don't know the implications.  I worry that this
value may be causing problems with your ICH9 controller.  If you could
comment this out and re-try suspend/resume to see if AHCI times out, you
might determine if it's responsible for the problem.

> The HDD is a WD Scorpio blue, model WD5000BEVT-22A0RT0, and isn't exactly
> the fastest drive on the planet.  SMART seems to be relatively clean, with
> some mild questions surrounding attributes 191, 9/193, and 194:
> 
> -----
> ID# ATTRIBUTE_NAME          FLAG     VALUE WORST THRESH TYPE      UPDATED  WHEN_FAILED RAW_VALUE
>   3 Spin_Up_Time            0x0027   186   185   021    Pre-fail  Always       -       1675
>   4 Start_Stop_Count        0x0032   055   055   000    Old_age   Always       -       45174
>   9 Power_On_Hours          0x0032   100   100   000    Old_age   Always       -       723
> 191 G-Sense_Error_Rate      0x0032   072   072   000    Old_age   Always       -       28
> 193 Load_Cycle_Count        0x0032   162   162   000    Old_age   Always       -       115712
> 194 Temperature_Celsius     0x0022   112   106   000    Old_age   Always       -       35
> -----

Attribute #3 indicates the total amount of time it takes for the drive
to spin up (usually in milliseconds).  I'll point out that there are
drives out there (such as the WD Caviar Black) which report ~8s spin-up
times when powered on; this is normal.  The drive is actually able to
function during the spin-up, which is why those systems don't take a
full 8 seconds before they're able to read from the HD.  I wanted to
point out this attribute because you've brought up concerns over AHCI 15
second timeouts being hit.

Attribute #4 indicates the number of times the disk has been told by the
controller to spin up or spin down.  This counter should increase when
your laptop goes in/out of suspend/resume.  I wanted to point out this
attribute because of what I said in my first paragraph.
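
A minimal sketch of that before/after check, assuming you save the
output of "smartctl -A" for the disk once before suspending and once
after resuming.  The regular expression, the function names, and the
"after" sample line are made up for illustration; only the "before" row
comes from the table quoted above.

```python
import re

# One row of the `smartctl -A` attribute table: ID, name, flag, VALUE,
# WORST, THRESH, type, updated, when_failed, then the leading digits of
# RAW_VALUE.  The header line does not match because "ID#" is not numeric.
ROW = re.compile(
    r"^\s*(\d+)\s+\S+\s+0x[0-9a-fA-F]+\s+\d+\s+\d+\s+\d+"
    r"\s+\S+\s+\S+\s+\S+\s+(\d+)"
)


def raw_values(smart_text):
    """Map attribute ID -> RAW_VALUE for every table row found in the text."""
    values = {}
    for line in smart_text.splitlines():
        match = ROW.match(line)
        if match:
            values[int(match.group(1))] = int(match.group(2))
    return values


def start_stop_delta(before_text, after_text):
    """How much Start_Stop_Count (ID 4) grew between two snapshots."""
    return raw_values(after_text)[4] - raw_values(before_text)[4]


# "before" is the row from the quoted table; "after" is a hypothetical
# snapshot taken once the machine has resumed.
before = "  4 Start_Stop_Count        0x0032   055   055   000    Old_age   Always       -       45174"
after = "  4 Start_Stop_Count        0x0032   055   055   000    Old_age   Always       -       45175"

print("Start_Stop_Count grew by", start_stop_delta(before, after))
```

A delta of zero across a suspend/resume would suggest the disk was
never told to spin back up.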

Attribute #9 indicates the total amount of time the hard disk has been
powered on (read: not asleep) during its lifetime.  I can't tell you
whether or not this value is correct; only you would be able to
determine that, given your usage patterns.  I *have* seen desktop drives
which have reported this value incorrectly (meaning, servers I know have
been on for thousands of hours that show "4" for this RAW_VALUE;
probably a firmware bug).

Attribute #191 indicates a *rate* of G-shock events.  The drive has a
G-shock sensor inside of it.  This value being non-zero is perfectly
fine for laptops; people have a tendency to walk around with their
systems on, tilt them sideways, place them on the desk firmly, etc.
The sensor is sensitive, and it isn't intended to detect "severity" of
shock (e.g. throwing your laptop across the room); it's intended to
measure a rate.  The RAW_VALUE doesn't mean anything to me; 28 what?  We
don't know.  Only WD knows if that's a safe value or not.  So what do we
do in this case?  We look at the adjusted value VALUE and compare it to
WORST and THRESH.  SMART disk failure won't get triggered until VALUE
drops to THRESH (000 here), so 072 is pretty good.  I'd say don't worry
about it.  (I'll use this opportunity to point out to readers that this
is why looking at RAW_VALUE explicitly is not always the correct way to
read SMART.)
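
The VALUE-versus-THRESH comparison can be sketched as below, using the
normalized columns from the table quoted above.  It assumes the usual
SMART convention that an attribute counts as failing once VALUE is at
or below THRESH; the function names are hypothetical.

```python
# (ID, name, VALUE, WORST, THRESH) copied from the quoted smartctl table.
ROWS = [
    (3, "Spin_Up_Time", "186", "185", "021"),
    (191, "G-Sense_Error_Rate", "072", "072", "000"),
    (193, "Load_Cycle_Count", "162", "162", "000"),
    (194, "Temperature_Celsius", "112", "106", "000"),
]


def headroom(value, thresh):
    """Distance between the normalized VALUE and its failure THRESH."""
    return int(value) - int(thresh)


def at_or_below_threshold(rows):
    """Names of attributes whose normalized VALUE has reached THRESH."""
    return [name for _attr_id, name, value, _worst, thresh in rows
            if headroom(value, thresh) <= 0]


for attr_id, name, value, worst, thresh in ROWS:
    print(f"{attr_id:>3} {name:<20} headroom={headroom(value, thresh)}")
print("at or below threshold:", at_or_below_threshold(ROWS))
```

None of the quoted attributes is anywhere near its threshold, which is
the point: the normalized columns, not RAW_VALUE, are what SMART itself
judges.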

Attribute #193 indicates the number of times the actuator arm (thus
heads) has been parked or come out of being parked.  There is a known
problem with some models of WD "Green Power" (GP) drives where the drive
spends an excessive amount of time parking, and this counter increases
rapidly.  One FreeBSD user who reported this problem to Western Digital
received a replacement firmware which addressed the problem.  The WD
Scorpio Blue drives (or some of them) may have this same problem --
HOWEVER, this model of hard disk (2.5" FF) is *specifically* intended
for laptops and low-power environments, so the behaviour seen in this
case could be 100% normal.  WD would hopefully know.
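
To put a rough number on "increases rapidly", the raw counters in the
quoted table can be divided by Power_On_Hours.  This is only a
back-of-the-envelope sketch: it assumes attribute 9 is being reported
correctly (which, as noted earlier, is not guaranteed), and the helper
name is made up for illustration.

```python
# RAW_VALUEs taken from the quoted smartctl table.
POWER_ON_HOURS = 723       # attribute 9
START_STOP_COUNT = 45174   # attribute 4
LOAD_CYCLE_COUNT = 115712  # attribute 193


def per_hour(count, hours):
    """Average number of events per power-on hour."""
    return count / hours


print("load cycles per hour: ", round(per_hour(LOAD_CYCLE_COUNT, POWER_ON_HOURS), 1))
print("start/stops per hour: ", round(per_hour(START_STOP_COUNT, POWER_ON_HOURS), 1))
```

That works out to roughly 160 load cycles and 62 start/stops per
power-on hour.  Whether that is acceptable for this model is still a
question for WD.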

Attribute #194 indicates the temperature of the disk in Celsius.  35C is
nothing to worry about.  If this system is a laptop, then that's an
excellent temperature given airflow constraints.  If this system is a
PC, that temperature is also perfectly fine.  You should be worried if
this temperature reaches 45C or higher.  I can tell from the adjusted
value WORST that the drive has seen higher temperatures during its
lifetime (how high I do not know; WD drives don't store that, while some
other vendors' drives do).

Hope this helps.

-- 
| Jeremy Chadwick                                   jdc@parodius.com |
| Parodius Networking                       http://www.parodius.com/ |
| UNIX Systems Administrator                  Mountain View, CA, USA |
| Making life hard for others since 1977.              PGP: 4BD6C0CB |