From owner-freebsd-stable@FreeBSD.ORG  Sat May 15 16:26:28 2010
Date: Sat, 15 May 2010 09:26:24 -0700
From: Jeremy Chadwick <freebsd@jdc.parodius.com>
To: Pieter de Boer <pieter@os3.nl>
Message-ID: <20100515162624.GA39585@icarus.home.lan>
References: <4BED8B89.6010901@os3.nl> <20100514195346.GA8977@icarus.home.lan>
	<4BEDBC08.2040002@os3.nl> <20100514224236.GA11680@icarus.home.lan>
	<4BEE476B.6020407@os3.nl>
MIME-Version: 1.0
Content-Type: text/plain; charset=us-ascii
Content-Disposition: inline
In-Reply-To: <4BEE476B.6020407@os3.nl>
User-Agent: Mutt/1.5.20 (2009-06-14)
Cc: freebsd-stable@freebsd.org
Subject: Re: Read / write timeouts on SATA disks connected to ICH9

On Sat, May 15, 2010 at 09:04:11AM +0200, Pieter de Boer wrote:
> Thanks for your elaborate reply, it was very useful to see smartctl
> output explained a bit :) I still think there's something else in
> play beside disk failure. I've checked one of the drives I replaced
> earlier, but that one doesn't have any of the errors in its SMART
> output you described, although it did drop out of the mirror
> multiple times during its lifetime.

That could be caused by a multitude of other known things.  For example,
some Western Digital "Green" drives (including the Enterprise class
ones) are known to perform head parking/offloading excessively, which
could result in the drive spending more time doing that than actually
serving I/O requests.  There are also reports of Samsung Spinpoint
drives experiencing other issues (I've since forgotten the details and
would have to dig up the threads).

If you could provide full SMART stats for that drive, it might help.

> >The WD Caviar Black drives have a useful feature called TLER -- it's
> >disabled by default, for reasons which I don't want to get into here --
> >which can force the drive to internally give up after X seconds (it's
> >user-selectable) when dealing with such remapping/errors.  The idea is
> >to keep the drive from being deemed dead from the OS/controller's point
> >of view.  I believe Seagate, Hitachi, or Samsung (I forget which) have
> >this feature as well, but it's not called TLER.
>
> I've read about this feature, but didn't have the time to try to get
> it turned on (iirc you'd need a specific Western Digital DOS-based
> util or something).

Yes, it's a DOS-based utility (like most firmware upgraders these days).
I can provide it if you'd like.  I've been meaning to spend some time
trying to reverse-engineer the binary to figure out what ATA commands it
sends to the disk to toggle/adjust the feature (so that one could do it
in real-time rather than have to boot into DOS).

> >If you want to find out the exact LBA that has the problem (there may be
> >more than one), I can step you through performing a selective LBA scan
> >using SMART, since this model of disk does support such.  It's easy to
> >do, easy to understand the results, and can be done while the drive is
> >in operation (though I would recommend trying to keep disk I/O to a
> >minimum during this test).  Let me know.
>
> At a certain point in time I had read errors from specific LBA's on
> ad4. Using dd I was able to pinpoint those to single sectors.

This isn't very effective: dd reads large chunks of data (read: multiple
LBAs) from the underlying disk at once, whereas a SMART selective scan
has the disk itself test each LBA individually.  My opinion is that the
"dd method" should only be used on drives which don't support selective
LBA scanning via SMART.

> Overwriting those sectors with what was on ad6 made them readable
> again. What is odd is that the 'remapped sector' count of ad4 is 0.

What may have happened is that the drive took a while to read certain
LBAs (long enough for the OS/controller to time out), but that internal
drive ECC was used to correct the reads and the sectors therefore *did
not* need to be remapped.  I do see that Attribute 1 on ad4 is non-zero,
which could indicate said situation, but WD doesn't provide Attribute
195 (ECC recovery rate), which could help here.

SMART implementations are usually quite good (particularly in recent WD
drives), but I have seen situations where certain counters are,
erroneously, not being incremented or changed.  I've seen a couple brand
new disks come out of the factory with non-zero values (indicating
someone at the fab forgot to clear them before shipping).  I'd love to
get my hands on a WD utility that zeros out the counters and re-flashes
the drive firmware to rule out any oddities.

It's been proven already that WD will reuse the same F/W version
number despite some code being changed.  There was a FreeBSD user who
got a F/W fix from WD for the head offloading/parking ordeal (see above,
re: WD GP), and the old and new firmware reported the same version.
Tracking stuff like this down is basically impossible unless
MD5/SHAs of the firmware files can be provided (good luck).

All HD vendors have their own quirks/ordeals right now.  You basically
just have to go with one who works well for you, then if things start
going downhill, switch to another.  None of them are perfect.

> Still I'd like to know how to perform such a scan.

smartctl -t select,0-max <disk>

This will start a selective LBA scan from LBA 0 to the end of the disk.
If any error is encountered, the scan stops and the error -- including
the LBA where an error was seen -- is output in the SMART self-test and
SMART selective self-test logs.  You can then write down the LBA and
re-run the above command, replacing "0" with the LBA where the error
was seen plus 1.

Here's an example of what a failed selective scan looks like (taken from
a Hitachi disk I just dealt with at work a few weeks ago, starting at
LBA 100000):

=== START OF READ SMART DATA SECTION ===
SMART Self-test log structure revision number 1
Num  Test_Description    Status                  Remaining  LifeTime(hours)  LBA_of_first_error
# 1  Selective offline   Completed: read failure       90%      4931         6153934

SMART Selective self-test log data structure revision number 1
 SPAN  MIN_LBA     MAX_LBA  CURRENT_TEST_STATUS
    1   100000  1953525167  Completed_read_failure [90% left] (6153934-6219469)
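
If you want to script the "re-run from LBA+1" step, something like the
following might work.  It is only a sketch: next_scan_start is a made-up
helper name, and it assumes the self-test log layout shown above (most
recent test listed as "# 1", failing LBA in the last column, "-" there
when no error was seen).

```shell
# Hypothetical helper: read "smartctl -l selftest" output on stdin and
# print the LBA to resume from (failing LBA + 1).  Prints nothing if
# the most recent test did not report a failing LBA.
next_scan_start() {
    awk '/^[#] *[0-9]+/ {
        # Only the first (most recent) log entry is examined.
        if ($NF ~ /^[0-9]+$/) printf "%.0f\n", $NF + 1
        exit
    }'
}

# Example use (the device name is an assumption):
#   start=$(smartctl -l selftest /dev/ad4 | next_scan_start)
#   [ -n "$start" ] && smartctl -t select,"$start"-max /dev/ad4
```

Repeating this until the scan completes without error should give you
the list of problem LBAs, one per pass.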

> >># vmstat -i
> >>interrupt                          total       rate
> >>irq23: atapci0                 371021299      10423
>
> The rate is higher than 10000 also at idle. During a gmirror sync
> from ad6 to ad4, it's about 10670.

In your other post, we determined that your interrupt rate dropped to a
completely normal value (1500 during a gmirror scan or rebuild) after a
system reboot.  I'm not surprised a reboot addressed it (for now...).

What this indicates to me is that if a disk falls off the bus on an ICH9
controller in Enhanced (non-AHCI) mode, FreeBSD starts seeing an absurd
number of interrupts generated from the ICH9.  My guess is FreeBSD isn't
doing something correctly with the controller when this happens; maybe
certain commands aren't being sent back to the controller, or certain
events are being handled improperly when it comes to ICH9 (or
possibly earlier ICH revisions too).  This should be *very* easy to
reproduce.
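
To catch this when it happens, a small filter over "vmstat -i" could be
run from cron or by hand.  This is a rough sketch, not a tested tool:
high_irq_rates is a hypothetical name, the limit is whatever you
consider abnormal, and it assumes the rate is the last column as in
your output.

```shell
# Hypothetical helper: read "vmstat -i" output on stdin and print the
# name and rate of every irq line whose rate exceeds the given limit.
high_irq_rates() {
    awk -v limit="$1" '$1 ~ /^irq/ && $NF + 0 > limit + 0 {
        print $1, $NF
    }'
}

# Example use (the limit of 5000 is an arbitrary assumption):
#   vmstat -i | high_irq_rates 5000
```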

> >"iostat 1", "iostat -x 1", or "gstat" might come in handy to tell you
> >what kind of disk I/O is going on.  If actual I/O is very little, then
> >something weird is going on with regards to the number of interrupts
> >being seen on IRQ 23.  mav@ might have some ideas, otherwise I'd
> >recommend rebooting the machine and seeing if the number drops.  If so,
> >it may be that the OS has some sort of bug where a disk timing out or
> >falling off the bus causes interrupt problems.  (It's too bad you don't
> >have AHCI on this system.  It handles stuff like this much more
> >elegantly...)
> If mav@ or anyone else doesn't have another insight in the interrupt
> rate, I guess a reboot will at least show if it's persistent or
> related to the errors. I'll try to do a reboot when convenient
> (probably sunday morning or something).

If you see any of your disks on the ICH9 controller fall off the bus or
report ATA errors (doesn't matter what kind), please make note of the
timestamp (should be in the kernel log), and ASAP run "smartctl -a" on
the disk.  You should compare attributes before and after the event.
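
One way to do the before/after comparison is to save "smartctl -A"
output to files and diff the raw values.  A minimal sketch, assuming
the usual attribute table layout (ID in column 1, name in column 2,
RAW_VALUE in column 10); smart_attr_changes is a made-up name:

```shell
# Hypothetical helper: given two files holding "smartctl -A" output,
# print every attribute whose raw value differs between them.
smart_attr_changes() {
    awk '
        # Attribute rows start with a numeric ID and have >= 10 columns.
        $1 ~ /^[0-9]+$/ && NF >= 10 {
            if (NR == FNR) { before[$1] = $10; next }
            if (($1 in before) && before[$1] != $10)
                printf "%s %s: %s -> %s\n", $1, $2, before[$1], $10
        }
    ' "$1" "$2"
}

# Example use (device and file names are assumptions):
#   smartctl -A /dev/ad4 > /var/tmp/ad4.before
#   ... wait for the event, then:
#   smartctl -A /dev/ad4 > /var/tmp/ad4.after
#   smart_attr_changes /var/tmp/ad4.before /var/tmp/ad4.after
```

Only the first word of the raw value is compared, so attributes that
print extra text after the number are matched on the number alone.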

You might also want to consider using smartd, which can log SMART
attribute changes on its own.  Note that you might have to tune the
arguments in smartd.conf to ignore some attributes which fluctuate
naturally (such as drive temperature and seek error rate).
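
As a starting point for smartd.conf, the "-I ID" directive tells smartd
to ignore a given attribute when tracking changes.  The sketch below
just builds such a line; smartd_conf_line is a hypothetical helper, and
the IDs 194 (temperature) and 7 (seek error rate) are the common
numbering but should be checked against your drive's "smartctl -A"
output and smartd.conf(5).

```shell
# Hypothetical helper: print a smartd.conf line that monitors a device
# with -a and ignores changes in the listed attribute IDs via -I.
smartd_conf_line() {
    dev=$1
    shift
    line="$dev -a"
    for id in "$@"; do
        line="$line -I $id"
    done
    printf '%s\n' "$line"
}

# Example use (device name and attribute IDs are assumptions):
#   smartd_conf_line /dev/ad4 194 7
```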

-- 
| Jeremy Chadwick                                   jdc@parodius.com |
| Parodius Networking                       http://www.parodius.com/ |
| UNIX Systems Administrator                  Mountain View, CA, USA |
| Making life hard for others since 1977.              PGP: 4BD6C0CB |