Skip site navigation (1)Skip section navigation (2)
Date:      Tue, 22 Jun 2010 00:45:41 -0700
From:      Jeremy Chadwick <freebsd@jdc.parodius.com>
To:        Matthew Lear <matt@bubblegen.co.uk>
Cc:        freebsd-stable@freebsd.org
Subject:   Re: 7.2-RELEASE-p4, IO errors & RAID1 failure
Message-ID:  <20100622074541.GA71157@icarus.home.lan>
In-Reply-To: <1277155992.1860.3.camel@almscliff.bubblegen.co.uk>
References:  <1276844904.7519.19.camel@almscliff.bubblegen.co.uk> <20100618082127.GA34578@icarus.home.lan> <1276876031.7519.39.camel@almscliff.bubblegen.co.uk> <20100618174208.GA47470@icarus.home.lan> <1276889330.2210.44.camel@almscliff.bubblegen.co.uk> <1277155992.1860.3.camel@almscliff.bubblegen.co.uk>

next in thread | previous in thread | raw e-mail | index | archive | help
On Mon, Jun 21, 2010 at 10:33:12PM +0100, Matthew Lear wrote:
> Hello Jeremy. I just wondered if you had any further thoughts on the
> info below. Two new disks arrived over the weekend and I'm still unsure
> if I'm best to replace ad0 or not...
> Much appreciated indeed.
> --  Matt
> 
> On Fri, 2010-06-18 at 20:28 +0100, Matthew Lear wrote:
> > On Fri, 2010-06-18 at 10:42 -0700, Jeremy Chadwick wrote:
> > > On Fri, Jun 18, 2010 at 04:47:11PM +0100, Matthew Lear wrote:
> > > > Hello Jeremy,
> > > > Thanks very much for the feedback.
> > > > 
> > > > [snip]
> > > > > Could you please provide the full output from "smartctl -a /dev/ad0"
> > > > > here?  Your drive may be completely fine and you may not have to swap it
> > > > > at all; hard to say.
> > > > 
> > > > Sure. See below:
> > > > {snip}
> > > 
> > > Your SMART statistics look completely OK.  There's nothing there that
> > > indicates there were any write failures or otherwise.  I'll explain near
> > > the end of the Email how to test a range of LBAs "just in case".
> > 
> > Good. That's what I thought too :-)
> > 
> > > I'll take a moment to point out that the error previously seen was a
> > > timeout during a write transaction (WRITE_DMA48).  Recap:
> > > 
> > > > > > ad0: TIMEOUT - WRITE_DMA48 retrying (1 retry left) LBA=395032335
> > > > > > ad0: FAILURE - WRITE_DMA48 status=51<READY,DSC,ERROR> error=10<NID_NOT_FOUND> LBA=395032335
> > > > > > ar0: WARNING - mirror protection lost. RAID1 array in DEGRADED mode
> > > 
> > > The status codes shown (status=51 and error=10) are hexadecimal.  I'm
> > > pointing this out because they aren't preceded by '0x' or '$' and it
> > > clarifies my next point:
> > > 
> > > NID_NOT_FOUND (bit 4 set in the ATA error field) is referred to as IDNF
> > > per ATA6-ACS specification and onward, so I'll refer to it as that.
> > > (I've always wondered why FreeBSD calls this NID_NOT_FOUND; IDFN stands
> > > for ID Not Found, so what's with the extra "N"?  I've always felt this
> > > is a typo...)
> > > 
> > > Using the ATA8-ACS specification working draft (2007/05/21), since it's
> > > more recent, we see the following:
> > > 
> > >   Section 6.2 - Error field
> > >   Section 6.2.4 - ID Not Found (IDNF) bit
> > > 
> > >   Error bit 4. The IDNF bit shall be set to one if a user-accessible
> > >   address was not found. The IDNF bit shall be set to one if an
> > >   address outside of the range of user-accessible addresses is
> > >   requested when command aborted is not returned (see 4.11.3 and
> > >   6.2.1).
> > > 
> > >   Section 4.11 - Host Protected Area (HPA) feature set
> > >   Section 4.11.3 - 28-bit and 48-bit HPA commands
> > > 
> > >   Any read or write command to an address above the maximum address
> > >   specified by the SET MAX ADDRESS or SET MAX ADDRESS EXT command shall
> > >   cause command completion with the IDNF bit set to one and ERR set to
> > >   one, or command aborted.
> > > 
> > > There's no definition of what "address" means in 6.2.4, but the most
> > > logical (pun intended) guess is an LBA.  This error is returned by the
> > > disk (e.g. not a controller-induced error).  I've mentioned this problem
> > > in the past:
> > > 
> > > http://wiki.freebsd.org/JeremyChadwick/ATA_issues_and_troubleshooting
> > > 
> > > I've always read IDNF to mean "OS requested access (read or write) to an
> > > LBA which is out of bounds", where "out of bounds" means "not between 0
> > > and <last LBA>".  How exactly is that possible?  Alexander, do you have
> > > any familiarity with this error code per ATA spec?
> > > 
> > > Matthew, can you provide output from "atacontrol cap ad0"?  Thanks.
> > 
> > Sure thing. See below.
> > [root@meshuga /home/matt]# atacontrol cap ad0
> > 
> > Protocol              SATA revision 2.x
> > device model          WDC WD3200AAKS-00VYA0
> > serial number         WD-WCARW0164427
> > firmware revision     12.01B02
> > cylinders             16383
> > heads                 16
> > sectors/track         63
> > lba supported         268435455 sectors
> > lba48 supported       625142448 sectors
> > dma supported
> > overlap not supported
> > 
> > Feature                      Support  Enable    Value           Vendor
> > write cache                    yes	yes
> > read ahead                     yes	yes
> > Native Command Queuing (NCQ)   yes	 -	31/0x1F
> > Tagged Command Queuing (TCQ)   no	no	31/0x1F
> > SMART                          yes	yes
> > microcode download             yes	yes
> > security                       yes	no
> > power management               yes	yes
> > advanced power management      no	no	0/0x00
> > automatic acoustic management  yes	no	254/0xFE	128/0x80
> > [root@meshuga /home/matt]# 
> > 
> > 
> > > 
> > > Now regarding the LBA tests -- "smartctl -t select,start-end" will do
> > > the trick.  start should be a starting LBA, end should be an ending LBA.
> > > The OS claims that LBA 395032335 is what was requested to be accessed
> > > when the failure happened, so I would recommend picking start/end ranges
> > > around that area.  Remember that a single sector encapsulates a very
> > > large number of blocks (especially given sizes of disks today), so it's
> > > wise to pick a very large range of LBAs.  I would recommend this in your
> > > case:
> > > 
> > > smartctl -t select,390000000,410000000 /dev/ad0
> > 
> > [root@meshuga /home/matt]# smartctl -t
> > select,390000000-410000000 /dev/ad0
> > smartctl 5.39.1 2010-01-28 r3054 [FreeBSD 7.2-RELEASE-p4 i386] (local
> > build)
> > Copyright (C) 2002-10 by Bruce Allen,
> > http://smartmontools.sourceforge.net
> > 
> > === START OF OFFLINE IMMEDIATE AND SELF-TEST SECTION ===
> > Sending command: "Execute SMART Selective self-test routine immediately
> > in off-line mode".
> > SPAN         STARTING_LBA           ENDING_LBA
> >    0            390000000            410000000
> > Drive command "Execute SMART Selective self-test routine immediately in
> > off-line mode" successful.
> > Testing has begun.
> > You have mail in /var/mail/matt
> > [root@meshuga /home/matt]#
> > 
> > ... some time passes polling status with smartctl -a /dev/ad0 ...
> > 
> > [root@meshuga /home/matt]# smartctl -a /dev/ad0
> > smartctl 5.39.1 2010-01-28 r3054 [FreeBSD 7.2-RELEASE-p4 i386] (local
> > build)
> > Copyright (C) 2002-10 by Bruce Allen,
> > http://smartmontools.sourceforge.net
> > 
> > === START OF INFORMATION SECTION ===
> > Model Family:     Western Digital Caviar Blue Serial ATA family
> > Device Model:     WDC WD3200AAKS-00VYA0
> > Serial Number:    WD-WCARW0164427
> > Firmware Version: 12.01B02
> > User Capacity:    320,072,933,376 bytes
> > Device is:        In smartctl database [for details use: -P show]
> > ATA Version is:   8
> > ATA Standard is:  Exact ATA specification draft version not indicated
> > Local Time is:    Fri Jun 18 20:20:17 2010 BST
> > SMART support is: Available - device has SMART capability.
> > SMART support is: Enabled
> > 
> > === START OF READ SMART DATA SECTION ===
> > SMART overall-health self-assessment test result: PASSED
> > 
> > General SMART Values:
> > Offline data collection status:  (0x84)	Offline data collection activity
> > 					was suspended by an interrupting command from host.
> > 					Auto Offline Data Collection: Enabled.
> > Self-test execution status:      (   0)	The previous self-test routine
> > completed
> > 					without error or no self-test has ever 
> > 					been run.
> > Total time to complete Offline 
> > data collection: 		 (8400) seconds.
> > Offline data collection
> > capabilities: 			 (0x7b) SMART execute Offline immediate.
> > 					Auto Offline data collection on/off support.
> > 					Suspend Offline collection upon new
> > 					command.
> > 					Offline surface scan supported.
> > 					Self-test supported.
> > 					Conveyance Self-test supported.
> > 					Selective Self-test supported.
> > SMART capabilities:            (0x0003)	Saves SMART data before entering
> > 					power-saving mode.
> > 					Supports SMART auto save timer.
> > Error logging capability:        (0x01)	Error logging supported.
> > 					General Purpose Logging supported.
> > Short self-test routine 
> > recommended polling time: 	 (   2) minutes.
> > Extended self-test routine
> > recommended polling time: 	 ( 100) minutes.
> > Conveyance self-test routine
> > recommended polling time: 	 (   5) minutes.
> > SCT capabilities: 	       (0x303f)	SCT Status supported.
> > 					SCT Feature Control supported.
> > 					SCT Data Table supported.
> > 
> > SMART Attributes Data Structure revision number: 16
> > Vendor Specific SMART Attributes with Thresholds:
> > ID# ATTRIBUTE_NAME          FLAG     VALUE WORST THRESH TYPE
> > UPDATED  WHEN_FAILED RAW_VALUE
> >   1 Raw_Read_Error_Rate     0x000f   200   200   051    Pre-fail  Always
> > -       0
> >   3 Spin_Up_Time            0x0003   218   150   021    Pre-fail  Always
> > -       2100
> >   4 Start_Stop_Count        0x0032   100   100   000    Old_age   Always
> > -       118
> >   5 Reallocated_Sector_Ct   0x0033   200   200   140    Pre-fail  Always
> > -       0
> >   7 Seek_Error_Rate         0x000e   200   200   051    Old_age   Always
> > -       0
> >   9 Power_On_Hours          0x0032   088   088   000    Old_age   Always
> > -       9320
> >  10 Spin_Retry_Count        0x0012   100   100   051    Old_age   Always
> > -       0
> >  11 Calibration_Retry_Count 0x0012   100   100   051    Old_age   Always
> > -       0
> >  12 Power_Cycle_Count       0x0032   100   100   000    Old_age   Always
> > -       116
> > 192 Power-Off_Retract_Count 0x0032   200   200   000    Old_age   Always
> > -       115
> > 193 Load_Cycle_Count        0x0032   200   200   000    Old_age   Always
> > -       118
> > 194 Temperature_Celsius     0x0022   108   103   000    Old_age   Always
> > -       39
> > 196 Reallocated_Event_Count 0x0032   200   200   000    Old_age   Always
> > -       0
> > 197 Current_Pending_Sector  0x0012   200   200   000    Old_age   Always
> > -       0
> > 198 Offline_Uncorrectable   0x0010   200   200   000    Old_age
> > Offline      -       0
> > 199 UDMA_CRC_Error_Count    0x003e   200   200   000    Old_age   Always
> > -       0
> > 200 Multi_Zone_Error_Rate   0x0008   200   200   051    Old_age
> > Offline      -       0
> > 
> > SMART Error Log Version: 1
> > No Errors Logged
> > 
> > SMART Self-test log structure revision number 1
> > Num  Test_Description    Status                  Remaining
> > LifeTime(hours)  LBA_of_first_error
> > # 1  Selective offline   Completed without error       00%      9319
> > -
> > # 2  Extended offline    Completed without error       00%      9299
> > -
> > # 3  Short offline       Completed without error       00%      9298
> > -
> > 
> > SMART Selective self-test log data structure revision number 1
> >  SPAN    MIN_LBA    MAX_LBA  CURRENT_TEST_STATUS
> >     1  390000000  410000000  Not_testing
> >     2          0          0  Not_testing
> >     3          0          0  Not_testing
> >     4          0          0  Not_testing
> >     5          0          0  Not_testing
> > Selective self-test flags (0x0):
> >   After scanning selected spans, do NOT read-scan remainder of disk.
> > If Selective self-test is pending on power-up, resume after 0 minute
> > delay.
> > 
> > 
> > > I would highly recommend doing this with the disk not doing any I/O,
> > > though it won't hurt it (it'll just delay the scan).  "smartctl -a" will
> > > show the state of things in the "SMART Selective self-test log" at the
> > > bottom, or somewhere else within the output (depends on the drive).
> > > 
> > > This should, in my opinion, rule out whether or not there's a bad block
> > > or something along those lines within said range.  Given what I believe
> > > IDNF represents, I would say your scan will probably come back clean.
> > > Also remember that the scan performed here is a *disk-level scan*; the
> > > disk firmware itself is doing it (the OS isn't involved).  This helps
> > > rule out any sort of "weird" issues that the OS may be reporting ("hey
> > > man, LBA 8943943983492893428932489324 is bad!"  "Yeah sure it is").
> > >  
> > > > The two devices in the array are on channels 0 and 1. There is indeed a
> > > > second drive on channel 0 (160G). As I said above, I use that as an
> > > > additional back up device but it's not part of the array.
> > > 
> > > Okay, so executing "atacontrol detach ata0" will cause you to lose both
> > > ad0 and ad1.  If you can live with that, then cool.
> > > 
> > > > > What motherboard is this?  Can you change the setting to either
> > > > > "Native", "Enhanced", or (even better) "AHCI"?  I've seen some systems
> > > > > where the Serial ATA option in the BIOS has an "Auto" option, which does
> > > > > totally bizarre things at times.
> > > > 
> > > > I think this has been covered in subsequent postings. I could try it but
> > > > as you say below, I'd like to resolve the disk issue first.
> > > > ...
> > > > > The atacontrol man page covers your situation:
> > > > > ...
> > > > I don't think this is the case for me since ad0 and ad2 are on seperate
> > > > ata channels.
> > > > ...
> > > > Indeed but my hw doesn't have hot-swap capability (at the moment!).
> > > 
> > > That's the problem -- we're not sure if this really is a disk issue.
> > > It's been reported before, others have reported solving it by increasing
> > > ATA timeout values, etc...  But the fact of the matter is, that error
> > > code is being returned by the device.
> > 
> > Indeed. if I look back through my logs the error has happened quite a
> > few times over the course of several weeks before atacontrol took action
> > and offlined the drive:
> > 
> > [root@meshuga /home/matt]# zcat /var/log/messages.0.bz2 | grep ad0
> > Mar 14 21:17:33 meshuga kernel: ad0: 305245MB <WDC WD3200AAKS-00VYA0
> > 12.01B02> at ata0-master SATA300
> > Mar 14 21:17:33 meshuga kernel: GEOM_LABEL: Label for provider ad0s1a is
> > ufsid/47284adaeab4f771.
> > Mar 14 21:17:33 meshuga kernel: GEOM_LABEL: Label for provider ad0s1d is
> > ufsid/47284ae5159318bb.
> > Mar 14 21:17:33 meshuga kernel: GEOM_LABEL: Label for provider ad0s1e is
> > ufsid/47284ada8adb1df9.
> > Mar 14 21:17:33 meshuga kernel: GEOM_LABEL: Label for provider ad0s1f is
> > ufsid/47284adaa202995d.
> > Mar 14 21:17:33 meshuga kernel: ar0: disk0 READY (master) using ad0 at
> > ata0-master
> > Mar 28 17:59:59 meshuga kernel: ad0: 305245MB <WDC WD3200AAKS-00VYA0
> > 12.01B02> at ata0-master SATA300
> > Mar 28 17:59:59 meshuga kernel: GEOM_LABEL: Label for provider ad0s1a is
> > ufsid/47284adaeab4f771.
> > Mar 28 17:59:59 meshuga kernel: GEOM_LABEL: Label for provider ad0s1d is
> > ufsid/47284ae5159318bb.
> > Mar 28 17:59:59 meshuga kernel: GEOM_LABEL: Label for provider ad0s1e is
> > ufsid/47284ada8adb1df9.
> > Mar 28 17:59:59 meshuga kernel: GEOM_LABEL: Label for provider ad0s1f is
> > ufsid/47284adaa202995d.
> > Mar 28 17:59:59 meshuga kernel: ar0: disk0 READY (master) using ad0 at
> > ata0-master
> > May  1 03:03:23 meshuga kernel: ad0: TIMEOUT - READ_DMA retrying (1
> > retry left) LBA=20472527
> > May 30 03:01:54 meshuga kernel: ad0: TIMEOUT - READ_DMA retrying (1
> > retry left) LBA=24996179
> > Jun  1 03:02:47 meshuga kernel: ad0: TIMEOUT - READ_DMA retrying (1
> > retry left) LBA=27745599
> > Jun  1 03:03:35 meshuga kernel: ad0: TIMEOUT - READ_DMA retrying (1
> > retry left) LBA=29127787
> > Jun  1 03:03:41 meshuga kernel: ad0: TIMEOUT - READ_DMA retrying (1
> > retry left) LBA=29131871
> > Jun  1 05:00:10 meshuga kernel: ad0: TIMEOUT - READ_DMA retrying (1
> > retry left) LBA=21646831
> > Jun  2 03:02:02 meshuga kernel: ad0: TIMEOUT - READ_DMA retrying (1
> > retry left) LBA=25019047
> > Jun  2 03:02:09 meshuga kernel: ad0: TIMEOUT - READ_DMA retrying (1
> > retry left) LBA=24262079
> > Jun  2 03:02:25 meshuga kernel: ad0: TIMEOUT - READ_DMA retrying (1
> > retry left) LBA=26503067
> > Jun  2 03:02:45 meshuga kernel: ad0: TIMEOUT - READ_DMA retrying (1
> > retry left) LBA=24991895
> > Jun  2 03:03:22 meshuga kernel: ad0: TIMEOUT - READ_DMA retrying (1
> > retry left) LBA=26526099
> > Jun  2 03:03:34 meshuga kernel: ad0: TIMEOUT - READ_DMA retrying (1
> > retry left) LBA=21229415
> > Jun  2 03:03:46 meshuga kernel: ad0: TIMEOUT - READ_DMA retrying (1
> > retry left) LBA=20498839
> > Jun  2 03:04:07 meshuga kernel: ad0: TIMEOUT - READ_DMA retrying (1
> > retry left) LBA=29996295
> > Jun  2 03:04:14 meshuga kernel: ad0: TIMEOUT - READ_DMA retrying (1
> > retry left) LBA=27617575
> > Jun  2 03:04:40 meshuga kernel: ad0: TIMEOUT - READ_DMA retrying (1
> > retry left) LBA=29939883
> > Jun  2 03:04:48 meshuga kernel: ad0: TIMEOUT - READ_DMA retrying (1
> > retry left) LBA=22747195
> > Jun  7 05:04:22 meshuga kernel: ad0: TIMEOUT - READ_DMA retrying (1
> > retry left) LBA=21220491
> > [r
> > 
> > And most recently:
> > 
> > [root@meshuga /home/matt]# cat /var/log/messages | grep ad0
> > Jun 13 03:01:09 meshuga kernel: ad0: TIMEOUT - READ_DMA retrying (1
> > retry left) LBA=28825859
> > Jun 13 03:01:17 meshuga kernel: ad0: TIMEOUT - READ_DMA retrying (1
> > retry left) LBA=30662831
> > Jun 17 05:00:58 meshuga kernel: ad0: TIMEOUT - WRITE_DMA48 retrying (1
> > retry left) LBA=395032335
> > Jun 17 05:00:58 meshuga kernel: ad0: FAILURE - WRITE_DMA48
> > status=51<READY,DSC,ERROR> error=10<NID_NOT_FOUND> LBA=395032335
> > Jun 17 05:02:20 meshuga kernel: ad0: WARNING - WRITE_DMA taskqueue
> > timeout - completing request directly
> > Jun 17 05:02:20 meshuga kernel: ad0: WARNING - WRITE_DMA48 freeing
> > taskqueue zombie request
> > 
> > On first glance there was a single failure which has progressively got
> > worse... This did make me suspect there was indeed a problem with the
> > drive.
> > 
> > > Speaking generally about disk replacements on your system -- when I say
> > > generally, I do mean generally and *not* in regards to the specific
> > > situation reported:
> > > 
> > > Since there's no AHCI in use, we should just assume that a power-down of
> > > the system is the safest way to go about a disk replacement.  Follow
> > > that procedure in the future and you should be fine.  If you ever get a
> > > hot-swap backplane, you absolutely should use AHCI; hot-swap, especially
> > > on an Intel controller (FreeBSD is tested pretty thoroughly on Intel
> > > ICHxx and ESBx controllers), will work fine in that case.
> > > 
> > > If you do go the AHCI route, and eventually upgrade to RELENG_8 down the
> > > line, I highly recommend you load kernel module ahci.ko (instead of the
> > > default/historic ataahci.ko).  This will get you NCQ support amongst
> > > other things.
> > > 
> > 
> > I've actually got 8.0-RELEASE running on a similar box (no hot-swap back
> > plane) with AHCI. I think it's a newer generation of HP Proliant. That
> > runs RAID1 too. I had been contemplating upgrading motherboards to allow
> > hot-swap. I think I might have more than just a think about it now :-)

I don't really have any other thoughts on the matter, sadly.  I don't
think replacing the disk makes all that much sense since there's no
indication the disk itself is going bad.  I'm really quite adamant about
stating that up front -- there really doesn't appear to be any
indication that the disk is faulty in any way.

Andriy's idea that there may be LBAs written somewhere on the disk (such
as within the filesystem) which refer to incorrect locations is
plausible.  I don't know how to verify the existence of such, however.
fsdb(8) might help with this, but I'm not familiar with the utility.

The recommendations I have, in no particular order, are as follows (
Meaning these are what I'd try):

1) Boot single-user and fsck -f all the filesystems that use /dev/ar0
(not a typo).  There may be some underlying filesystem corruption which
could explain what's going on.  But AFAIK fsck doesn't check integrity
of the data portions of files, only the allocation tables/structures, so
this may come back clean.  ZFS would be able to detect "silent
corruption" of this nature.

If any errors come back, please provide them.  There have been cases in
the past (I can provide mailing list references) where background fsck
didn't catch certain things (which is why I use background_fsck="no" in
rc.conf).  I don't know if that problem/ordeal got fixed or not.

2) Format both disks (ad0 and ad2).  dd if=/dev/zero of=/dev/adX bs=64k
should be sufficient for this.  You really do want to let the dd go from
start to finish (we do know that metadata is stored at the end of the
disk in some cases).  This will take quite some time.  You can do it on
both disks at the same time though.

3) Try upgrading to RELENG_8 (preferably 8.1-RC1) to see if the problem
continues there.  I'd recommend doing item #2 above before doing this,
rather than a 7.x -> 8.1 in-place upgrade.  We do know for certain
that GEOM has changed (e.g. "c" partition no longer exists), so I
tend to advocate formatting rather than using existing filesystems.

4) Switch from ata(4) software RAID-1 to gmirror or ZFS mirror.  I'm
pretty sure FreeBSD can boot from both (not 100% certain on ZFS though;
others can answer that).

5) Replace disk ad0.  This would be my last choice, since there's no
indication the disk is actually at fault nor indication of
disk<->controller communication failures.

Anyone else have ideas/recommendations?

-- 
| Jeremy Chadwick                                   jdc@parodius.com |
| Parodius Networking                       http://www.parodius.com/ |
| UNIX Systems Administrator                  Mountain View, CA, USA |
| Making life hard for others since 1977.              PGP: 4BD6C0CB |




Want to link to this message? Use this URL: <https://mail-archive.FreeBSD.org/cgi/mid.cgi?20100622074541.GA71157>