Date: Mon, 07 Jun 2010 12:28:42 +0300
From: Andriy Gapon <avg@icyb.net.ua>
To: Jeremy Chadwick <freebsd@jdc.parodius.com>
Cc: freebsd-fs@freebsd.org
Subject: Re: zfs i/o error, no driver error
Message-ID: <4C0CBBCA.3050304@icyb.net.ua>
In-Reply-To: <20100607090850.GA49166@icarus.home.lan>
References: <4C0CAABA.2010506@icyb.net.ua> <20100607083428.GA48419@icarus.home.lan> <4C0CB3FC.8070001@icyb.net.ua> <20100607090850.GA49166@icarus.home.lan>

on 07/06/2010 12:08 Jeremy Chadwick said the following:
> On Mon, Jun 07, 2010 at 11:55:24AM +0300, Andriy Gapon wrote:
>> on 07/06/2010 11:34 Jeremy Chadwick said the following:
>>> On Mon, Jun 07, 2010 at 11:15:54AM +0300, Andriy Gapon wrote:
>>>> During recent zpool scrub one read error was detected and "128K repaired".
>>>>
>>>> In system log I see the following message:
>>>> ZFS: vdev I/O failure, zpool=tank
>>>> path=/dev/gptid/536c6f78-e4f3-11de-b9f8-001cc08221ff offset=284456910848
>>>> size=131072 error=5
>>>>
>>>> On the other hand, there are no other errors, nothing from geom, ahci, etc.
>>>> Why would that happen? What kind of error could this be?
>>> I believe this indicates silent data corruption[1], which ZFS can
>>> auto-correct if the pool is a mirror or raidz (otherwise it can detect
>>> the problem but not fix it).
>> This pool is a mirror.
>>
>>> This can happen for a lot of reasons, but
>>> tracking down the source is often difficult.  Usually it indicates the
>>> disk itself has some kind of problem (cache going bad, some sector
>>> remaps which didn't happen or failed, etc.).
>> Please note that this is not a CKSUM error, but a READ error.
>
> Okay, then it indicates reading some data off the disk failed.  ZFS
> auto-corrected it by reading the data from the other member in the pool
> (ada0p4).  That's confirmed here:

Yes, right, of course.
If you read my original post you'll see that my question was: why did ZFS see
an I/O error when the disk/controller/geom/etc. drivers didn't see it?
I do not see us moving towards an answer to that.

>> status: One or more devices has experienced an unrecoverable error.  An
>> attempt was made to correct the error.  Applications are unaffected.
>>
>> NAME                                            STATE     READ WRITE CKSUM
>> tank                                            ONLINE       0     0     0
>>   mirror                                        ONLINE       0     0     0
>>     ada0p4                                      ONLINE       0     0     0
>>     gptid/536c6f78-e4f3-11de-b9f8-001cc08221ff  ONLINE       1     0     0  128K repaired
>
>>> - Full "smartctl -a /dev/XXX" for all disk members of zpool "tank"
>> Those outputs for both disks are "perfect".
>> I monitor them regularly, also smartd is running and there are no
>> complaints from it.
>
> Most people I know of do not know how to interpret SMART statistics, and
> that's not their fault -- and that's why I requested them.  :-)

I'll leave this without a comment.

> In this
> case, I'd like to see "smartctl -a" output for the disk that's
> associated with the above GPT ID.  There may be some attributes or data
> in the SMART error log which could indicate what's going on.  smartd
> does not know how to interpret data; it just logs what it sees.

$ smartctl -a /dev/ada1
smartctl 5.39.1 2010-01-28 r3054 [FreeBSD 8.1-PRERELEASE amd64] (local build)
Copyright (C) 2002-10 by Bruce Allen, http://smartmontools.sourceforge.net

=== START OF INFORMATION SECTION ===
Model Family:     Western Digital Caviar Blue Serial ATA family
Device Model:     WDC WD5000AAKS-00A7B2
Serial Number:    WD-WMASY6905909
Firmware Version: 01.03B01
User Capacity:    500,107,862,016 bytes
Device is:        In smartctl database [for details use: -P show]
ATA Version is:   8
ATA Standard is:  Exact ATA specification draft version not indicated
Local Time is:    Mon Jun  7 11:53:50 2010 EEST
SMART support is: Available - device has SMART capability.
SMART support is: Enabled

=== START OF READ SMART DATA SECTION ===
SMART overall-health self-assessment test result: PASSED

General SMART Values:
Offline data collection status:  (0x84) Offline data collection activity
                                        was suspended by an interrupting command from host.
                                        Auto Offline Data Collection: Enabled.
Self-test execution status:      (   0) The previous self-test routine completed
                                        without error or no self-test has ever
                                        been run.
Total time to complete Offline
data collection:                 (11160) seconds.
Offline data collection
capabilities:                    (0x7b) SMART execute Offline immediate.
                                        Auto Offline data collection on/off support.
                                        Suspend Offline collection upon new
                                        command.
                                        Offline surface scan supported.
                                        Self-test supported.
                                        Conveyance Self-test supported.
                                        Selective Self-test supported.
SMART capabilities:            (0x0003) Saves SMART data before entering
                                        power-saving mode.
                                        Supports SMART auto save timer.
Error logging capability:        (0x01) Error logging supported.
                                        General Purpose Logging supported.
Short self-test routine
recommended polling time:        (   2) minutes.
Extended self-test routine
recommended polling time:        ( 131) minutes.
Conveyance self-test routine
recommended polling time:        (   5) minutes.
SCT capabilities:              (0x303f) SCT Status supported.
                                        SCT Feature Control supported.
                                        SCT Data Table supported.

SMART Attributes Data Structure revision number: 16
Vendor Specific SMART Attributes with Thresholds:
ID# ATTRIBUTE_NAME          FLAG     VALUE WORST THRESH TYPE      UPDATED  WHEN_FAILED RAW_VALUE
  1 Raw_Read_Error_Rate     0x002f   200   200   051    Pre-fail  Always       -       0
  3 Spin_Up_Time            0x0027   169   160   021    Pre-fail  Always       -       4516
  4 Start_Stop_Count        0x0032   100   100   000    Old_age   Always       -       53
  5 Reallocated_Sector_Ct   0x0033   200   200   140    Pre-fail  Always       -       0
  7 Seek_Error_Rate         0x002e   100   253   000    Old_age   Always       -       0
  9 Power_On_Hours          0x0032   086   086   000    Old_age   Always       -       10385
 10 Spin_Retry_Count        0x0032   100   253   000    Old_age   Always       -       0
 11 Calibration_Retry_Count 0x0032   100   253   000    Old_age   Always       -       0
 12 Power_Cycle_Count       0x0032   100   100   000    Old_age   Always       -       30
192 Power-Off_Retract_Count 0x0032   200   200   000    Old_age   Always       -       25
193 Load_Cycle_Count        0x0032   200   200   000    Old_age   Always       -       52
194 Temperature_Celsius     0x0022   102   088   000    Old_age   Always       -       45
196 Reallocated_Event_Count 0x0032   200   200   000    Old_age   Always       -       0
197 Current_Pending_Sector  0x0032   200   200   000    Old_age   Always       -       0
198 Offline_Uncorrectable   0x0030   200   200   000    Old_age   Offline      -       0
199 UDMA_CRC_Error_Count    0x0032   200   200   000    Old_age   Always       -       0
200 Multi_Zone_Error_Rate   0x0008   200   200   000    Old_age   Offline      -       0

SMART Error Log Version: 1
No Errors Logged

SMART Self-test log structure revision number 1
Num  Test_Description    Status                  Remaining  LifeTime(hours)  LBA_of_first_error
# 1  Short offline       Completed without error       00%     10331         -
# 2  Extended offline    Completed without error       00%     10237         -
# 3  Short offline       Completed without error       00%     10165         -
# 4  Short offline       Completed without error       00%      9999         -
# 5  Short offline       Completed without error       00%      9830         -
# 6  Short offline       Completed without error       00%      9662         -
# 7  Extended offline    Completed without error       00%      9496         -
# 8  Short offline       Completed without error       00%      9327         -
# 9  Short offline       Completed without error       00%      9159         -
#10  Short offline       Completed without error       00%      8992         -
#11  Short offline       Completed without error       00%      8824         -
#12  Extended offline    Completed without error       00%      8778         -
#13  Short offline       Completed without error       00%      8657         -
#14  Short offline       Completed without error       00%      8489         -
#15  Short offline       Completed without error       00%      8154         -
#16  Extended offline    Completed without error       00%      8036         -
#17  Short offline       Completed without error       00%      7986         -
#18  Short offline       Completed without error       00%      7819         -
#19  Short offline       Completed without error       00%      7651         -
#20  Extended offline    Completed without error       00%      7366         -
#21  Short offline       Completed without error       00%      7316         -

SMART Selective self-test log data structure revision number 1
 SPAN  MIN_LBA  MAX_LBA  CURRENT_TEST_STATUS
    1        0        0  Not_testing
    2        0        0  Not_testing
    3        0        0  Not_testing
    4        0        0  Not_testing
    5        0        0  Not_testing
Selective self-test flags (0x0):
  After scanning selected spans, do NOT read-scan remainder of disk.
If Selective self-test is pending on power-up, resume after 0 minute delay.

-- 
Andriy Gapon
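
Editorial aside, not part of the original message: the fields of the "vdev I/O failure" log line quoted above can be decoded mechanically, which may help when comparing the failed range with other logs. The following is only a rough sketch under stated assumptions. The 512-byte sector size is an assumption, the helper names parse_vdev_failure and sector_range are made up for illustration, and the resulting sector numbers are relative to the start of the vdev (here a GPT partition), not to the whole disk. The partition's starting LBA is not given in this thread, so no absolute disk LBA is computed.

```python
import errno
import re

SECTOR_SIZE = 512  # assumption: 512-byte logical sectors

# The log message exactly as quoted in the thread, joined onto one line.
LOG_LINE = (
    "ZFS: vdev I/O failure, zpool=tank "
    "path=/dev/gptid/536c6f78-e4f3-11de-b9f8-001cc08221ff "
    "offset=284456910848 size=131072 error=5"
)


def parse_vdev_failure(line):
    """Return the key=value fields of the log line as a dict (hypothetical helper)."""
    fields = dict(re.findall(r"(\w+)=(\S+)", line))
    for key in ("offset", "size", "error"):
        fields[key] = int(fields[key])
    return fields


def sector_range(offset, size, sector_size=SECTOR_SIZE):
    """Return (first, last) sectors covered by the byte range, relative to the vdev start."""
    first = offset // sector_size
    last = (offset + size - 1) // sector_size
    return first, last


info = parse_vdev_failure(LOG_LINE)
first, last = sector_range(info["offset"], info["size"])

# errno.errorcode maps the numeric error to its symbolic name.
print("pool:", info["zpool"])
print("error:", errno.errorcode[info["error"]])
print("sectors (relative to vdev):", first, "to", last)
print("size in KiB:", info["size"] // 1024)
```

The size of 131072 bytes is 128 KiB, which matches the "128K repaired" figure in the zpool status output, and error=5 is the generic EIO code, so the log line by itself does not say which layer produced the failure.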
