Date: Sat, 20 Aug 2011 13:19:13 -0700
From: Jeremy Chadwick <freebsd@jdc.parodius.com>
To: Dan Langille <dan@langille.org>
Cc: freebsd-stable@freebsd.org
Subject: Re: bad sector in gmirror HDD
Message-ID: <20110820201913.GA39827@icarus.home.lan>
In-Reply-To: <20110820195702.GA39109@icarus.home.lan>
References: <1B4FC0D8-60E6-49DA-BC52-688052C4DA51@langille.org>
 <20110819232125.GA4965@icarus.home.lan>
 <B6B0AD0F-A74C-4F2C-88B0-101443D7831A@langille.org>
 <20110820032438.GA21925@icarus.home.lan>
 <4774BC00-F32B-4BF4-A955-3728F885CAA1@langille.org>
 <20110820195702.GA39109@icarus.home.lan>

A follow-up given that I just viewed the SMART attribute data at the
very bottom of this page as of this writing (Sat Aug 20 13:00:09 PDT
2011):

http://beta.freebsddiary.org/smart-fixing-bad-sector.php

And I see this:

ID# ATTRIBUTE_NAME          FLAG     VALUE WORST THRESH TYPE      UPDATED  WHEN_FAILED RAW_VALUE
  5 Reallocated_Sector_Ct   0x0033   100   100   020    Pre-fail  Always       -       2
  9 Power_On_Hours          0x0012   059   059   001    Old_age   Always       -       27440
196 Reallocated_Event_Count 0x0010   099   099   020    Old_age   Offline      -       1
197 Current_Pending_Sector  0x0032   100   100   020    Old_age   Always       -       2
198 Offline_Uncorrectable   0x0010   100   253   000    Old_age   Offline      -       0

These attributes USUALLY mean:

1) Reallocated_Sector_Ct == There are 2 remapped LBAs.

2) Reallocated_Event_Count == There is 1 remapping event which has been
   noticed (either failure or success).

3) Current_Pending_Sector == There are 2 LBAs which are suspect.

Now, given my previous statement about this particular model of drive,
Maxtor may have a firmware quirk or other oddities that don't cause
Current_Pending_Sector to drop to 0 or Reallocated_Event_Count to match
reality.  I simply don't know.  But keep reading.

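For anyone who wants to track just these counters between runs, here is a minimal sketch that pulls the raw values for attributes 5, 196, 197 and 198 out of "smartctl -A"-style output.  The function name smart_sector_attrs is made up for illustration, and the sample rows are copied from the table above rather than read from a live drive.  It assumes the usual ten-column layout, with RAW_VALUE starting in field 10.

```shell
#!/bin/sh
# Hypothetical helper: print "ID NAME RAW_VALUE" for the sector-health
# attributes found in `smartctl -A`-style text on stdin.
# Assumes the ten-column layout shown above (RAW_VALUE begins at field 10).
smart_sector_attrs() {
    awk '$1 == 5 || $1 == 196 || $1 == 197 || $1 == 198 { print $1, $2, $10 }'
}

# Sample rows copied from the message (not live drive output).
sample='ID# ATTRIBUTE_NAME          FLAG     VALUE WORST THRESH TYPE      UPDATED  WHEN_FAILED RAW_VALUE
  5 Reallocated_Sector_Ct   0x0033   100   100   020    Pre-fail  Always       -       2
  9 Power_On_Hours          0x0012   059   059   001    Old_age   Always       -       27440
196 Reallocated_Event_Count 0x0010   099   099   020    Old_age   Offline      -       1
197 Current_Pending_Sector  0x0032   100   100   020    Old_age   Always       -       2
198 Offline_Uncorrectable   0x0010   100   253   000    Old_age   Offline      -       0'

printf '%s\n' "$sample" | smart_sector_attrs
```

On a real system the input would presumably come from something like "smartctl -A /dev/ad2 | smart_sector_attrs"; saving the output after each test makes it easy to diff the counters over time.
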
And remember, this is what we started with:

ID# ATTRIBUTE_NAME          FLAG     VALUE WORST THRESH TYPE      UPDATED  WHEN_FAILED RAW_VALUE
  5 Reallocated_Sector_Ct   0x0033   100   100   020    Pre-fail  Always       -       1
  9 Power_On_Hours          0x0012   059   059   001    Old_age   Always       -       27416
196 Reallocated_Event_Count 0x0010   100   100   020    Old_age   Offline      -       0
197 Current_Pending_Sector  0x0032   100   100   020    Old_age   Always       -       1
198 Offline_Uncorrectable   0x0010   100   253   000    Old_age   Offline      -       0

Anyway, in the SMART error log, I see 3 entries (2 new ones since the
last time I saw the web page):

* Error 3 occurred at disk power-on lifetime: 27422 hours (1142 days + 14 hours)
  40 59 18 e8 ef 54 e0  Error: UNC 24 sectors at LBA = 0x0054efe8 = 5566440

* Error 2 occurred at disk power-on lifetime: 27421 hours (1142 days + 13 hours)
  40 59 18 e8 ef 54 e0  Error: UNC 24 sectors at LBA = 0x0054efe8 = 5566440

* Error 1 occurred at disk power-on lifetime: 27400 hours (1141 days + 16 hours)
  40 59 18 e8 ef 54 e0  Error: UNC 24 sectors at LBA = 0x0054efe8 = 5566440

These are all for the same LBA -- 5566440.  "Error 1" was something we
already saw on the page the first time.  So where did the other two come
from?  Earlier on the web page I saw these commands being executed:

sh ./bad_block_scan /dev/ad2 5566400 5566500   <-- will hit bad LBA
sh ./bad_block_scan /dev/ad2 5566000 5566500   <-- will hit bad LBA
sh ./bad_block_scan /dev/ad2 5560000 5566000   <-- will not hit bad LBA
sh ./bad_block_scan /dev/ad2 5560000 5566000   <-- will not hit bad LBA

So there's the explanation for the two newly-added entries in the SMART
error log: the first two scans each read LBA 5566440 once.

I'd be very surprised if bad_block_scan did not echo that it had
encountered read errors on LBA 5566440.  It should have, unless I left
the script in some weird state.

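To double-check the arithmetic above, here is a small sketch that rebuilds the LBA from the register bytes in the error log and tests which scan ranges cover it.  It assumes the byte order smartctl prints for 28-bit commands (ER ST SC SN CL CH DH), with the low nibble of DH holding LBA bits 24-27.  The function names are made up for illustration, and treating the end of each bad_block_scan range as inclusive is an assumption; it makes no difference for these particular ranges.

```shell
#!/bin/sh
# Hypothetical helper: decode a 28-bit LBA from the seven register bytes
# of a SMART error log entry, given in the order ER ST SC SN CL CH DH.
lba_from_regs() {
    # $4 = SN (bits 0-7), $5 = CL (bits 8-15), $6 = CH (bits 16-23),
    # $7 = DH (only the low nibble carries LBA bits 24-27)
    echo $(( ((0x$7 & 0x0f) << 24) | (0x$6 << 16) | (0x$5 << 8) | 0x$4 ))
}

# Hypothetical helper: succeed if LBA lies within START..END.
# Assumption: the range end is inclusive.
range_hits_lba() {   # usage: range_hits_lba START END LBA
    [ "$3" -ge "$1" ] && [ "$3" -le "$2" ]
}

lba=$(lba_from_regs 40 59 18 e8 ef 54 e0)
echo "decoded LBA: $lba"

# The three distinct ranges scanned on the web page.
for r in "5566400 5566500" "5566000 5566500" "5560000 5566000"; do
    set -- $r
    if range_hits_lba "$1" "$2" "$lba"; then
        echo "$1-$2: covers LBA $lba"
    else
        echo "$1-$2: does not cover LBA $lba"
    fi
done
```

The SC byte (0x18) is 24 in decimal, which lines up with the "24 sectors" in the log line.
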
The commands to use to verify would be:

dd if=/dev/ad2 of=/dev/null bs=512 count=1 skip=5566439
dd if=/dev/ad2 of=/dev/null bs=512 count=1 skip=5566440
dd if=/dev/ad2 of=/dev/null bs=512 count=1 skip=5566441

(I tend to check "around" that LBA area as well, just to make sure,
which is why there are 3 commands with -1 and +1 LBAs).

One of these should return an I/O error, unless the LBA has been
remapped already, in which case it shouldn't.

Finally, there's this very interesting piece of information in the SMART
self-test log (not selective scan log, but the self-test log; meaning
this was the result of "smartctl -t long /dev/ad2" at some point):

Num  Test_Description    Status                  Remaining  LifeTime(hours)  LBA_of_first_error
# 1  Extended offline    Completed: read failure       90%     27416         786767

So it seems this is one of those drives which does do a surface scan on
a long test.  But that's interesting -- LBA 786767.  If that's true,
then issuing the same dd commands as above (but with "skip" changed
appropriately) should return an I/O error as well.  Naturally check the
SMART error log for verification.

So, it's possible that there are actually two bad LBAs on this drive --
LBA 5566440 and LBA 786767.  I simply don't know about the latter, but
the former is confirmed in the SMART error log.

If either of these LBAs are the ones which Current_Pending_Sector is
referring to, then writes to them should be sufficient to induce
re-analysis.  E.g.:

dd if=/dev/zero of=/dev/ad2 bs=512 count=1 seek=5566440
dd if=/dev/zero of=/dev/ad2 bs=512 count=1 seek=786767

The offsets for seek (not skip!!!) should probably be based on what the
dd reads done earlier would show.

Unless of course what we're seeing is just a batch of LBAs in a small
region that are getting worse the more they're read from (possible).  No
idea if LBA 5566440 and LBA 786767 are anywhere near one another on the
physical media.  I don't have a way to determine that (way too complex).

That's about all the light I can shed on this for now.

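As a convenience for the second LBA, the read checks described above can be generated rather than typed by hand.  This is only a sketch: the function name dd_read_checks is made up, it assumes 512-byte sectors (the same bs=512 as the commands above), and it only prints the dd commands so they can be reviewed before anything touches the disk.

```shell
#!/bin/sh
# Hypothetical helper: print (do not run) single-sector read checks for
# an LBA and its two immediate neighbours.
# Assumption: 512-byte sectors, matching bs=512 in the original commands.
dd_read_checks() {   # usage: dd_read_checks DEVICE LBA
    for lba in $(( $2 - 1 )) "$2" $(( $2 + 1 )); do
        echo "dd if=$1 of=/dev/null bs=512 count=1 skip=$lba"
    done
}

# LBA reported by the extended self-test.
dd_read_checks /dev/ad2 786767
```

Only reads are generated here on purpose.  The write form (seek=, with if=/dev/zero) destroys whatever is in that sector, so it seems safer to type that one by hand once the reads have shown which LBA actually fails.
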
-- 
| Jeremy Chadwick                                jdc at parodius.com |
| Parodius Networking                       http://www.parodius.com/ |
| UNIX Systems Administrator                   Mountain View, CA, US |
| Making life hard for others since 1977.               PGP 4BD6C0CB |