Date: Mon, 12 Jul 2010 20:10:20 -0700
From: Jeremy Chadwick
To: Dmitry Lunts
Cc: freebsd-fs@freebsd.org
Subject: Re: fsdb&smartctl&/var/log/messages
Message-ID: <20100713031020.GA38051@icarus.home.lan>
References: <20100712150347.GA12747@icarus.home.lan>

(Re-adding the mailing list to the CC list)

On Tue, Jul 13, 2010 at 05:15:32AM +0400, Dmitry Lunts wrote:
> OK. See below. The output is too long, so General SMART values are skipped.
>
> [...]
>
> ID# ATTRIBUTE_NAME          FLAG     VALUE WORST THRESH TYPE      UPDATED  WHEN_FAILED RAW_VALUE
>   5 Reallocated_Sector_Ct   0x0033   100   100   036    Pre-fail  Always       -       0
> 187 Reported_Uncorrect      0x0032   001   001   000    Old_age   Always       -       1297
> 197 Current_Pending_Sector  0x0012   100   100   000    Old_age   Always       -       3
> 198 Offline_Uncorrectable   0x0010   100   100   000    Old_age   Offline      -       3
> [...]
> ATA Error Count: 1287 (device log contains only the most recent five errors)
> [...]

And here lies your problem. You have 3 LBAs on your drive which
experienced errors during their lifetime and couldn't be automatically
corrected. They're labelled as "pending" until some write operations to
those LBAs are attempted (and there's no guarantee that will work
either; more on that later).

Attribute 187 is one I haven't seen before (I don't use Seagate drives),
but it indicates the number of read or write transactions to the disk
itself which *could not* be auto-corrected with hardware ECC. It's a
counter, so it's very possible that continuous access to the bad LBAs is
responsible for the counter being so high.

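For watching just the attributes discussed above (5, 187, 197 and 198)
over time, a small filter on the attribute table may be handy. This is
only a sketch: the function name smart_key_attrs is made up for
illustration, /dev/ad6 is the device name used elsewhere in this thread,
and it assumes the raw value is the last field on each line, as it is in
the output quoted above.

```shell
#!/bin/sh
# smart_key_attrs (hypothetical helper): read `smartctl -A` output on stdin
# and print the ID, name and raw value of attributes 5, 187, 197 and 198.
# Assumption: the raw value is the last whitespace-separated field.
smart_key_attrs() {
    awk '$1 == 5 || $1 == 187 || $1 == 197 || $1 == 198 { print $1, $2, $NF }'
}

# Example usage (needs root; device name assumed from this thread):
#   smartctl -A /dev/ad6 | smart_key_attrs
```

If a rewrite of the bad LBAs succeeds, the raw values printed for 197 and
198 would be expected to drop; 187 is a counter and will not go back down.
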
Now what's interesting is that your SMART self-test log indicates you
actually have 4 bad LBAs: 4007996, 102121619, 110518042, and 195230321:

> Num  Test_Description    Status                  Remaining  LifeTime(hours)  LBA_of_first_error
> # 1  Extended offline    Completed: read failure       90%      7391         4007996
> # 2  Extended offline    Completed: read failure       90%      7376         195230321
> # 3  Extended offline    Completed: read failure       90%      7369         4007996
> # 4  Extended offline    Completed: read failure       90%      7346         4007996
> # 6  Extended offline    Completed: read failure       90%      7329         110518042
> # 7  Selective offline   Completed: read failure       90%      7302         102121619
> # 8  Extended offline    Completed: read failure       90%      7301         102121619
> # 9  Extended offline    Completed: read failure       90%      7297         102121619
> #10  Selective offline   Completed: read failure       90%      6817         195230321
> #11  Selective offline   Completed: read failure       90%      6817         195230321
> #12  Extended offline    Completed: read failure       50%      6817         195230321
> #15  Extended offline    Completed: read failure       50%      5035         195230321

First things first: I hope you have backups. I realise you're trying to
work out what files got damaged, but the easiest way to do that is to
attempt to read the files -- try using rsync or cpdup on all the
filesystems (write the data to /dev/null) and look for I/O errors.

At this point my recommendation to you is simple: replace/RMA the disk.
Really. You have I/O errors across three completely non-sequential
areas of the disk (maybe dust?). If you don't replace the drive, you're
going to end up dealing with this again in the future. I hope you've
been doing backups. :-)

You can (and should) also run Seagate's SeaTools for DOS utility on the
drive -- do an extended/long/thorough test (which will test all the
sectors). This is a vendor-specific test which often does things at a
much lower level than even SMART. I'm willing to bet the test fails, or
at least will give you an indication of what you already know. It may
also let you remap the LBAs (I know WD's utility can do this).

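To confirm which of the LBAs from the self-test log still fail, without
writing anything to the disk, a read-only check along these lines may
help. Treat it as a sketch under assumptions, not a tested procedure: the
function name check_lbas is made up for illustration, /dev/ad6 is the
device name used elsewhere in this thread, and 512-byte sectors are
assumed.

```shell
#!/bin/sh
# check_lbas (hypothetical helper): check_lbas DEVICE LBA [LBA...]
# Reads one 512-byte sector at each given LBA and reports whether the
# read succeeded. Read-only: nothing is written to DEVICE.
# Assumption: 512-byte sectors.
check_lbas() {
    dev=$1
    shift
    for lba in "$@"; do
        # skip= counts input blocks of size bs, so skip=$lba seeks to that LBA.
        if dd if="$dev" of=/dev/null bs=512 count=1 skip="$lba" 2>/dev/null; then
            echo "LBA $lba: read OK"
        else
            echo "LBA $lba: read FAILED"
        fi
    done
}

# Example usage (needs root; LBAs taken from the self-test log above):
#   check_lbas /dev/ad6 4007996 102121619 110518042 195230321
```

Any LBA reported as FAILED should also show up as an I/O error in
/var/log/messages at about the same time.
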
That said, here be dragons. I'm not responsible for what happens after
you try this, and I haven't done this in a very VERY long time.

Have you tried writing zeros over the LBAs where the bad blocks are
located? This often will get the drive to attempt a remap. E.g.:

  dd if=/dev/zero of=/dev/ad6 bs=512 count=1 seek={whatever}
  sync

Be sure to note the of= parameter there refers to the entire drive and
not a slice.

If it does work, both Attribute 197 and 198 should change to 0. Be sure
to run "smartctl -t offline /dev/ad6" too, since some Offline attributes
don't always get updated.

Also, your calculation formula earlier contains "-63" which I believe is
due to the offset of the slices. Except in your bsdlabel output, the
"c" slice actually starts at 0, not 63. Are you sure this formula is
correct?

Let me know what becomes of all this, I'm highly interested.

-- 
| Jeremy Chadwick                                   jdc@parodius.com |
| Parodius Networking                       http://www.parodius.com/ |
| UNIX Systems Administrator                  Mountain View, CA, USA |
| Making life hard for others since 1977.              PGP: 4BD6C0CB |