From owner-freebsd-hackers@freebsd.org  Thu Jul  5 01:07:58 2018
Return-Path: <owner-freebsd-hackers@freebsd.org>
Delivered-To: freebsd-hackers@mailman.ysv.freebsd.org
Received: from mx1.freebsd.org (mx1.freebsd.org [IPv6:2610:1c1:1:606c::19:1])
 by mailman.ysv.freebsd.org (Postfix) with ESMTP id C14B010312CA
 for <freebsd-hackers@mailman.ysv.freebsd.org>;
 Thu,  5 Jul 2018 01:07:58 +0000 (UTC)
 (envelope-from cy.schubert@cschubert.com)
Received: from smtp-out-no.shaw.ca (smtp-out-no.shaw.ca [64.59.134.12])
 (using TLSv1.2 with cipher ECDHE-RSA-AES256-GCM-SHA384 (256/256 bits))
 (Client CN "Client", Issuer "CA" (not verified))
 by mx1.freebsd.org (Postfix) with ESMTPS id 4F7797D1F7
 for <freebsd-hackers@freebsd.org>; Thu,  5 Jul 2018 01:07:58 +0000 (UTC)
 (envelope-from cy.schubert@cschubert.com)
Received: from spqr.komquats.com ([70.67.125.17]) by shaw.ca with ESMTPA
 id askLfpR5POFAwaskMfNSTc; Wed, 04 Jul 2018 19:07:51 -0600
Received: from slippy.cwsent.com (slippy [10.1.1.91])
 by spqr.komquats.com (Postfix) with ESMTPS id 26A684F2;
 Wed,  4 Jul 2018 18:07:45 -0700 (PDT)
Received: from slippy.cwsent.com (localhost [127.0.0.1])
 by slippy.cwsent.com (8.15.2/8.15.2) with ESMTP id w6517iRA056523;
 Wed, 4 Jul 2018 18:07:44 -0700 (PDT)
 (envelope-from Cy.Schubert@cschubert.com)
Received: from slippy (cy@localhost)
 by slippy.cwsent.com (8.15.2/8.15.2/Submit) with ESMTP id w6517hug056380;
 Wed, 4 Jul 2018 18:07:43 -0700 (PDT)
 (envelope-from Cy.Schubert@cschubert.com)
Message-Id: <201807050107.w6517hug056380@slippy.cwsent.com>
X-Authentication-Warning: slippy.cwsent.com: cy owned process doing -bs
X-Mailer: exmh version 2.8.0 04/21/2012 with nmh-1.7.1
Reply-to: Cy Schubert <Cy.Schubert@cschubert.com>
From: Cy Schubert <Cy.Schubert@cschubert.com>
X-os: FreeBSD
X-Sender: cy@cwsent.com
X-URL: http://www.cschubert.com/
To: Eugene Grosbein <eugen@grosbein.net>
cc: George Mitchell <george+freebsd@m5p.com>,
 FreeBSD Hackers <freebsd-hackers@FreeBSD.org>
Subject: Re: Confusing smartd messages
In-Reply-To: Message from Eugene Grosbein <eugen@grosbein.net>
 of "Thu, 05 Jul 2018 07:42:29 +0700." <5B3D6975.2060508@grosbein.net>
Mime-Version: 1.0
Content-Type: text/plain; charset=iso-8859-1
Content-Transfer-Encoding: 8bit
Date: Wed, 04 Jul 2018 18:07:43 -0700
X-BeenThere: freebsd-hackers@freebsd.org
X-Mailman-Version: 2.1.27
Precedence: list
List-Id: Technical Discussions relating to FreeBSD
 <freebsd-hackers.freebsd.org>
List-Unsubscribe: <https://lists.freebsd.org/mailman/options/freebsd-hackers>, 
 <mailto:freebsd-hackers-request@freebsd.org?subject=unsubscribe>
List-Archive: <http://lists.freebsd.org/pipermail/freebsd-hackers/>
List-Post: <mailto:freebsd-hackers@freebsd.org>
List-Help: <mailto:freebsd-hackers-request@freebsd.org?subject=help>
List-Subscribe: <https://lists.freebsd.org/mailman/listinfo/freebsd-hackers>, 
 <mailto:freebsd-hackers-request@freebsd.org?subject=subscribe>
X-List-Received-Date: Thu, 05 Jul 2018 01:07:59 -0000

In message <5B3D6975.2060508@grosbein.net>, Eugene Grosbein writes:
> 05.07.2018 7:03, George Mitchell wrote:
> > Every thirty minutes, smartd is telling me:
> > 
> > Device: /dev/ada1, 2 Currently unreadable (pending) sectors
> > Device: /dev/ada1, 2 Offline uncorrectable sectors
> > 
> > smartctl -a /dev/ada1 seems to be reassuring me that everything is
> > fine (SMART overall-health self-assessment test result: PASSED),
>
> If that would say FAILED, you should be replacing the disk immediately.
> PASSED does not mean it has no problems, but problems are not fatal (yet).
>
> > though it also says:
> > 
> > 197 Current_Pending_Sector  0x0032   200   200   000    Old_age   Always       -       2
> > 198 Offline_Uncorrectable   0x0030   200   200   000    Old_age   Offline      -       2
> > 
> > which sounds like it confirms the log message above.  The disk is
> > part of a zraid pool whose "zpool status" also says everything is
> > okay.  What's the recommended action at this point?     -- George
>
> You need to force the disk to rewrite those two bad sectors.
> There is a possibility they are just an example of "soft bad", and in that
> event the problem will just disappear without new remaps; that would be the
> best possible case.
>
> Or the two sectors could turn out to be really bad, and the remap will "fix"
> (really hide) the problem; in that case you should be ready for a possibly
> increasing number of bad sectors and have a replacement handy.
>
> First step is running zpool scrub or even replace the disk and run
> "dd if=/dev/zero of=/dev/ada1".

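A better option is to determine which blocks have the problem, then
rewrite just those blocks in place:

    dd if=/dev/ada1 of=/dev/ada1 iseek=<bad block> oseek=<bad block> \
       count=<number of bad blocks>

The offsets and the count are in units of dd's block size (512 bytes
unless you set bs=).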
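To make the in-place rewrite concrete, here is a minimal sketch that
runs against a scratch file instead of a disk. rewrite_blocks is a
hypothetical helper name, not an existing tool. It uses skip= and seek=,
which FreeBSD's dd documents as synonyms for iseek= and oseek=, and it
adds conv=notrunc, which only matters when the target is a regular file.

```shell
#!/bin/sh
# Hypothetical helper: read COUNT blocks starting at block LBA from TARGET
# and write them straight back to the same offset.
# usage: rewrite_blocks TARGET LBA COUNT [BS]
rewrite_blocks() {
    target=$1 lba=$2 count=$3 bs=${4:-512}
    # skip=/seek= are the portable spellings of iseek=/oseek=.
    # conv=notrunc stops dd from truncating a regular file at the seek offset.
    dd if="$target" of="$target" bs="$bs" skip="$lba" seek="$lba" \
       count="$count" conv=notrunc 2>/dev/null
}

# Demo on a 16-block scratch image (assumption: never test on a live disk).
img=$(mktemp)
dd if=/dev/urandom of="$img" bs=512 count=16 2>/dev/null
before=$(cksum < "$img")
rewrite_blocks "$img" 5 2
after=$(cksum < "$img")
[ "$before" = "$after" ] && echo "contents unchanged"
rm -f "$img"
```

On a real drive the point is the write, not the copy: writing the sector
back is what gives the firmware the chance to clear or remap it.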

Alternatively, you can use dd_rescue:

    dd_rescue -d -s <input block #> -S <output block #> /dev/ada1 /dev/ada1

Failing that, dd_rescue the whole device. Make sure your zpool has been
exported first. If you are "repairing" a UFS root filesystem, do it in
single user mode or the machine will panic; no data is lost, but it's a
PITA.


All of the above rewrite the data in place, so they avoid the loss of
data you get from zeroing the whole disk.

Ideally, your best bet is to back up the data and then overwrite the
disk with zeros, then ones, then some random data. This "exercises"
each sector, so there is less chance of old magnetic transitions
interfering with new ones. The reason is that the drive never writes to
exactly the same area of the disk twice, because of variations in
actuator movement, so phantom transitions left over from earlier writes
have a slight chance of having an effect.
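The three passes can be sketched as below, again on a scratch file.
fill_pass and its pattern names are hypothetical, made up for this
illustration. Pointed at a real device it destroys everything on it, so
it only makes sense after the backup. The final dd uses obs= rather than
bs= so that data arriving through the pipe is regrouped into full-size
output blocks.

```shell
#!/bin/sh
# Hypothetical helper: overwrite the first BLOCKS blocks of TARGET with one
# pattern. DESTRUCTIVE when TARGET is a real device.
# usage: fill_pass TARGET BLOCKS PATTERN [BS]    PATTERN: zeros | ones | random
fill_pass() {
    target=$1 blocks=$2 pattern=$3 bs=${4:-512}
    case $pattern in
        zeros|ones) src=/dev/zero ;;
        random)     src=/dev/urandom ;;
        *)          echo "unknown pattern: $pattern" >&2; return 1 ;;
    esac
    # For "ones", translate every NUL byte to 0xFF; otherwise pass through.
    dd if="$src" bs="$bs" count="$blocks" 2>/dev/null |
        if [ "$pattern" = ones ]; then LC_ALL=C tr '\000' '\377'; else cat; fi |
        dd of="$target" obs="$bs" conv=notrunc 2>/dev/null
}

# Demo: three passes over an 8-block scratch image.
img=$(mktemp)
for p in zeros ones random; do
    fill_pass "$img" 8 "$p"
done
rm -f "$img"
```

For a whole disk, BLOCKS would be the device size divided by BS, and a
much larger BS than 512 keeps the run time reasonable.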

Finally, if after going through this exercise the bad sectors are not
remapped, or they clear up only to show up as bad again later, replace
the disk. Of course, if your data is critically important, replace the
disk right away. You won't know how quickly your disk is aging or
deteriorating until it's too late.
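One way to check whether the counts went back to zero is to pull the raw
value out of the attribute table. This is a small sketch: smart_raw is a
hypothetical helper, and it assumes the raw value is the last column, as
it is in the two attribute lines quoted earlier in this thread.

```shell
#!/bin/sh
# Hypothetical helper: print the raw value (last column) of SMART attribute
# ID, reading `smartctl -A` style output on stdin.
# usage: smartctl -A /dev/ada1 | smart_raw 197
smart_raw() {
    awk -v id="$1" '$1 == id { print $NF }'
}

# Demo with the two attribute lines from this thread; prints 2.
smart_raw 197 <<'EOF'
197 Current_Pending_Sector  0x0032   200   200   000    Old_age   Always       -       2
198 Offline_Uncorrectable   0x0030   200   200   000    Old_age   Offline      -       2
EOF
```

A value of 0 for both 197 and 198 after the rewrite is the good outcome;
if either stays nonzero or climbs again, it's time for the replacement.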

On the positive side, I've been able to resurrect many disks this way.
In a critical server (my main machine or firewall) I replace the disk
immediately and move the one experiencing errors to a testbed machine,
one where I don't mind losing data because it's easily reproduced or
replicated from the main machine. Many times the flaky disks run in my
testbed for years without complaint before dying.

YMMV


-- 
Cheers,
Cy Schubert <Cy.Schubert@cschubert.com>
FreeBSD UNIX:  <cy@FreeBSD.org>   Web:  http://www.FreeBSD.org

	The need of the many outweighs the greed of the few.