From: Cy Schubert
Reply-To: Cy Schubert
To: Eugene Grosbein
Cc: George Mitchell, FreeBSD Hackers
Subject: Re: Confusing smartd messages
Date: Wed, 04 Jul 2018 18:07:43 -0700
Message-Id: <201807050107.w6517hug056380@slippy.cwsent.com>
In-Reply-To: <5B3D6975.2060508@grosbein.net>
List-Id: Technical Discussions relating to FreeBSD

In message <5B3D6975.2060508@grosbein.net>, Eugene Grosbein writes:
> On 05.07.2018 7:03, George Mitchell wrote:
> > Every thirty minutes, smartd is telling me:
> >
> > Device: /dev/ada1, 2 Currently unreadable (pending) sectors
> > Device: /dev/ada1, 2 Offline uncorrectable sectors
> >
> > smartctl -a /dev/ada1 seems to be reassuring me that everything is
> > fine (SMART overall-health self-assessment test result: PASSED),
>
> If that would say FAILED, you should be replacing the disk immediately.
> PASSED does not mean it has no problems, but problems are not fatal (yet).
>
> > though it also says:
> >
> > 197 Current_Pending_Sector  0x0032  200  200  000  Old_age  Always   -  2
> > 198 Offline_Uncorrectable   0x0030  200  200  000  Old_age  Offline  -  2
> >
> > which sounds like it confirms the log message above. The disk is
> > part of a zraid pool whose "zpool status" also says everything is
> > okay. What's the recommended action at this point? -- George
>
> You need to force the disk performing rewrite of those two bad sectors.
> There is a possibility they are just an example of "soft bad" and in that event
> the problem will just disappear without new remaps, that would be best possible case.
>
> Or two sectors could happen really bad and remap will "fix" (really hide) the problem,
> in that case you should be ready for possible increasing number of bad sectors
> and have a replacement handy.
>
> First step is running zpool scrub or even replace the disk and run
> "dd if=/dev/zero of=/dev/ada1".

A better option would be to determine which blocks had the issue. Then use

    dd if=/dev/ada1 of=/dev/ada1 iseek= oseek= count=

Alternatively you can

    dd_rescue -d -s -S /dev/ada1 /dev/ada1

Failing that, dd_rescue the whole device. Make sure your zpool has been
exported first. If "repairing" a UFS root filesystem, use single user mode
or the machine will panic; there is no loss of data, it's just a PITA.

Rewriting the device onto itself this way avoids loss of data.

Ideally your best bet would be to back up the data and write zeros, ones,
and some random data. This "exercises" each sector such that there is less
chance of having the same magnetic transitions interfering with each
other. The reason is that an actuator never writes to exactly the same
area of disk because of variations in actuator movement. Phantom
transitions have a slight chance of having an effect.

Finally, if after going through this exercise the bad sectors are not
remapped, or clear up only to show up as bad again later, then replace the
disk. Of course, if your data is critically important then replace the
disk right away. You don't know how quickly your disk is aging or
deteriorating until it's too late.

On the positive side, I've been able to resurrect many disks this way. If
the disk is in a critical server (my main machine or firewall) I replace
it immediately, moving the one experiencing errors to a testbed machine,
one where I don't mind losing data as it's easily reproduced or replicated
from the main machine.
Many times the flaky disks then run in my testbed for years without
complaint before dying. YMMV

-- 
Cheers,
Cy Schubert
FreeBSD UNIX:  Web:  http://www.FreeBSD.org

    The need of the many outweighs the greed of the few.
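
As a footnote to the in-place dd suggestion in the message: a minimal
dry-run sketch that only prints the command for a given sector range
rather than running it. The function name rewrite_cmd, the 512-byte
default sector size and the example LBA are assumptions for illustration,
not values from this thread. The starting LBA has to come from elsewhere,
for example the LBA_of_first_error column that "smartctl -l selftest"
reports after a failed long self-test.

```shell
#!/bin/sh
# Dry-run helper: print (do not execute) a dd command that rewrites a
# range of sectors in place, in the form suggested in the message.
# Assumptions (not from the original message): 512-byte logical sectors
# by default, and a starting LBA obtained separately by the reader.

rewrite_cmd() {
    dev=$1          # device node, e.g. /dev/ada1
    lba=$2          # first sector to rewrite (hypothetical value)
    count=${3:-1}   # number of sectors, defaults to 1
    bs=${4:-512}    # logical sector size in bytes, defaults to 512

    # Reject anything that is not a plain non-negative integer.
    for n in "$lba" "$count" "$bs"; do
        case $n in
            ''|*[!0-9]*) echo "not a number: $n" >&2; return 1 ;;
        esac
    done

    # iseek/oseek are FreeBSD dd's synonyms for skip/seek, counted in
    # units of bs.
    echo "dd if=$dev of=$dev bs=$bs iseek=$lba oseek=$lba count=$count"
}
```

For example, "rewrite_cmd /dev/ada1 123456 8" prints

    dd if=/dev/ada1 of=/dev/ada1 bs=512 iseek=123456 oseek=123456 count=8

Printing instead of executing leaves a chance to check the device name
and offsets by eye, and to export the pool, before anything touches the
disk.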