From owner-freebsd-hackers@freebsd.org Thu Jul 5 17:43:57 2018
From: "Rodney W. Grimes"
Message-Id: <201807051743.w65HhsYb048743@pdx.rh.CN85.dnsmgr.net>
Subject: Re: Confusing smartd messages
In-Reply-To: <51eb8232-49a7-0b3a-2d0f-9882ebfbfa1d@FreeBSD.org>
To: lev@freebsd.org
Date: Thu, 5 Jul 2018 10:43:54 -0700 (PDT)
CC: George Mitchell, FreeBSD Hackers
List-Id: Technical Discussions relating to FreeBSD

> On 05.07.2018 3:03, George Mitchell wrote:
>
> > which sounds like it confirms the log message above. The disk is
> > part of a zraid pool whose "zpool status" also says everything is
> > okay.
> > What's the recommended action at this point? -- George
>
> In my experience it is begin of disk death, even if overall status is
> PASSED. It could work for month or may be half a year after first
> Offline_Uncorrectable is detected (it depends on load), but you best bet
> to replace it ASAP and throw away.

Whether pending or offline sector issues indicate imminent death
should be weighted by drive age. If the drive is young, say less than
100 to 200 hours, I would attribute this to sectors that were marginal
at the birth of the drive and did not get caught during manufacture,
and I would just get them remapped and move on.

Many drives have a special state while the power-on hours are below
100, in which any raw read error with more than N bits in error,
before ECC is applied, causes the sector to be added automatically and
silently to the manufacturer's remap table. A very similar thing is
used at drive manufacture time to create the initial table, basically
a "smartctl -t long" that has tweaked parameters and logging turned
off.

If the drive is older than this, I would probably attribute a count of
only 2 to a one-time event such as an emergency power-off retract, a
marginal power situation, or shock or vibration during a write, and
not be too concerned.

If the drive grows additional pending/offline sectors, I would then
start to be concerned. Without any growth, though, these are almost
always one-off events with any of many possible causes.

-- 
Rod Grimes                                                 rgrimes@freebsd.org
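
The weighting described above can be sketched as a small triage
function. This is only an illustration of the reasoning, not a tool
anyone in this thread uses: the function name, the return labels, and
the 200-hour cutoff (the upper end of the "100 to 200 hours" range)
are assumptions, and the caller is assumed to have obtained the
power-on hours and the pending/offline sector counts from two
successive readings of the drive's SMART attributes. The message gives
no threshold for a large first count, so the sketch looks only at age
and growth.

```python
# Assumed cutoff: upper end of the "100 to 200 hours" range in the message.
YOUNG_DRIVE_HOURS = 200


def triage(power_on_hours: int, previous_bad: int, current_bad: int) -> str:
    """Classify a pending/offline sector count by drive age and growth.

    previous_bad and current_bad are the bad-sector counts from an
    earlier and a later SMART reading (hypothetical inputs).
    """
    if current_bad == 0:
        return "ok"
    if power_on_hours < YOUNG_DRIVE_HOURS:
        # Young drive: likely marginal sectors missed at manufacture.
        return "remap-and-move-on"
    if previous_bad > 0 and current_bad > previous_bad:
        # The count grew after it was first seen: start to be concerned.
        return "concern"
    # Older drive, first appearance or no growth: likely a one-off event.
    return "watch"


if __name__ == "__main__":
    print(triage(50, 0, 2))      # young drive
    print(triage(5000, 0, 2))    # older drive, first appearance
    print(triage(5000, 2, 5))    # older drive, count has grown
```

Comparing against the previous reading rather than a fixed limit
follows the point made above that growth, not the mere presence of a
couple of bad sectors, is what should raise concern.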