From: "Rodney W. Grimes"
Message-Id: <201807060106.w6616Bs4049980@pdx.rh.CN85.dnsmgr.net>
Subject: Re: Confusing smartd messages
To: Alan Somers
Date: Thu, 5 Jul 2018 18:06:11 -0700 (PDT)
CC: Wojciech Puchar, FreeBSD Hackers, Stefan Blachmann, Lev Serebryakov, George Mitchell
List-Id: Technical Discussions relating to FreeBSD

> On Thu, Jul 5, 2018 at 12:15 PM, Rodney W.
> Grimes <freebsd-rwg@pdx.rh.cn85.dnsmgr.net> wrote:
>
> > > On Thu, Jul 5, 2018 at 11:03 AM, Wojciech Puchar wrote:
> > >
> > > >> Rewriting suspicious sectors is useless in this day and age. HDDs and
> > > >> SSDs already do it internally and have for years. Even healthy sectors get
> > > >>
> > > >
> > > > unreadable sectors cannot be rewritten by drive electronics as it doesn't
> > > > know what to rewrite. it may possibly remap it but still report read error
> > > > until some data will be written - unless giving no error and returning
> > > > meaningless data is an accepted behaviour.
> > > >
> > >
> > > But if that disk is already managed by ZFS, the pool is redundant, and the
> > > bad sector is allocated by ZFS, then ZFS will immediately rewrite the
> > > unreadable sector.
> >
> > ZFS, if it gets a read error, will rewrite the unreadable sector
> > to a DIFFERENT block, not over the top of the bad spot.
> >
>
> Are you sure? For read errors, I think ZFS rewrites the data in-place, so
> it doesn't have to rewrite it on all other members of the same mirror/raid
> group. For persistent write errors of course, it would have to move it to
> a different LBA as you describe.

You're right, I am not sure exactly what happens during a scrub
that finds a checksum error or encounters a low-level device
I/O error. I was wrongly assuming that, given the COW nature
of the whole system, it would never overwrite anything.

I wonder if you can send ZFS into a loop with a sector that
hard-fails on write.

> > > > only on write it can be done properly.
> > > >
> > > > that the HDD/SSD won't fix itself would be a checksum error. Those are
> > > >>
> > > >
> > > > yes and this will happen if you powerdown your disk on write. or get some
> > > > power spike or other source of noise that would affect electronic
> > > > components.
> > > >
> > >
> > > It happens surprisingly rarely.
> > > Even on a sudden power loss, the drive is
> > > usually able to finish its current write operation. When you run into
> > > problems would be if the power loss were coincident with a mechanical shock
> > > that knocks the head off-track, or something like that.
> >
> > I agree that power failures are a rare cause of write errors. To get an
> > idea of how often this might have happened, look at the emergency
> > retract counter; if you're getting lots of those you should try to find
> > out why and stop that. Vibration has become a serious problem though.
> > At today's head flight height drives are sensitive to this; you can
> > even cause a drive to do retries by yelling at it with a loud
> > voice :-) Look at the "high fly" counter to see if you're getting
> > this issue.
> >
> > > > performing full disk rewrite (so not zfs rebuilds) and THEN looking at
> > > > smart stats and THEN performing regular smartctl -t long will tell the
> > > > truth.
> > > >
> > > > which usually is "drive is fine" in my practice. really faulty drive will
> > > > QUICKLY develop new problems.
> > > >
> > >
> > > Yeah, that should make the error go away. It takes a long time, though.
> > > With a SCSI drive, you can get the exact LBAs affected with a "READ
> > > DEFECTS" command. But there isn't a vendor-independent equivalent for
> > > SATA, unfortunately.
> >
> > My bitch exactly about ATA missing this. Though there are vendor specific
> > commands to get it.
> >
> > --
> > Rod Grimes                                      rgrimes@freebsd.org
>

--
Rod Grimes                                      rgrimes@freebsd.org
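
For anyone wanting to watch the two counters mentioned above (emergency retract and "high fly"), here is a rough sketch that pulls them out of the attribute table printed by smartctl -A. It is only an illustration: the attribute names Power-Off_Retract_Count and High_Fly_Writes are the labels smartmontools commonly prints, but they vary by vendor and drive database, so treat them as assumptions. The SAMPLE text is made up for the example and is not output from any real drive, and the function names are hypothetical.

```python
# Sketch: extract selected raw counters from `smartctl -A` style text.
# Attribute names below are assumptions; check what your drive reports.

WATCHED = ("Power-Off_Retract_Count", "High_Fly_Writes")

# Made-up sample in the usual 10-column layout (not from a real drive).
SAMPLE = """\
ID# ATTRIBUTE_NAME          FLAG     VALUE WORST THRESH TYPE      UPDATED  WHEN_FAILED RAW_VALUE
  9 Power_On_Hours          0x0032   090   090   000    Old_age   Always       -       8760
189 High_Fly_Writes         0x003a   100   100   000    Old_age   Always       -       0
192 Power-Off_Retract_Count 0x0032   100   100   000    Old_age   Always       -       12
"""


def parse_smart_attributes(text):
    """Return {attribute_name: raw_value} for each attribute row in text."""
    attrs = {}
    for line in text.splitlines():
        fields = line.split()
        # Attribute rows start with a numeric ID and have at least 10 columns.
        if len(fields) < 10 or not fields[0].isdigit():
            continue
        try:
            attrs[fields[1]] = int(fields[9])
        except ValueError:
            # Raw value is not a plain integer; skip rather than guess.
            continue
    return attrs


def watched_counters(text, names=WATCHED):
    """Return only the watched counters that are present in the table."""
    attrs = parse_smart_attributes(text)
    return {name: attrs[name] for name in names if name in attrs}


if __name__ == "__main__":
    for name, raw in watched_counters(SAMPLE).items():
        print(f"{name}: {raw}")
```

In practice you would feed it the text captured from smartctl -A for the device in question and compare the counters between runs; a value that keeps climbing is the thing to chase, not the absolute number.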