Date: Fri, 27 Mar 2020 10:45:55 +0100
From: Polytropon <freebsd@edvax.de>
To: Daniel Feenberg <feenberg@nber.org>
Cc: Bob Proulx <bob@proulx.com>, freebsd-questions@freebsd.org
Subject: Re: drive selection for disk arrays
Message-ID: <20200327104555.1d6d7cd9.freebsd@edvax.de>
In-Reply-To: <alpine.BSF.2.21.9999.2003261630030.47777@mail2.nber.org>
References: <20200325081814.GK35528@mithril.foucry.net>
 <713db821-8f69-b41a-75b7-a412a0824c43@holgerdanske.com>
 <20200326124648725158537@bob.proulx.com>
 <alpine.BSF.2.21.9999.2003261630030.47777@mail2.nber.org>

On Thu, 26 Mar 2020 16:37:58 -0400 (EDT), Daniel Feenberg wrote:
>
> The disturbing frequency of multiple drives going offline in quick
> succession is, in my view, largely a result of defects being discovered in
> quick succession, rather than occurring in quick succession. If a defect
> occurs in a sector that is rarely visited it can remain hidden for a long
> time. During a resilver that defect will be noticed and the drive failed
> out. I do think that is an overly aggressive action by the resilvering
> process, as that may be the only bad sector, it may be possible to recover
> all the data from the remaining drives (if the first failing drive can
> read the appropriate sector), and that sector may not even be in an active
> file.

I'd like to mention something in this context: When a drive
_reports_ bad sectors, at least in the past it was an indication
that it already _has_ lots of them. The drive's firmware remaps
bad sectors to spare sectors, so up to that point the OS sees
"no error". When errors are reported "upwards" (a "read error"
or "write error" visible to the OS), it's a sign that the disk
has run out of spare sectors, and the firmware can no longer
silently remap _new_ bad sectors...

Is this still the case with modern drives?

How transparently can ZFS handle drive errors when the drives
only report the "top results" (i.e., when they cannot cope with
bad sectors internally anymore)? Do SMART tools help here, for
example, by reading certain firmware-provided values that
indicate how many sectors _actually_ have been marked as bad
and remapped internally, but _not_ yet reported to the
controller / disk I/O subsystem / filesystem? This should be a
good indicator of "will fail soon", so a replacement can be done
before data loss or other problems appear.

-- 
Polytropon
Magdeburg, Germany
Happy FreeBSD user since 4.0
Andra moi ennepe, Mousa, ...
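
P.S. To make the SMART question above concrete, here is a minimal
sketch of the kind of check I have in mind. It assumes the drive
exposes the commonly seen ATA attributes 5 (Reallocated_Sector_Ct),
197 (Current_Pending_Sector) and 198 (Offline_Uncorrectable) in the
table printed by "smartctl -A". Vendors are free to omit or
reinterpret these, so treat the IDs as an assumption. The function
names and the SAMPLE text are made up for illustration and are not
output from a real drive.

```python
import re

# Attribute IDs commonly (but not universally) used by ATA drives.
# Assumption: the drive reports them with these meanings.
WATCHED = {
    5: "Reallocated_Sector_Ct",
    197: "Current_Pending_Sector",
    198: "Offline_Uncorrectable",
}


def parse_smart_attributes(text):
    """Return {attribute_id: raw_value} for the watched attributes
    found in `smartctl -A` style table output."""
    result = {}
    for line in text.splitlines():
        fields = line.split()
        # Attribute rows have at least 10 columns and start with a numeric ID.
        if len(fields) < 10 or not fields[0].isdigit():
            continue
        attr_id = int(fields[0])
        if attr_id not in WATCHED:
            continue
        # RAW_VALUE is the 10th column; keep only its leading integer.
        match = re.match(r"\d+", fields[9])
        if match:
            result[attr_id] = int(match.group())
    return result


def needs_attention(counts):
    """True if any watched raw counter is non-zero."""
    return any(value > 0 for value in counts.values())


# Illustrative sample only, not taken from a real drive.
SAMPLE = """\
ID# ATTRIBUTE_NAME          FLAG     VALUE WORST THRESH TYPE      UPDATED  WHEN_FAILED RAW_VALUE
  5 Reallocated_Sector_Ct   0x0033   100   100   010    Pre-fail  Always       -       24
  9 Power_On_Hours          0x0032   093   093   000    Old_age   Always       -       31234
197 Current_Pending_Sector  0x0012   100   100   000    Old_age   Always       -       3
198 Offline_Uncorrectable   0x0010   100   100   000    Old_age   Offline      -       0
"""

if __name__ == "__main__":
    counts = parse_smart_attributes(SAMPLE)
    for attr_id, raw in sorted(counts.items()):
        print(f"{attr_id:3d} {WATCHED[attr_id]:<24} {raw}")
    print("needs attention:", needs_attention(counts))
```

On a real system the text would come from something like
"smartctl -A /dev/ada0" (the device name is an example). Whether a
non-zero raw count really means "will fail soon" is exactly the
open question, so the zero threshold here is only a placeholder.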