FreeBSD Mail Archives

Date:      Thu, 5 Jul 2018 11:50:41 -0600
From:      Alan Somers <asomers@freebsd.org>
To:        "Rodney W. Grimes" <freebsd-rwg@pdx.rh.cn85.dnsmgr.net>
Cc:        Lev Serebryakov <lev@freebsd.org>, FreeBSD Hackers <freebsd-hackers@freebsd.org>,  George Mitchell <george+freebsd@m5p.com>
Subject:   Re: Confusing smartd messages
Message-ID:  <CAOtMX2hWeUQLbfRgh_qAXMyQESKcf-ntRtW=M1huTWAQta9gKA@mail.gmail.com>
In-Reply-To: <201807051743.w65HhsYb048743@pdx.rh.CN85.dnsmgr.net>
References:  <51eb8232-49a7-0b3a-2d0f-9882ebfbfa1d@FreeBSD.org> <201807051743.w65HhsYb048743@pdx.rh.CN85.dnsmgr.net>

On Thu, Jul 5, 2018 at 11:43 AM, Rodney W. Grimes <
freebsd-rwg@pdx.rh.cn85.dnsmgr.net> wrote:

> > On 05.07.2018 3:03, George Mitchell wrote:
> >
> > > which sounds like it confirms the log message above.  The disk is
> > > part of a zraid pool whose "zpool status" also says everything is
> > > okay.  What's the recommended action at this point?     -- George
> >
> >  In my experience it is the beginning of disk death, even if the overall
> > status is PASSED. It could work for a month or maybe half a year after the
> > first Offline_Uncorrectable is detected (it depends on load), but your best
> > bet is to replace it ASAP and throw it away.
>
> The appearance of pending or offline sector issues indicating
> imminent death should be weighed against drive age.   If the drive
> is young, say less than 100 to 200 hours, I would attribute
> this to marginal sectors at birth of the drive that did not get
> caught during drive manufacture and just get them remapped
> and move on.  Many drives have a special state when the
> hours are <100 in which any sector with a raw read error of more
> than N bits, before ECC is applied, is automatically and
> silently added to the manufacturer's remap table.  A very
> similar thing is used at drive manufacture time to create
> the initial table, basically a "smartctl -t long" that has
> tweaked parameters and logging turned off.
>

The famous Weibull distribution.  I believe the Backblaze reports talk
about it.


>
> If the drive is older than this I would probably attribute
> only 2 such sectors to a one-time event like emergency power-off
> retract, a marginal power situation, or shock or vibration during
> write and not be too concerned.
>
> If the drive grows additional pending/offline sectors I
> would then start to be concerned.  Without any growth
> though these are almost always one-off events that can have
> any of many causes.
>

The OP hasn't watched 100,000 drives age.  Backblaze has.  That's why my
advice is to replace them according to the failure indicators reported by
Backblaze or the manufacturer, without reading too much into the meaning.

-Alan
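
For readers who want the age-weighted rule of thumb quoted above in a
compact form, here is a minimal sketch. It is only an illustration of
Rodney's heuristic, not the advice given in this reply, and not anything
smartd or smartctl does. The function name, the return labels, and the
200-hour default are assumptions chosen for the example; the caller is
assumed to supply the power-on hours and the pending/offline sector counts
from wherever it reads SMART data.

```python
def triage_drive(power_on_hours, bad_sectors, previous_bad_sectors=None,
                 young_hours=200):
    """Classify a drive using the age-weighted heuristic from the thread.

    power_on_hours       -- drive age in power-on hours
    bad_sectors          -- current pending/offline uncorrectable count
    previous_bad_sectors -- count from an earlier check, if one exists
    young_hours          -- assumed cutoff for a "young" drive (the thread
                            says "100 to 200 hours"; 200 is our choice)
    """
    if bad_sectors == 0:
        return "ok"
    # Growth between two checks is the one signal the heuristic treats
    # as a real reason for concern, regardless of drive age.
    if previous_bad_sectors is not None and bad_sectors > previous_bad_sectors:
        return "concerned"
    # Young drive: attribute to marginal sectors missed at manufacture,
    # get them remapped, and move on.
    if power_on_hours < young_hours:
        return "remap"
    # Older drive, no growth seen: likely a one-off event, keep watching.
    return "watch"


if __name__ == "__main__":
    print(triage_drive(50, 2))
    print(triage_drive(5000, 2))
    print(triage_drive(5000, 5, previous_bad_sectors=2))
```

Someone following the advice in this reply instead would skip the age
branches and simply replace the drive once a failure indicator appears.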

Want to link to this message? Use this URL: <https://mail-archive.FreeBSD.org/cgi/mid.cgi?CAOtMX2hWeUQLbfRgh_qAXMyQESKcf-ntRtW=M1huTWAQta9gKA>
