Date: Thu, 5 Jul 2018 10:39:45 -0600
From: Alan Somers <asomers@freebsd.org>
To: Stefan Blachmann <sblachmann@gmail.com>
Cc: Wojciech Puchar <wojtek@puchar.net>, FreeBSD Hackers <freebsd-hackers@freebsd.org>, George Mitchell <george+freebsd@m5p.com>, Lev Serebryakov <lev@freebsd.org>
Subject: Re: Confusing smartd messages
Message-ID: <CAOtMX2gG48jzWkPg3kGpSVDC89KY14ta3p-U+O5yExHZJfNL7w@mail.gmail.com>
In-Reply-To: <CACc-My36jbL=WWpxOB24D_YLDMofSHAk9JgrP86LKd4MEct1mg@mail.gmail.com>
References: <dfccd275-954c-11da-1790-e75878f89ad1@m5p.com> <51eb8232-49a7-0b3a-2d0f-9882ebfbfa1d@FreeBSD.org> <alpine.BSF.2.20.1807051642090.17082@puchar.net> <CACc-My36jbL=WWpxOB24D_YLDMofSHAk9JgrP86LKd4MEct1mg@mail.gmail.com>
My advice to the OP is to chill out. SMART is inconsistently implemented by different drive vendors, and its output is very hard to interpret. I would only recommend replacing a drive based on its SMART status for one of two reasons:

1) The drive is under warranty and the vendor agrees to a free replacement based on the SMART output alone. The vendors know the meaning of their own SMART fields better than you do.

2) A large statistical dataset shows that this particular SMART field is correlated with early failure for your model of hard drive (or at least a similar model). Backblaze maintains one such dataset, which they periodically publish on their blog. There are a few other, outdated datasets in the academic literature: one from AOL and several from supercomputer operators. But Backblaze's is the best because a) it's current, b) it's large, and c) they have a very diverse set of hard drives. Still, even Backblaze can sound a little superstitious (they replace an entire chassis once several of its drives have had SMART problems).
https://www.backblaze.com/blog/hard-drive-reliability-q1-2015/

If the drive is not RMAable and you're nervous because you love your data, then you might consider setting up a hotspare. zfsd(8) will activate it the moment one of your current drives fails. You can even configure the hotspare to be spun down most of the time so it won't be affected by the mechanical shocks or regular wear that the live drives endure. (Rough command sketches appear at the end of this message.)

Rewriting suspicious sectors is useless in this day and age. HDDs and SSDs already do it internally and have for years. Even healthy sectors get rewritten every now and then because of the adjacent-track-interference problem. About the only kind of problem that could develop on a track that the HDD/SSD won't fix itself is a checksum error. Those are very rare, and ZFS will fix them immediately.

-Alan "too well versed in hard drive reliability for my own good" Somers

On Thu, Jul 5, 2018 at 10:11 AM, Stefan Blachmann <sblachmann@gmail.com> wrote:
> Another issue is that flash memories also exhibit the charge-drain
> problem. They cannot be read indefinitely without an occasional
> rewrite, as every read drains a minuscule amount of the charge.
>
> I often wished I knew of some OS/driver function/mechanism that could
> rewrite (refresh) media on a mounted, running system and could be run,
> for example, via cron.
>
> That would not only be very useful for fixing pending sectors without
> stopping a running machine, but also for keeping embedded machines'
> flash memories reliably charged over the years.
>
> On 7/5/18, Wojciech Puchar <wojtek@puchar.net> wrote:
> >>> okay. What's the recommended action at this point? -- George
> >>
> >> In my experience it is the beginning of disk death, even if the
> >> overall status is PASSED. It could work for a month or maybe half a
> >> year after the first Offline_Uncorrectable is detected (it depends on
> >> load), but your best bet is to replace it ASAP and throw it away.
> >
> > Well, my disk had this and lived happily for 3 years.
> >
> > It JUST means that some sectors are unreadable, which may mean that at
> > some point a write went wrong because of a hardware problem. But that
> > problem may have been - and possibly was - a power-down while writing,
> > or a power spike.
> >
> > The media itself could be fine. The best action in such a case is to
> > force a rewrite of the whole drive with some data.
> >
> > With gmirror it is as easy as first checking the second drive for
> > errors, then forcing a remirror.
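
If you want to compare your own drives against a dataset like Backblaze's, the raw attribute values come from smartctl(8). A minimal sketch, assuming an ATA disk at /dev/ada0 (the device name is just a placeholder):

  # print the vendor-specific SMART attribute table (raw values and thresholds)
  smartctl -A /dev/ada0
  # or the full report: identity, health assessment, attributes, and error log
  smartctl -a /dev/ada0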
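
As for the hot-spare setup mentioned above, a minimal sketch, assuming a pool named "tank" and a spare disk at da4 (both names are placeholders for your own):

  # add the disk to the pool as a hot spare
  zpool add tank spare da4
  # enable and start zfsd(8) so the spare is attached automatically on failure
  sysrc zfsd_enable="YES"
  service zfsd start
  # optionally spin the spare down until it is actually needed
  camcontrol standby da4

zfsd(8) watches device events and will kick in the spare when a pool member faults; once the bad disk has been replaced, the spare can be returned to idle with "zpool detach".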
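
And for the gmirror remirror that Wojciech describes, roughly (the mirror and provider names are examples; check gmirror status for your own):

  # confirm the mirror is otherwise healthy before touching anything
  gmirror status gm0
  # force a full resynchronization of the suspect component
  gmirror rebuild gm0 ada1

The rebuild rewrites that component end to end from the good disk, which gives the whole-drive rewrite he is talking about.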