From owner-freebsd-fs@freebsd.org  Thu Mar 29 17:49:21 2018
Return-Path: <owner-freebsd-fs@freebsd.org>
Delivered-To: freebsd-fs@mailman.ysv.freebsd.org
Received: from mx1.freebsd.org (mx1.freebsd.org [IPv6:2610:1c1:1:606c::19:1])
 by mailman.ysv.freebsd.org (Postfix) with ESMTP id BC896F6739F
 for <freebsd-fs@mailman.ysv.freebsd.org>; Thu, 29 Mar 2018 17:49:21 +0000 (UTC)
 (envelope-from wlosh@bsdimp.com)
Received: from mail-it0-x22e.google.com (mail-it0-x22e.google.com
 [IPv6:2607:f8b0:4001:c0b::22e])
 (using TLSv1.2 with cipher ECDHE-RSA-AES128-GCM-SHA256 (128/128 bits))
 (Client CN "smtp.gmail.com",
 Issuer "Google Internet Authority G2" (verified OK))
 by mx1.freebsd.org (Postfix) with ESMTPS id 449747ACE3
 for <freebsd-fs@freebsd.org>; Thu, 29 Mar 2018 17:49:21 +0000 (UTC)
 (envelope-from wlosh@bsdimp.com)
Received: by mail-it0-x22e.google.com with SMTP id r19-v6so9179997itc.0
 for <freebsd-fs@freebsd.org>; Thu, 29 Mar 2018 10:49:21 -0700 (PDT)
DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed;
 d=bsdimp-com.20150623.gappssmtp.com; s=20150623;
 h=mime-version:sender:in-reply-to:references:from:date:message-id
 :subject:to:cc;
 bh=eB1mgsp1sun/EQ2JVDwrniEKfhqrBiXUgclACkkkyjg=;
 b=qtq1VfszFFmRqRa2tqS+HgzkTnRYHe4lM8twoL3/skb7g2wzGreQA56ngxvONtsW+G
 XtyGvVFq6F3KQbttfSx8qfNOEdZUojWlSVdiZyV93/R+o4Guw03SM4okwj6e9So0oovv
 m7oRMyeQTNAddeWcRhJOJNZl1KoKkHcbSFpP14dqQR9EZkem6RhyTzgjfDo2NmqshfHL
 cUaBqlghkuUCEj6itaA71H9ixdujz58tL+lzd2M36Wft+M9gwAhwZZ+S5vHhPNwKbQQ1
 om8y1zFW6MTuRXgJADO2hPlk/L3KzL07S2R6D/QJsZ/9coUfnX9clwADteeJCk1cwbkS
 GU3g==
X-Google-DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed;
 d=1e100.net; s=20161025;
 h=x-gm-message-state:mime-version:sender:in-reply-to:references:from
 :date:message-id:subject:to:cc;
 bh=eB1mgsp1sun/EQ2JVDwrniEKfhqrBiXUgclACkkkyjg=;
 b=CjMtS/34RCTIlcevCLkZIFGDp0+Cgg/Xz6pRTGmqgCFsfCW/hGMhTDqcyLnWyI/WIB
 xJiHrBXGQepwMiRP5CukcuJF9RoCec89/sNtBfhoUuu3aegByJzZvpGtuFXMWoR6dAQY
 /jd21OaiDdNY4NoR79A1USO8QWwQxn4LS3X3iAk0/375UnjzBPfF4BckqOlb/q3Nehxw
 2pPpePAqn1BBd8pt9aBE0jHX/0Jz742SASlOrrEzVXJkDKPMe64ziju8xztRBoroNwQd
 OBUhhrTByk+kQoGDzoxFtagqWICfubYK0NxFND/P+XPWQTS7lafu21wjmUGvHhE2EHEQ
 GC+g==
X-Gm-Message-State: AElRT7GkJ01hs/tIBKj1XQANUD5Lr/KKL0sUsIB4hm+F416kt+nEBg9L
 js3ElJqmGsNGs/8/gvm9gtvuddTHoFhg7idjAdjlzw==
X-Google-Smtp-Source: AIpwx4/MlyJDYqHe9p4ERX6fzh6ydGKLoDV3FPvYkpdo+7wXJ1t2bAgjfRGZx7aPiOuoW7VGI3pkVDP0B1ZbVPUq4zk=
X-Received: by 2002:a24:b649:: with SMTP id d9-v6mr8847921itj.51.1522345760497; 
 Thu, 29 Mar 2018 10:49:20 -0700 (PDT)
MIME-Version: 1.0
Sender: wlosh@bsdimp.com
Received: by 10.79.203.196 with HTTP; Thu, 29 Mar 2018 10:49:19 -0700 (PDT)
X-Originating-IP: [2603:300b:6:5100:1052:acc7:f9de:2b6d]
In-Reply-To: <21F62A27-17F2-4791-BFD5-99057D197E68@bway.net>
References: <4754cb2f-76bb-a69b-0cf5-eff4d621eb29@callfortesting.org>
 <CAMXt9NbdN119RrHnZHOJD1T+HNLLpzgkKVStyTm=49dopBMoAQ@mail.gmail.com>
 <CAM0tzX1oTWTa0Nes11yXg5x4c30MmxdUyT6M1_c4-PWv2+Qbhw@mail.gmail.com>
 <CAMXt9NYMrtTNqNSx256mcYsPo48xnsa+CCYSoeFLzRsc+fQWMw@mail.gmail.com>
 <CAM0tzX32v2-=saT5iB4WVcsoVOtH+XE0OQoP7hEDB1xE+xk+sg@mail.gmail.com>
 <1d3f2cef-4c37-782e-7938-e0a2eebc8842@quip.cz>
 <A548BC90-815C-4C66-8E27-9A6F7480741D@bway.net>
 <7ED27465-1BC2-4522-873E-9ECE192EB7A2@ultra-secure.de>
 <e54ab9a7-835d-16c7-1fdd-9f8285c0642b@FreeBSD.org>
 <CAM0tzX3RanY=vZbCXTAHB3=kv6aVkuzO5pmwr9g+ZQoe+N1hVg@mail.gmail.com>
 <be4d85ef-1bd4-d666-42cb-41ad1bc67dd8@FreeBSD.org>
 <21F62A27-17F2-4791-BFD5-99057D197E68@bway.net>
From: Warner Losh <imp@bsdimp.com>
Date: Thu, 29 Mar 2018 11:49:19 -0600
X-Google-Sender-Auth: YZHFGWB3m5l-9ZFd1jOxeLgSypo
Message-ID: <CANCZdfoqdtV-WrVCNn6EjV8+ottwN7xHU-TSLUo6TR8-si43NA@mail.gmail.com>
Subject: Re: smart(8) Call for Testing
To: Charles Sprickman <spork@bway.net>
Cc: Lev Serebryakov <lev@freebsd.org>,
 Tom Evans via freebsd-fs <freebsd-fs@freebsd.org>
Content-Type: text/plain; charset="UTF-8"
Content-Transfer-Encoding: quoted-printable
X-Content-Filtered-By: Mailman/MimeDel 2.1.25
X-BeenThere: freebsd-fs@freebsd.org
X-Mailman-Version: 2.1.25
Precedence: list
List-Id: Filesystems <freebsd-fs.freebsd.org>
List-Unsubscribe: <https://lists.freebsd.org/mailman/options/freebsd-fs>,
 <mailto:freebsd-fs-request@freebsd.org?subject=unsubscribe>
List-Archive: <http://lists.freebsd.org/pipermail/freebsd-fs/>
List-Post: <mailto:freebsd-fs@freebsd.org>
List-Help: <mailto:freebsd-fs-request@freebsd.org?subject=help>
List-Subscribe: <https://lists.freebsd.org/mailman/listinfo/freebsd-fs>,
 <mailto:freebsd-fs-request@freebsd.org?subject=subscribe>
X-List-Received-Date: Thu, 29 Mar 2018 17:49:22 -0000

On Thu, Mar 29, 2018 at 11:37 AM, Charles Sprickman via freebsd-fs <
freebsd-fs@freebsd.org> wrote:
>
> > But all my dead HDDs were replaced on self-test fail - it is what
> > allows me to replace them BEFORE data were lost.
>
> Yep, lots of folks claim the data is useless, but generally I see some signs of
> failure before the drive dies, and sometimes those signs are spotted because
> smartd is triggering regular self-tests.  And on SSDs, watching the MWI seems
> to work very well - these drives are much smarter (no pun intended) than spinny
> disks.


SMART lives in that area between "not reliably useful" and "sometimes
interesting". It's a kinda good enough system that kinda sorta signals
things, sometimes, if you are lucky.

We've found at $WORK that many of the metrics are suggestive and help us
monitor overall storage health, but only because we look at specific ones,
and look for trends and outliers from the rest of the herd. For that it can
be mildly useful. For example, we found that the %life used jumped suddenly
on some systems that had new firmware deployed and discovered an overly
aggressive writing bug in our control software (to be fair, it was in the
database back end rebalancing tables for each row insert due to bugs in it,
so a 100MB table wound up generating 100GB in writes). We've also used it
to identify certain machines with excessively high write amp which turned
out to be a different issue that was easily fixed. If you know what to look
for, and have a lot of experience with the drives, the SMART data can be
quite useful. So it's useful, but not without some experience and a very
large sample to use to find outliers.
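
To illustrate the "outliers from the rest of the herd" idea, here is a
minimal sketch, not the tooling described above. It assumes you have already
collected one SMART-derived number per drive (say, %life used or a write amp
estimate); the drive names, values, threshold, and the function name
find_outliers are all hypothetical. It uses a median-based score so a few
bad drives do not drag the baseline toward themselves.

```python
from statistics import median


def find_outliers(values, threshold=3.5):
    """Return the sorted keys whose value sits far from the fleet median.

    `values` maps a drive name to one metric sample. The score is the
    modified z-score, based on the median absolute deviation (MAD), which
    is less distorted by the outliers themselves than mean/stddev.
    The 3.5 cutoff is a common rule of thumb, not a tuned value.
    """
    med = median(values.values())
    mad = median(abs(v - med) for v in values.values())
    if mad == 0:
        # More than half the fleet reports the same value, so anything
        # that differs from it at all is flagged.
        return sorted(k for k, v in values.items() if v != med)
    return sorted(
        k for k, v in values.items()
        if 0.6745 * abs(v - med) / mad > threshold
    )


# Hypothetical %life-used readings for six drives in one fleet.
fleet = {"da0": 3, "da1": 4, "da2": 3, "da3": 5, "da4": 4, "da5": 41}
suspects = find_outliers(fleet)
print(suspects)
```

With only six drives this is a toy; as noted above, the approach needs a
very large sample before the outliers mean much.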

We don't bother to use it for drive failure. While scanning is nice, it's
too invasive to do on a regular basis. Sometimes we use it to force errors
on drives we already suspect of being bad, but usually we run the drive
until it fails then throw the data that was on it away (Work is Netflix
Open Connect caching servers, so we lose nothing if we dump the data since
it's just copies of copies). Once the drive fails (or becomes too
unreliable short of total failure), we fail it in place and just ignore it
from that point forward and suffer from reduced capacity. But failures are
driven by actual I/O errors, not by SMART data.
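
A rough sketch of what "failures driven by actual I/O errors" could look
like, under assumptions that are not from the setup described above: the
class name DriveHealth, the error budget, and the time window are all made
up for illustration. The drive is marked failed in place once too many I/O
errors land inside a sliding window, and SMART data is never consulted.

```python
from collections import deque


class DriveHealth:
    """Tracks recent I/O errors for one drive (hypothetical policy)."""

    def __init__(self, max_errors=5, window_s=3600.0):
        self.max_errors = max_errors  # assumed error budget
        self.window_s = window_s      # assumed sliding window, in seconds
        self.errors = deque()         # timestamps of recent errors
        self.failed = False

    def record_io_error(self, now):
        """Record one I/O error at time `now`; return True once failed."""
        if self.failed:
            # Failed in place: stays failed, further errors are ignored.
            return True
        self.errors.append(now)
        # Drop errors that have aged out of the sliding window.
        while self.errors and now - self.errors[0] > self.window_s:
            self.errors.popleft()
        if len(self.errors) >= self.max_errors:
            self.failed = True
        return self.failed


drive = DriveHealth(max_errors=3, window_s=10.0)
results = [drive.record_io_error(t) for t in (0.0, 5.0, 20.0, 21.0, 22.0)]
print(results)
```

The first two errors age out before the third arrives, so the drive is only
failed after three errors land close together.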

Warner