From: Warner Losh
Date: Thu, 29 Mar 2018 11:49:19 -0600
Subject: Re: smart(8) Call for Testing
To: Charles Sprickman
Cc: Lev Serebryakov, Tom Evans via freebsd-fs

On Thu, Mar 29, 2018 at 11:37 AM, Charles Sprickman via freebsd-fs
<freebsd-fs@freebsd.org> wrote:

> > But all my dead HDDs were replaced on self-test fail - it is what
> > allows me to replace them BEFORE data were lost.
>
> Yep, lots of folks claim the data is useless, but generally I see some
> signs of failure before the drive dies, and sometimes those signs are
> spotted because smartd is triggering regular self-tests.
> And on SSDs, watching the MWI seems to work very well - these drives
> are much smarter (no pun intended) than spinny disks.

SMART lives in that area between "not reliably useful" and "sometimes
interesting". It's a kinda good enough system that kinda sorta signals
things, sometimes, if you are lucky. We've found at $WORK that many of
the metrics are suggestive and help us monitor overall storage health,
but only because we look at specific ones, and look for trends and
outliers from the rest of the herd. For that it can be mildly useful.

For example, we found that the %life used jumped suddenly on some
systems that had new firmware deployed and discovered an overly
aggressive writing bug in our control software (to be fair, it was in
the database back end rebalancing tables for each row insert due to bugs
in it, so a 100MB table wound up generating 100GB in writes). We've also
used it to identify certain machines with excessively high write amp,
which turned out to be a different issue that was easily fixed. If you
know what to look for, and have a lot of experience with the drives, the
SMART data can be quite useful. So it's useful, but not without some
experience and a very large sample to use to find outliers.

We don't bother to use it for drive failure. While scanning is nice,
it's too invasive to do on a regular basis. Sometimes we use it to force
errors on drives we already suspect of being bad, but usually we run the
drive until it fails, then throw away the data that was on it (Work is
Netflix Open Connect caching servers, so we lose nothing if we dump the
data since it's just copies of copies). Once the drive fails (or becomes
too unreliable short of total failure), we fail it in place and just
ignore it from that point forward and suffer from reduced capacity. But
failures are driven by actual I/O errors, not by SMART data.

Warner
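
For anyone wanting to try the "outliers from the rest of the herd" idea,
here is a minimal sketch, not what is actually run at Netflix. It assumes
you already have one number per machine (say, %life used) collected by
whatever tooling you use; the host names, the `find_outliers` and
`write_amplification` helpers, and the 3.5 cutoff are all made up for
illustration. It flags machines far from the fleet median using the
median absolute deviation, and shows the write-amp arithmetic from the
100MB/100GB example.

```python
from statistics import median


def write_amplification(bytes_actually_written, bytes_of_payload):
    """Ratio of bytes that hit the drive to bytes of useful payload."""
    if bytes_of_payload <= 0:
        raise ValueError("bytes_of_payload must be positive")
    return bytes_actually_written / bytes_of_payload


def find_outliers(metric_by_host, threshold=3.5):
    """Return hosts whose metric is far from the fleet median.

    Uses the modified z-score (0.6745 * |x - median| / MAD). The
    threshold of 3.5 is a common rule of thumb, not a tuned value.
    """
    if not metric_by_host:
        return []
    values = list(metric_by_host.values())
    med = median(values)
    mad = median(abs(v - med) for v in values)
    if mad == 0:
        # The bulk of the fleet is identical; anything different stands out.
        return sorted(h for h, v in metric_by_host.items() if v != med)
    return sorted(
        h for h, v in metric_by_host.items()
        if 0.6745 * abs(v - med) / mad > threshold
    )


# Hypothetical fleet snapshot: percent of rated drive life used per host.
life_used = {
    "cache01": 12, "cache02": 13, "cache03": 11,
    "cache04": 12, "cache05": 41, "cache06": 13,
}
suspects = find_outliers(life_used)

# The example from the message: a 100MB table producing 100GB of writes.
amp = write_amplification(100 * 10**9, 100 * 10**6)
```

The median and MAD are used instead of mean and standard deviation so
that one badly behaving machine does not drag the baseline toward itself.
With only a handful of machines this is noisy; as noted above, it takes a
large sample before outliers mean much.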