Date: Thu, 29 Mar 2018 11:49:19 -0600
From: Warner Losh <imp@bsdimp.com>
To: Charles Sprickman <spork@bway.net>
Cc: Lev Serebryakov <lev@freebsd.org>, Tom Evans via freebsd-fs <freebsd-fs@freebsd.org>
Subject: Re: smart(8) Call for Testing
Message-ID: <CANCZdfoqdtV-WrVCNn6EjV8+ottwN7xHU-TSLUo6TR8-si43NA@mail.gmail.com>
In-Reply-To: <21F62A27-17F2-4791-BFD5-99057D197E68@bway.net>
References: <4754cb2f-76bb-a69b-0cf5-eff4d621eb29@callfortesting.org>
 <CAMXt9NbdN119RrHnZHOJD1T+HNLLpzgkKVStyTm=49dopBMoAQ@mail.gmail.com>
 <CAM0tzX1oTWTa0Nes11yXg5x4c30MmxdUyT6M1_c4-PWv2+Qbhw@mail.gmail.com>
 <CAMXt9NYMrtTNqNSx256mcYsPo48xnsa+CCYSoeFLzRsc+fQWMw@mail.gmail.com>
 <CAM0tzX32v2-=saT5iB4WVcsoVOtH+XE0OQoP7hEDB1xE+xk+sg@mail.gmail.com>
 <1d3f2cef-4c37-782e-7938-e0a2eebc8842@quip.cz>
 <A548BC90-815C-4C66-8E27-9A6F7480741D@bway.net>
 <7ED27465-1BC2-4522-873E-9ECE192EB7A2@ultra-secure.de>
 <e54ab9a7-835d-16c7-1fdd-9f8285c0642b@FreeBSD.org>
 <CAM0tzX3RanY=vZbCXTAHB3=kv6aVkuzO5pmwr9g+ZQoe+N1hVg@mail.gmail.com>
 <be4d85ef-1bd4-d666-42cb-41ad1bc67dd8@FreeBSD.org>
 <21F62A27-17F2-4791-BFD5-99057D197E68@bway.net>
On Thu, Mar 29, 2018 at 11:37 AM, Charles Sprickman via freebsd-fs
<freebsd-fs@freebsd.org> wrote:

> > But all my dead HDDs were replaced on self-test fail -- it is what
> > allows me to replace them BEFORE data were lost.
>
> Yep, lots of folks claim the data is useless, but generally I see some
> signs of failure before the drive dies, and sometimes those signs are
> spotted because smartd is triggering regular self-tests. And on SSDs,
> watching the MWI seems to work very well - these drives are much smarter
> (no pun intended) than spinny disks.

SMART lives in that area between "not reliably useful" and "sometimes
interesting". It's a kinda good enough system that kinda sorta signals
things, sometimes, if you are lucky.

We've found at $WORK that many of the metrics are suggestive and help us
monitor overall storage health, but only because we look at specific ones
and look for trends and outliers from the rest of the herd. For that it
can be mildly useful. For example, we found that the %life used jumped
suddenly on some systems that had new firmware deployed, and discovered an
overly aggressive writing bug in our control software (to be fair, it was
in the database back end, which was rebalancing tables on each row insert
due to bugs in it, so a 100MB table wound up generating 100GB in writes).
We've also used it to identify certain machines with excessively high
write amplification, which turned out to be a different issue that was
easily fixed. If you know what to look for, and have a lot of experience
with the drives, the SMART data can be quite useful. So it's useful, but
not without some experience and a very large sample from which to spot
outliers.

We don't bother to use it to predict drive failure. While scanning is
nice, it's too invasive to do on a regular basis. Sometimes we use it to
force errors on drives we already suspect of being bad, but usually we run
the drive until it fails and then throw away the data that was on it
($WORK is Netflix Open Connect caching servers, so we lose nothing if we
dump the data, since it's just copies of copies). Once the drive fails (or
becomes too unreliable short of total failure), we fail it in place,
ignore it from that point forward, and suffer the reduced capacity. But
failures are driven by actual I/O errors, not by SMART data.

Warner
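[Editor's note: to make the herd-comparison approach above concrete, here
is a minimal Python sketch. It assumes per-drive SMART attributes have
already been scraped (for example with "smartctl -A" from smartmontools)
into plain name-to-value maps; the drive names, counter values, and the
pair of write counters below are illustrative placeholders, not real
fleet telemetry or Netflix's actual tooling.]

#!/usr/bin/env python3
# Sketch: flag drives whose SMART-derived metrics are outliers relative
# to the rest of the fleet. Attribute names/values are placeholders;
# vendors disagree on SMART IDs and labels.
from statistics import median

def mad_outliers(samples, cutoff=3.5):
    """Return the drives whose value sits far from the fleet median,
    scored by median absolute deviation (robust to a few bad drives)."""
    values = list(samples.values())
    med = median(values)
    mad = median(abs(v - med) for v in values) or 1e-9
    return {name: val for name, val in samples.items()
            if abs(val - med) / mad > cutoff}

# Hypothetical fleet snapshot: percent of rated NAND life consumed.
life_used = {
    "da0": 12, "da1": 13, "da2": 11, "da3": 12,
    "da4": 61,  # the kind of sudden jump a runaway writer produces
}

# Write amplification ~= bytes written to NAND / bytes written by host,
# from two lifetime counters many SSDs expose (names vary by vendor).
nand_writes = {"da0": 40e12, "da1": 41e12, "da2": 39e12, "da3": 120e12}
host_writes = {"da0": 30e12, "da1": 30e12, "da2": 30e12, "da3": 31e12}
write_amp = {d: nand_writes[d] / host_writes[d] for d in nand_writes}

for metric, samples in (("life_used", life_used), ("write_amp", write_amp)):
    for drive, value in mad_outliers(samples).items():
        print(f"{metric}: {drive} = {value:.2f} (fleet median "
              f"{median(samples.values()):.2f}) -- worth a look")

[The median absolute deviation is used rather than the standard deviation
so that a single wildly bad drive cannot inflate the spread enough to hide
itself; given a large enough fleet, the same test applies to any per-drive
counter you care to track.]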