From: Warner Losh
Date: Sat, 31 Mar 2018 09:13:23 -0600
Subject: Re: smart(8) Call for Testing
To: Michael Dexter
Cc: Lev Serebryakov, Chuck Tuffli, Tom Evans via freebsd-fs

On Fri, Mar 30, 2018 at 11:16 PM, Michael Dexter wrote:

> On 3/29/18 6:43 AM, Lev Serebryakov wrote:
>
>> Monitoring of values and alerting is VERY important (number of
>> Relocations is main indicator of spinning HDD health and when it raises
>> it must be known ASAP)
>>
>
> Another metric that frequently came up during outreach was any sudden
> increase in disk latency, usually indicating that you have between one and
> 24 hours to replace the device.
> I am curious what people are doing now to
> determine such changes in latency and where they feel such monitoring
> should exist in the stack.
>

Netflix has a monitoring program that uses gstat to gather average latency
stats and send them to our centralized data-collection store. It's really
only by looking at the long-term trend that you'll see the spike in retries
that manifests itself as bigger latencies. One problem with gstat, though,
is that it includes software queueing time. For many things that's fine,
but when you are trying to determine whether a spike is due to extra load
on the device or to some hardware problem, it becomes bothersome. The CAM
I/O scheduler, when the dynamic scheduler is enabled, keeps all kinds of
stats about device latency, including a cumulative latency histogram. Those
are also useful things to look at.

At Netflix, though, we let the disk fail and then mark it as disabled. We
don't look for trends to predict possible failure because we have a
fail-in-place model that doesn't care if there's data loss: all the data on
the machine is replicated from a central source of truth and can easily be
replaced.

> As for SNMP and friends, I consider those way up the stack with tools like
> smart(8) simply providing a building block.

In many ways, our data collection thing at work is an alternative to SNMP.

Warner
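
A minimal sketch of the long-term trend comparison described above. It is an illustration only and not the Netflix tooling: the function name, the window size, the threshold factor, and the sample values are all hypothetical. It assumes per-interval average latencies in milliseconds have already been collected (for example from gstat output) and does not separate software queueing time from device time.

```python
from statistics import mean, median


def latency_spike(samples_ms, recent=5, factor=3.0):
    """Return True if the mean of the last `recent` samples exceeds
    `factor` times the median of all earlier samples.

    `recent` and `factor` are hypothetical tuning knobs, not values
    taken from any real deployment.
    """
    if len(samples_ms) <= recent:
        return False  # not enough history to form a baseline
    baseline = median(samples_ms[:-recent])  # long-term baseline
    current = mean(samples_ms[-recent:])     # most recent behaviour
    return baseline > 0 and current > factor * baseline


# Invented sample data: average latency per collection interval, in ms.
history = [4.1, 3.9, 4.3, 4.0, 4.2, 4.1, 3.8, 4.0, 4.4, 4.1]
steady = history + [4.2, 4.0, 4.3, 4.1, 3.9]
degrading = history + [9.5, 14.0, 22.0, 31.0, 40.0]

print(latency_spike(steady))
print(latency_spike(degrading))
```

The median is used for the baseline so that a few earlier outliers do not inflate it; a real collector would likely keep a much longer history than this example does.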