From: George Michaelson
Date: Thu, 25 Jun 2020 11:48:45 +1000
Subject: Dell HBA, ECC reporting and ZFS ECC in zpool status
To: freebsd-fs@freebsd.org

I have three Dell
hosts, 730 and 840 series, each with a Dell-branded LSI HBA. All of them were upgraded to 12.1 recently, and over time they started reporting a large number of correctable ECC error states in zpool status. Some of these turned into unrecoverable errors, and after disk replacement the pool needed multiple scrubs. But not all of them did, so the ECC report didn't actually map well to "disk is failing" in a hard sense.

Reading Dell's support site, I found a page where they 'fess up that they promote corrected ECC states in the drive upward in a way which *may* be getting collected by ZFS and reported as errors, where there isn't actually a hard 'impending doom' signal coming. I don't actually know that this disk-level ECC is what ZFS is reporting to me. I do know that I got a high-cost ECC correction load in user space and wound up having to re-scrub repeatedly to get the zpool clean.

https://www.dell.com/support/article/en-au/sln316623/excessive-smart-error-rates-logged-for-read-and-verify-ecc-errors-on-certain-enterprise-hard-drives?lang=en

I'm very confused about what to do here. After some firmware update and then a zpool scrub, the error states in the zpool are now cleared. By moving to the mrsas driver I can now run SMART on the disks at runtime, but at the cost of losing mfiutil-type HBA interactions: I can't mark drives into a valid/good state at runtime any more, because that control logic doesn't appear to be in the mrsas command model. It's camcontrol.

Did something change here? The machines were on various states of 11 and 12.0 before this, and it never cropped up like this: millions of ECC-corrected events in the zpool. We were worried enough to get replacement drives on order before Dell pointed us to this web page.

BTW, my track record for PBCK on these lists is very high. If you (dear reader) push back with 'you lack clue to do the job at hand' I would not deny it: 40 years a user doesn't make one a sysadmin.

-G