Subject: Re: nvme detached
To: Dan Langille
From: Graham Perrin
Cc: freebsd-questions@freebsd.org
Message-ID: <3b332fd8-24be-5a2f-15a8-630edb2a7226@gmail.com>
Date: Wed, 4 Aug 2021 18:35:20 +0100

On 04/08/2021 18:08, Dan Langille wrote:
> Yesterday I had an NVMe stick detach. This degraded a zpool but zpool status indicated the device was still online. Yet it was not visible in /dev/.
>
> More details are at https://gist.github.com/dlangille/bc8af0f5a098d3a106fa5fbf40a88d42
>
> I first noticed the issue with multiple ssh sessions freezing up.
>
> Then Nagios started alerting. A reboot cleared this up. Scrubs did not find any errors.
>
> The /var/log/messages entries are below.
>
> Thank you.
>
> Aug 3 15:06:02 knew kernel: nvme0: Resetting controller due to a timeout.
> Aug 3 15:06:02 knew kernel: nvme0: resetting controller
> Aug 3 15:06:32 knew kernel: nvme0: controller ready did not become 0 within 30500 ms
> Aug 3 15:06:32 knew kernel: nvme0: failing queued i/o
> Aug 3 15:06:32 knew kernel: nvme0: IDENTIFY (06) sqid:0 cid:0 nsid:0 cdw10:00000001 cdw11:00000000
> Aug 3 15:06:32 knew kernel: nvme0: ABORTED - BY REQUEST (00/07) sqid:0 cid:0 cdw0:0
> Aug 3 15:06:32 knew kernel: nvme0: failing outstanding i/o
> Aug 3 15:06:32 knew kernel: nvme0: READ sqid:2 cid:123 nsid:1 lba:250153507 len:5
> Aug 3 15:06:32 knew kernel: nvme0: ABORTED - BY REQUEST (00/07) sqid:2 cid:123 cdw0:0
> Aug 3 15:06:32 knew kernel: nvme0: failing outstanding i/o
> Aug 3 15:06:32 knew kernel: nvme0: WRITE sqid:3 cid:118 nsid:1 lba:454009346 len:1
> Aug 3 15:06:32 knew kernel: nvme0: ABORTED - BY REQUEST (00/07) sqid:3 cid:118 cdw0:0
> Aug 3 15:06:32 knew kernel: nvme0: failing outstanding i/o
> Aug 3 15:06:32 knew kernel: nvme0: WRITE sqid:4 cid:122 nsid:1 lba:454009345 len:1
> Aug 3 15:06:32 knew kernel: nvme0: ABORTED - BY REQUEST (00/07) sqid:4 cid:122 cdw0:0
> Aug 3 15:06:32 knew kernel: nvd0: detached
>

The STATE peculiarity aside: if you have a spare to replace what's currently at nvd0, I would put it in place. Then stress test the removed stick, to tell whether it's good for reuse.

A normal run of StressDisk might be enough to expose a problem; I recently had a new drive (less than 100 hours' use) that failed consistently after around seven minutes of the run (before filling the UFS file system).
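
For anyone unsure what such a stress test does in principle, here is a rough sketch of the write-then-verify idea in Python. It is not how any particular tool works; the function names, the block size and the file name stress.bin are all made up for illustration. Bear in mind that a read-back shortly after writing may be served from cache rather than from the device, so a real test writes far more data than the machine has RAM.

```python
import hashlib
import os
import tempfile
from typing import List


def make_block(seed: int, index: int, size: int) -> bytes:
    """Return a deterministic pseudo-random block that can be regenerated later."""
    return hashlib.shake_256(f"{seed}:{index}".encode()).digest(size)


def write_and_verify(path: str, blocks: int,
                     block_size: int = 1 << 20, seed: int = 0) -> List[int]:
    """Write `blocks` blocks to `path`, read them back, return indexes that differ."""
    with open(path, "wb") as f:
        for i in range(blocks):
            f.write(make_block(seed, i, block_size))
        f.flush()
        os.fsync(f.fileno())  # ask the OS to push the data to the device

    bad = []
    with open(path, "rb") as f:
        for i in range(blocks):
            # Regenerate the expected block instead of keeping it in memory.
            if f.read(block_size) != make_block(seed, i, block_size):
                bad.append(i)
    return bad


if __name__ == "__main__":
    # Hypothetical usage: replace the temporary directory with a directory
    # on a file system that lives on the drive under suspicion.
    with tempfile.TemporaryDirectory() as d:
        mismatches = write_and_verify(os.path.join(d, "stress.bin"), blocks=8)
        print("mismatching blocks:", mismatches)
```

A drive like the one quoted above would more likely show up as I/O errors (an OSError from write, fsync or read) than as silent mismatches, so either outcome counts as a failed run.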