Date:      Wed, 4 Aug 2021 18:35:20 +0100
From:      Graham Perrin <grahamperrin@gmail.com>
To:        Dan Langille <dan@langille.org>
Cc:        freebsd-questions@freebsd.org
Subject:   Re: nvme detached
Message-ID:  <3b332fd8-24be-5a2f-15a8-630edb2a7226@gmail.com>
In-Reply-To: <a703ce19-ea5d-48ca-8fc6-c1f1418e3131@www.fastmail.com>
References:  <a703ce19-ea5d-48ca-8fc6-c1f1418e3131@www.fastmail.com>

On 04/08/2021 18:08, Dan Langille wrote:
> Yesterday I had an NVMe stick detach.  This degraded a zpool, but zpool status indicated the device was still online. Yet it was not visible in /dev/.
>
> More details are at https://gist.github.com/dlangille/bc8af0f5a098d3a106fa5fbf40a88d42
>
> I first noticed the issue with multiple ssh sessions freezing up.
>
> Then Nagios started alerting. A reboot cleared this up. Scrubs did not find any errors.
>
> The /var/log/messages entries are below.
>
> Thank you.
>
> Aug  3 15:06:02 knew kernel: nvme0: Resetting controller due to a timeout.
> Aug  3 15:06:02 knew kernel: nvme0: resetting controller
> Aug  3 15:06:32 knew kernel: nvme0: controller ready did not become 0 within 30500 ms
> Aug  3 15:06:32 knew kernel: nvme0: failing queued i/o
> Aug  3 15:06:32 knew kernel: nvme0: IDENTIFY (06) sqid:0 cid:0 nsid:0 cdw10:00000001 cdw11:00000000
> Aug  3 15:06:32 knew kernel: nvme0: ABORTED - BY REQUEST (00/07) sqid:0 cid:0 cdw0:0
> Aug  3 15:06:32 knew kernel: nvme0: failing outstanding i/o
> Aug  3 15:06:32 knew kernel: nvme0: READ sqid:2 cid:123 nsid:1 lba:250153507 len:5
> Aug  3 15:06:32 knew kernel: nvme0: ABORTED - BY REQUEST (00/07) sqid:2 cid:123 cdw0:0
> Aug  3 15:06:32 knew kernel: nvme0: failing outstanding i/o
> Aug  3 15:06:32 knew kernel: nvme0: WRITE sqid:3 cid:118 nsid:1 lba:454009346 len:1
> Aug  3 15:06:32 knew kernel: nvme0: ABORTED - BY REQUEST (00/07) sqid:3 cid:118 cdw0:0
> Aug  3 15:06:32 knew kernel: nvme0: failing outstanding i/o
> Aug  3 15:06:32 knew kernel: nvme0: WRITE sqid:4 cid:122 nsid:1 lba:454009345 len:1
> Aug  3 15:06:32 knew kernel: nvme0: ABORTED - BY REQUEST (00/07) sqid:4 cid:122 cdw0:0
> Aug  3 15:06:32 knew kernel: nvd0: detached
>
The STATE peculiarity aside: if you have a spare to replace what's 
currently at nvd0, I would put it in place.

Then stress test the removed stick, to tell whether it's good for reuse.

A normal run of StressDisk might be enough to expose a problem; I 
recently had a new drive (less than 100 hours' use) that failed 
consistently after around seven minutes of the run (before filling the 
UFS file system).
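
For illustration only, the general idea behind a write-then-verify 
stress test can be sketched in Python as below. This is not how any 
particular tool is implemented; the function names (block_for, 
write_pass, verify_pass, stress), the file name stress.bin, the seed 
and the block sizes are all hypothetical choices of mine. It writes 
deterministic pseudo-random blocks to a file on the file system under 
test, then reads them back and reports any block that differs.

```python
import hashlib
import os


def block_for(seed: bytes, index: int, size: int) -> bytes:
    """Derive the expected contents of block `index` from the seed."""
    out = bytearray()
    counter = 0
    while len(out) < size:
        material = seed + index.to_bytes(8, "big") + counter.to_bytes(8, "big")
        out += hashlib.sha256(material).digest()
        counter += 1
    return bytes(out[:size])


def write_pass(path: str, seed: bytes, blocks: int, size: int) -> None:
    """Write `blocks` blocks of test data and ask the OS to flush them."""
    with open(path, "wb") as f:
        for i in range(blocks):
            f.write(block_for(seed, i, size))
        f.flush()
        os.fsync(f.fileno())


def verify_pass(path: str, seed: bytes, blocks: int, size: int) -> list:
    """Return the indices of blocks that do not match what was written."""
    bad = []
    with open(path, "rb") as f:
        for i in range(blocks):
            if f.read(size) != block_for(seed, i, size):
                bad.append(i)
    return bad


def stress(directory: str, blocks: int = 64, size: int = 64 * 1024,
           passes: int = 2, seed: bytes = b"hypothetical-seed") -> list:
    """Write one test file in `directory`, verify it `passes` times,
    and return every mismatching block index found."""
    path = os.path.join(directory, "stress.bin")  # hypothetical file name
    bad = []
    try:
        write_pass(path, seed, blocks, size)
        for _ in range(passes):
            bad.extend(verify_pass(path, seed, blocks, size))
    finally:
        if os.path.exists(path):
            os.remove(path)
    return bad
```

Calling stress() with a directory on the suspect drive returns an 
empty list when every block reads back intact. Two caveats: with a 
small file the reads may be served from the operating system's cache 
rather than the device, so a meaningful test needs far more data than 
the defaults here; and a drive that drops off the bus mid-run will 
more likely show up as an I/O error (an OSError) than as a mismatch.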



