Date:      Sat, 17 Jul 2021 09:46:06 -0600
From:      Warner Losh <imp@bsdimp.com>
To:        Graham Perrin <grahamperrin@gmail.com>
Cc:        Current FreeBSD <freebsd-current@freebsd.org>
Subject:   Re: nvme(4) losing control, and subsequent use of fsck_ffs(8) with UFS
Message-ID:  <CANCZdfosGvJvaa04=4FxYuj1chhMyiD162bwksSNJQvoMoxsgw@mail.gmail.com>
In-Reply-To: <994d22b5-c8b7-1183-8198-47b8251e896e@gmail.com>
References:  <994d22b5-c8b7-1183-8198-47b8251e896e@gmail.com>


On Sat, Jul 17, 2021 at 6:33 AM Graham Perrin <grahamperrin@gmail.com>
wrote:

> When the file system is stress-tested, it seems that the device (an
> internal drive) is lost.
>

This is most likely a drive problem. Netflix pushes half a dozen different
lower-end models of NVMe drives to their physical limits without seeing
issues like this.

That said, our screening process screens out several low-quality drives
that just lose their minds from time to time.


> A recent photograph:
>
> <https://photos.app.goo.gl/wB7gZKLF5PQzusrz7>
>
> Transcribed manually:
>
> nvme0: Resetting controller due to a timeout.
> nvme0: resetting controller
> nvme0: controller ready did not become 0 within 5500 ms
>

Here the controller failed hard: we were unable to reset it within the
allotted 5.5 seconds. One might be able to tweak the timeouts to cope
better with the drive. Do you have to power cycle to get it to respond
again?
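
A minimal sketch of such a tweak, assuming a FreeBSD release whose nvme(4)
driver honors the hw.nvme.timeout_period loader tunable (verify the knob and
its allowed range against nvme(4) or the driver source for your release
before relying on it):

```sh
# /boot/loader.conf -- hypothetical tuning, not a fix for bad hardware.
# Give I/O commands longer to complete (in seconds) before the driver
# declares a timeout and resets the controller.
hw.nvme.timeout_period="60"
```

This only papers over a drive that is slow to respond; it cannot rescue a
controller that never comes back from a reset.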


> nvme0: failing outstanding i/o
> nvme0: WRITE sqid:2 cid:115 nsid:1 lba:296178856 len:64
> nvme0: ABORTED - BY REQUEST (00/07) sqid:2 cid:115 cdw0:0
> g_vfs_done():nvd0p2[WRITE(offset=151370924032, length=32768)]error = 6
> UFS: forcibly unmounting /dev/nvd0p2 from /
> nvme0: failing outstanding i/o
>
> … et cetera.
>
> Is this a sure sign of a hardware problem? Or must I do something
> special to gain reliability under stress?
>

It's most likely a hardware problem. That said, I've been working on
patches to improve recovery when errors like this happen.


> I don't know how to interpret parts of the manual page for nvme(4). There's
> direction to include this line in loader.conf(5):
>
> nvme_load="YES"
>
> – however when I used kldload(8), it seemed that the module was already
> loaded, or in kernel.
>

Yes. If you are using the drive at all, you already have the driver; the
loader.conf line only matters for kernels that don't compile it in.
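
To confirm the driver is present, whether loaded as a module or built into
the kernel, kldstat can query it by module name; a quick check, assuming a
stock kernel:

```sh
# Reports the module's id and refcount if nvme is registered,
# and exits with an error if it is not present at all.
kldstat -m nvme
```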


> Using StressDisk:
>
> <https://github.com/ncw/stressdisk>
>
> – failures typically occur after around six minutes of testing.
>

Do you have a number of these drives, or is it just this one bad apple?


> The drive is very new, less than 2 TB written:
>
> <https://bsd-hardware.info/?probe=7138e2a9e7&log=smartctl>
>
> I do suspect a hardware problem, because two prior installations of
> Windows 10 became non-bootable.
>

That's likely a huge red flag.


> Also: I find peculiarities with use of fsck_ffs(8), which I can describe
> later. Maybe to be expected, if there's a problem with the drive.
>

You can ask Kirk, but if data isn't written to the drive when the firmware
crashes, then there may be data loss.
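
One low-risk way to gauge the damage after such a forced unmount is a
read-only fsck pass; a sketch, assuming the same /dev/nvd0p2 device from
the log above:

```sh
# -n answers "no" to every repair prompt and keeps the file system
# read-only, so nothing is written to the (possibly failing) drive
# while you survey the inconsistencies.
fsck_ffs -n /dev/nvd0p2
```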

Warner

