Date: Sat, 17 Jul 2021 09:46:06 -0600
From: Warner Losh <imp@bsdimp.com>
To: Graham Perrin <grahamperrin@gmail.com>
Cc: Current FreeBSD <freebsd-current@freebsd.org>
Subject: Re: nvme(4) losing control, and subsequent use of fsck_ffs(8) with UFS
Message-ID: <CANCZdfosGvJvaa04=4FxYuj1chhMyiD162bwksSNJQvoMoxsgw@mail.gmail.com>
In-Reply-To: <994d22b5-c8b7-1183-8198-47b8251e896e@gmail.com>
References: <994d22b5-c8b7-1183-8198-47b8251e896e@gmail.com>
On Sat, Jul 17, 2021 at 6:33 AM Graham Perrin <grahamperrin@gmail.com> wrote:

> When the file system is stress-tested, it seems that the device (an
> internal drive) is lost.

This is most likely a drive problem. Netflix pushes half a dozen
different lower-end models of NVMe drives to their physical limits
without seeing issues like this. That said, our screening process
screens out several low-quality drives that just lose their minds from
time to time.

> A recent photograph:
>
> <https://photos.app.goo.gl/wB7gZKLF5PQzusrz7>
>
> Transcribed manually:
>
> nvme0: Resetting controller due to a timeout.
> nvme0: resetting controller
> nvme0: controller ready did not become 0 within 5500 ms

Here the controller failed hard: we were unable to reset it within
5.5 seconds. One might be able to tweak the timeouts to cope with the
drive better. Do you have to power cycle the machine to get it to
respond again?

> nvme0: failing outstanding i/o
> nvme0: WRITE sqid:2 cid:115 nsid:1 lba:296178856 len:64
> nvme0: ABORTED - BY REQUEST (00/07) sqid:2 cid:115 cdw0:0
> g_vfs_done():nvd0p2[WRITE(offset=151370924032, length=32768)]error = 6
> UFS: forcibly unmounting /dev/nvd0p2 from /
> nvme0: failing outstanding i/o
>
> … et cetera.
>
> Is this a sure sign of a hardware problem? Or must I do something
> special to gain reliability under stress?

It's most likely a hardware problem. That said, I've been working on
patches to improve recovery when errors like this happen.

> I don't know how to interpret parts of the manual page for nvme(4).
> There's direction to include this line in loader.conf(5):
>
> nvme_load="YES"
>
> – however when I used kldload(8), it seemed that the module was
> already loaded, or in kernel.

Yes. If you are using it at all, you have the driver.
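[Editor's note, not part of the original reply: a sketch of where to look for the driver and the timeout knobs Warner alludes to. The exact tunable names below are assumptions to verify against the nvme(4) manual page and `sysctl -d` output on your release; this is a host-specific command/config fragment, not a definitive recipe.]

```shell
# Confirm the nvme driver is present (a nonzero count means it is
# either loaded as a module or compiled into the kernel, which is why
# kldload(8) reports it as already loaded):
kldstat -v | grep -c nvme

# Inspect the per-controller I/O timeout, in seconds (sysctl name
# assumed; check `sysctl -d dev.nvme.0` on your system):
sysctl dev.nvme.0.timeout_period

# To raise it at boot, add the corresponding loader tunable to
# /boot/loader.conf (name assumed; verify against nvme(4)):
#   hw.nvme.timeout_period="60"
```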
> Using StressDisk:
>
> <https://github.com/ncw/stressdisk>
>
> – failures typically occur after around six minutes of testing.

Do you have a number of these drives, or is it just this one bad apple?

> The drive is very new, less than 2 TB written:
>
> <https://bsd-hardware.info/?probe=7138e2a9e7&log=smartctl>
>
> I do suspect a hardware problem, because two prior installations of
> Windows 10 became non-bootable.

That's likely a huge red flag.

> Also: I find peculiarities with use of fsck_ffs(8), which I can
> describe later. Maybe to be expected, if there's a problem with the
> drive.

You can ask Kirk, but if data isn't written to the drive when the
firmware crashes, then there may be data loss.

Warner
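[Editor's note, not part of the original reply: for anyone reproducing this, a minimal sketch of checking the file system after such a forced unmount, assuming the /dev/nvd0p2 device node from the log above. This is a host-specific command fragment; verify the device name and flags against fsck_ffs(8) before running it.]

```shell
# Run from single-user mode, or with the file system unmounted.
# -f forces a full check even if the file system is marked clean
# (bypassing the soft-updates journal); -y answers yes to all repairs.
fsck_ffs -f -y /dev/nvd0p2
```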