From nobody Sat Jul 17 15:46:06 2021
From: Warner Losh
Date: Sat, 17 Jul 2021 09:46:06 -0600
Subject: Re: nvme(4) losing control, and subsequent use of fsck_ffs(8) with UFS
To: Graham Perrin
Cc: Current FreeBSD
In-Reply-To: <994d22b5-c8b7-1183-8198-47b8251e896e@gmail.com>
List-Archive: https://lists.freebsd.org/archives/freebsd-current

On Sat, Jul 17, 2021 at 6:33 AM Graham Perrin wrote:

> When the file system is stress-tested, it seems that the device (an
> internal drive) is lost.

This is most likely a drive problem. Netflix pushes half a dozen different
lower-end models of NVMe drives to their physical limits w/o seeing issues
like this. That said, our screening process screens out several low-quality
drives that just lose their minds from time to time.

> A recent photograph:
>
> Transcribed manually:
>
> nvme0: Resetting controller due to a timeout.
> nvme0: resetting controller
> nvme0: controller ready did not become 0 within 5500 ms

Here the controller failed hard. We were unable to reset it within 5
seconds. One might be able to tweak the timeouts to cope with the drive
better. Do you have to power cycle to get it to respond again?

> nvme0: failing outstanding i/o
> nvme0: WRITE sqid:2 cid:115 nsid:1 lba:296178856 len:64
> nvme0: ABORTED - BY REQUEST (00/07) sqid:2 cid:115 cdw0:0
> g_vfs_done():nvd0p2[WRITE(offset=151370924032, length=32768)]error = 6
> UFS: forcibly unmounting /dev/nvd0p2 from /
> nvme0: failing outstanding i/o
>
> … et cetera.
>
> Is this a sure sign of a hardware problem? Or must I do something
> special to gain reliability under stress?

It's most likely a hardware problem. That said, I've been working on
patches to make recovery better when errors like this happen.

> I don't know how to interpret parts of the manual page for nvme(4).
> There's direction to include this line in loader.conf(5):
>
> nvme_load="YES"
>
> – however when I used kldload(8), it seemed that the module was already
> loaded, or in kernel.

Yes. If you are using it at all, you have the driver.

> Using StressDisk:
>
> – failures typically occur after around six minutes of testing.

Do you have a number of these drives, or is it just this one bad apple?

> The drive is very new, less than 2 TB written:
>
> I do suspect a hardware problem, because two prior installations of
> Windows 10 became non-bootable.

That's likely a huge red flag.

> Also: I find peculiarities with use of fsck_ffs(8), which I can describe
> later. Maybe to be expected, if there's a problem with the drive.

You can ask Kirk, but if data isn't written to the drive when the firmware
crashes, then there may be data loss.

Warner
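
For readers hitting the same symptom, a minimal sketch of how one might check
the drive's own error reporting and experiment with the driver timeout Warner
mentions. The timeout sysctl name below is an assumption for illustration; the
knobs nvme(4) exposes differ between FreeBSD versions, so list what your
system actually has before changing anything:

  # Pull the controller's own health and error log pages (nvmecontrol is in base).
  nvmecontrol logpage -p 2 nvme0    # SMART / health information log
  nvmecontrol logpage -p 1 nvme0    # error information log

  # See which timeout-related knobs this FreeBSD version exposes for the controller.
  sysctl dev.nvme.0 | grep -i timeout

  # Assumed knob name, example only: if a timeout_period sysctl is present,
  # raising it gives a slow or flaky drive longer to answer before the driver
  # resets the controller.
  sysctl dev.nvme.0.timeout_period=60

None of this will rescue a drive whose firmware is crashing, but the log pages
usually make it obvious whether the device itself is reporting media or
internal errors.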