From nobody Sat Jul 17 15:46:06 2021
From: Warner Losh
Date: Sat, 17 Jul 2021 09:46:06 -0600
Subject: Re: nvme(4) losing control, and subsequent use of fsck_ffs(8) with UFS
To: Graham Perrin
Cc: Current FreeBSD
In-Reply-To: <994d22b5-c8b7-1183-8198-47b8251e896e@gmail.com>
List-Archive: https://lists.freebsd.org/archives/freebsd-current

On Sat, Jul 17, 2021 at 6:33 AM Graham Perrin wrote:

> When the file system is stress-tested, it seems that the device (an
> internal drive) is lost.

This is most likely a drive problem. Netflix pushes half a dozen different
lower-end models of NVMe drives to their physical limits w/o seeing issues
like this. That said, our screening process screens out several low-quality
drives that just lose their minds from time to time.

> A recent photograph:
>
> Transcribed manually:
>
> nvme0: Resetting controller due to a timeout.
> nvme0: resetting controller
> nvme0: controller ready did not become 0 within 5500 ms

Here the controller failed hard. We were unable to reset it within 5
seconds. One might be able to tweak the timeouts to cope with the drive
better. Do you have to power cycle to get it to respond again?

> nvme0: failing outstanding i/o
> nvme0: WRITE sqid:2 cid:115 nsid:1 lba:296178856 len:64
> nvme0: ABORTED - BY REQUEST (00/07) sqid:2 cid:115 cdw0:0
> g_vfs_done():nvd0p2[WRITE(offset=151370924032, length=32768)]error = 6
> UFS: forcibly unmounting /dev/nvd0p2 from /
> nvme0: failing outstanding i/o
>
> … et cetera.
>
> Is this a sure sign of a hardware problem? Or must I do something
> special to gain reliability under stress?

It's most likely a hardware problem. That said, I've been working on
patches to make recovery better when errors like this happen.

> I don't know how to interpret parts of the manual page for nvme(4).
> There's direction to include this line in loader.conf(5):
>
> nvme_load="YES"
>
> – however when I used kldload(8), it seemed that the module was already
> loaded, or in kernel.

Yes. If you are using it at all, you have the driver.

> Using StressDisk:
>
> – failures typically occur after around six minutes of testing.

Do you have a number of these drives, or is it just this one bad apple?

> The drive is very new, less than 2 TB written:
>
> I do suspect a hardware problem, because two prior installations of
> Windows 10 became non-bootable.

That's likely a huge red flag.

> Also: I find peculiarities with use of fsck_ffs(8), which I can describe
> later. Maybe to be expected, if there's a problem with the drive.

You can ask Kirk, but if data isn't written to the drive when the firmware
crashes, then there may be data loss.

Warner
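
For readers hitting the same symptom, a minimal sketch of how one might check
the drive's own error reporting and experiment with the driver timeout Warner
mentions. The timeout sysctl name below is an assumption for illustration; the
knobs nvme(4) exposes differ between FreeBSD versions, so list what your
system actually has before changing anything:

  # Pull the controller's own health and error log pages (nvmecontrol is in base).
  nvmecontrol logpage -p 2 nvme0    # SMART / health information log
  nvmecontrol logpage -p 1 nvme0    # error information log

  # See which timeout-related knobs this FreeBSD version exposes for the controller.
  sysctl dev.nvme.0 | grep -i timeout

  # Assumed knob name, example only: if a timeout_period sysctl is present,
  # raising it gives a slow or flaky drive longer to answer before the driver
  # resets the controller.
  sysctl dev.nvme.0.timeout_period=60

None of this will rescue a drive whose firmware is crashing, but the log pages
usually make it obvious whether the device itself is reporting media or
internal errors.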