Date:      Wed, 22 Jan 2020 23:28:22 +0000
From:      bugzilla-noreply@freebsd.org
To:        bugs@FreeBSD.org
Subject:   [Bug 243531] Unstable ena and nvme on AWS
Message-ID:  <bug-243531-227@https.bugs.freebsd.org/bugzilla/>

https://bugs.freebsd.org/bugzilla/show_bug.cgi?id=243531

            Bug ID: 243531
           Summary: Unstable ena and nvme on AWS
           Product: Base System
           Version: 12.1-RELEASE
          Hardware: amd64
                OS: Any
            Status: New
          Severity: Affects Some People
          Priority: ---
         Component: kern
          Assignee: bugs@FreeBSD.org
          Reporter: leif@ofWilsonCreek.com

We recently upgraded our systems on AWS to 12.1, and we're seeing errors
with nvme0 and ena0. Typically, these errors appear together. My sample size
is 34 instances, all t3.medium or r4.large. I have others that are t2.small,
and those are fine because they don't use the ena or nvme drivers. Of the 34,
a group of 15 is almost idle and never throws these errors. Of the remaining
19 that carry load, only 5 ran without errors; the other 14 threw these
errors at various times over the last 4.5 days. 3 machines crashed during
this period, so the errors are often nonfatal, but not always. Based on that,
the problem seems load related and does not seem isolated to an occasional
hardware problem.
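
For reference, the breakdown above can be checked with a few lines of Python. This is only a sketch of the arithmetic using the counts from this report; the variable names are illustrative and not part of any tool.

```python
# Counts taken from the report above; names are illustrative only.
total = 34           # t3.medium / r4.large instances running 12.1
idle = 15            # nearly idle, never logged the errors
loaded_clean = 5     # under load, no errors in the 4.5-day window
loaded_errors = 14   # under load, logged nvme0/ena0 errors
crashed = 3          # crashed during the same window

loaded = total - idle
# The two loaded groups should account for every non-idle instance.
assert loaded == loaded_clean + loaded_errors

error_rate_loaded = loaded_errors / loaded
crash_rate_total = crashed / total

print(f"instances under load: {loaded}")
print(f"share of loaded instances with errors: {error_rate_loaded:.0%}")
print(f"share of all instances that crashed: {crash_rate_total:.0%}")
```

None of the idle instances logged errors, while most of the loaded ones did, which is why the problem looks load related.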

Here's a sample of the log from one of the machines that crashed:

Jan 22 00:17:05 jdas-dev kernel: nvme0: cpl does not map to outstanding cmd
Jan 22 00:17:05 jdas-dev kernel: cdw0:00000000 sqhd:000d sqid:0001 cid:0012 p:1 sc:00 sct:0 m:0 dnr:0
Jan 22 00:17:05 jdas-dev kernel: nvme0: Resetting controller due to a timeout.
Jan 22 00:17:05 jdas-dev kernel: nvme0: resetting controller
Jan 22 00:17:05 jdas-dev kernel: nvme0: temperature threshold not supported
Jan 22 00:17:05 jdas-dev kernel: nvme0: aborting outstanding i/o
Jan 22 00:17:05 jdas-dev kernel: nvme0: resubmitting queued i/o
Jan 22 00:17:05 jdas-dev kernel: nvme0: WRITE sqid:2 cid:0 nsid:1 lba:691756520 len:8
Jan 22 00:17:21 jdas-dev kernel: ena0: The number of lost tx completion is above the threshold (129 > 128). Reset the device
Jan 22 00:17:21 jdas-dev kernel: ena0: Trigger reset is on
Jan 22 00:17:21 jdas-dev kernel: ena0: device is going DOWN
Jan 22 00:17:21 jdas-dev kernel: ena0: free uncompleted tx mbuf qid 0 idx 0x154
Jan 22 00:17:22 jdas-dev kernel: ena0: device is going UP
Jan 22 00:17:22 jdas-dev kernel: ena0: link is UP
Jan 22 00:17:52 jdas-dev kernel: nvme0: Missing interrupt
Jan 22 00:18:46 jdas-dev kernel: nvme0: Missing interrupt
Jan 22 00:19:34 jdas-dev kernel: nvme0: cpl does not map to outstanding cmd
Jan 22 00:19:34 jdas-dev kernel: cdw0:00000000 sqhd:001a sqid:0002 cid:001b p:0 sc:00 sct:0 m:0 dnr:0
Jan 22 00:19:34 jdas-dev kernel: nvme0: Resetting controller due to a timeout.
Jan 22 00:19:34 jdas-dev kernel: nvme0: resetting controller
Jan 22 00:19:35 jdas-dev kernel: nvme0: temperature threshold not supported
Jan 22 00:19:35 jdas-dev kernel: nvme0: resubmitting queued i/o
Jan 22 00:19:35 jdas-dev kernel: nvme0: WRITE sqid:1 cid:0 nsid:1 lba:405055248 len:8
Jan 22 00:19:35 jdas-dev kernel: nvme0: aborting outstanding i/o

At this point, we rebooted the machine.

Jan 22 09:02:30 jdas-dev kernel: nvme0: Resetting controller due to a timeout.
Jan 22 09:02:30 jdas-dev kernel: nvme0: resetting controller
Jan 22 09:02:30 jdas-dev kernel: nvme0: temperature threshold not supported
Jan 22 09:02:30 jdas-dev kernel: nvme0: aborting outstanding i/o
Jan 22 09:02:30 jdas-dev kernel: nvme0: DATASET MANAGEMENT sqid:2 cid:27 nsid:0
Jan 22 09:02:30 jdas-dev kernel: nvme0: INVALID OPCODE (00/01) sqid:2 cid:27 cdw0:0
Jan 22 09:02:30 jdas-dev kernel: ena0: Keep alive watchdog timeout.
Jan 22 09:02:30 jdas-dev kernel: ena0: Trigger reset is on
Jan 22 09:02:30 jdas-dev kernel: ena0: device is going DOWN
Jan 22 09:02:30 jdas-dev kernel: ena0: ena0: device is going UP
Jan 22 09:02:30 jdas-dev kernel: link is UP
Jan 22 09:02:30 jdas-dev kernel: ena0: Keep alive watchdog timeout.
Jan 22 09:02:30 jdas-dev kernel: ena0: Trigger reset is on
Jan 22 09:02:30 jdas-dev kernel: ena0: device is going DOWN
Jan 22 09:02:30 jdas-dev kernel: ena0: ena0: device is going UP
Jan 22 09:02:30 jdas-dev kernel: link is UP
Jan 22 09:02:30 jdas-dev kernel: swap_pager: indefinite wait buffer: bufobj: 0, blkno: 312763, size: 4096
Jan 22 09:02:30 jdas-dev kernel: 90 second watchdog timeout expired. Shutdown terminated.
Jan 22 09:02:30 jdas-dev kernel: Wed Jan 22 08:58:58 CST 2020
Jan 22 09:02:30 jdas-dev kernel: 2020-01-22T08:59:01.658827-06:00 jdas-dev.aws0.pla-net.cc init 1 - - /etc/rc.shutdown terminated abnormally, going to single user mode
Jan 22 09:02:30 jdas-dev kernel: swap_pager: indefinite wait buffer: bufobj: 0, blkno: 312763, size: 4096
Jan 22 09:02:30 jdas-dev kernel: 2020-01-22T08:59:23.108170-06:00 jdas-dev.aws0.pla-net.cc init 1 - - some processes would not die; ps axl advised
Jan 22 09:02:30 jdas-dev kernel: swap_pager: indefinite wait buffer: bufobj=
: 0,
blkno: 445, size: 12288
Jan 22 09:02:30 jdas-dev kernel: swap_pager: indefinite wait buffer: bufobj=
: 0,
blkno: 21337, size: 4096
Jan 22 09:02:30 jdas-dev kernel: swap_pager: indefinite wait buffer: bufobj=
: 0,
blkno: 372873, size: 4096
Jan 22 09:02:30 jdas-dev kernel: swap_pager: indefinite wait buffer: bufobj=
: 0,
blkno: 21854, size: 57344
Jan 22 09:02:30 jdas-dev kernel: swap_pager: indefinite wait buffer: bufobj=
: 0,
blkno: 118775, size: 8192
Jan 22 09:02:30 jdas-dev kernel: swap_pager: indefinite wait buffer: bufobj=
: 0,
blkno: 315370, size: 4096
Jan 22 09:02:30 jdas-dev kernel: swap_pager: indefinite wait buffer: bufobj=
: 0,
blkno: 312763, size: 4096
Jan 22 09:02:30 jdas-dev kernel: swap_pager: indefinite wait buffer: bufobj=
: 0,
blkno: 445, size: 12288
Jan 22 09:02:30 jdas-dev kernel: swap_pager: indefinite wait buffer: bufobj=
: 0,
blkno: 21337, size: 4096
Jan 22 09:02:30 jdas-dev kernel: swap_pager: indefinite wait buffer: bufobj=
: 0,
blkno: 372873, size: 4096
Jan 22 09:02:30 jdas-dev kernel: swap_pager: indefinite wait buffer: bufobj=
: 0,
blkno: 21854, size: 57344
Jan 22 09:02:30 jdas-dev kernel: swap_pager: indefinite wait buffer: bufobj=
: 0,
blkno: 118775, size: 8192
Jan 22 09:02:30 jdas-dev kernel: swap_pager: indefinite wait buffer: bufobj=
: 0,
blkno: 315370, size: 4096
Jan 22 09:02:30 jdas-dev kernel: swap_pager: indefinite wait buffer: bufobj=
: 0,
blkno: 312763, size: 4096
Jan 22 09:02:30 jdas-dev kernel: swap_pager: indefinite wait buffer: bufobj=
: 0,
blkno: 445, size: 12288
Jan 22 09:02:30 jdas-dev kernel: swap_pager: indefinite wait buffer: bufobj=
: 0,
blkno: 21337, size: 4096
Jan 22 09:02:30 jdas-dev kernel: swap_pager: indefinite wait buffer: bufobj=
: 0,
blkno: 21854, size: 57344
Jan 22 09:02:30 jdas-dev kernel: swap_pager: indefinite wait buffer: bufobj=
: 0,
blkno: 372873, size: 4096
Jan 22 09:02:30 jdas-dev kernel: swap_pager: indefinite wait buffer: bufobj=
: 0,
blkno: 118775, size: 8192
Jan 22 09:02:30 jdas-dev kernel: swap_pager: indefinite wait buffer: bufobj=
: 0,
blkno: 315370, size: 4096
Jan 22 09:02:30 jdas-dev kernel: swap_pager: indefinite wait buffer: bufobj=
: 0,
blkno: 312763, size: 4096
Jan 22 09:02:30 jdas-dev kernel: swap_pager: indefinite wait buffer: bufobj=
: 0,
blkno: 445, size: 12288
Jan 22 09:02:30 jdas-dev kernel: swap_pager: indefinite wait buffer: bufobj=
: 0,
blkno: 21337, size: 4096
Jan 22 09:02:30 jdas-dev kernel: swap_pager: indefinite wait buffer: bufobj=
: 0,
blkno: 21854, size: 57344
Jan 22 09:02:30 jdas-dev kernel: swap_pager: indefinite wait buffer: bufobj=
: 0,
blkno: 372873, size: 4096
Jan 22 09:02:30 jdas-dev kernel: swap_pager: indefinite wait buffer: bufobj=
: 0,
blkno: 118775, size: 8192
Jan 22 09:02:30 jdas-dev kernel: swap_pager: indefinite wait buffer: bufobj=
: 0,
blkno: 312763, size: 4096
Jan 22 09:02:30 jdas-dev kernel: swap_pager: indefinite wait buffer: bufobj=
: 0,
blkno: 315370, size: 4096
Jan 22 09:02:30 jdas-dev kernel: swap_pager: indefinite wait buffer: bufobj=
: 0,
blkno: 445, size: 12288
Jan 22 09:02:30 jdas-dev kernel: swap_pager: indefinite wait buffer: bufobj=
: 0,
blkno: 21337, size: 4096
Jan 22 09:02:30 jdas-dev kernel: swap_pager: indefinite wait buffer: bufobj=
: 0,
blkno: 21854, size: 57344
Jan 22 09:02:30 jdas-dev kernel: swap_pager: indefinite wait buffer: bufobj=
: 0,
blkno: 372873, size: 4096
Jan 22 09:02:30 jdas-dev kernel: swap_pager: indefinite wait buffer: bufobj=
: 0,
blkno: 118775, size: 8192
Jan 22 09:02:30 jdas-dev kernel: swap_pager: indefinite wait buffer: bufobj=
: 0,
blkno: 315370, size: 4096
Jan 22 09:02:30 jdas-dev kernel: swap_pager: indefinite wait buffer: bufobj=
: 0,
blkno: 312763, size: 4096
Jan 22 09:02:30 jdas-dev kernel: swap_pager: indefinite wait buffer: bufobj=
: 0,
blkno: 445, size: 12288
Jan 22 09:02:30 jdas-dev kernel: swap_pager: indefinite wait buffer: bufobj=
: 0,
blkno: 21337, size: 4096
Jan 22 09:02:30 jdas-dev kernel: swap_pager: indefinite wait buffer: bufobj=
: 0,
blkno: 21854, size: 57344
Jan 22 09:02:30 jdas-dev kernel: swap_pager: indefinite wait buffer: bufobj=
: 0,
blkno: 372873, size: 4096
Jan 22 09:02:30 jdas-dev kernel: swap_pager: indefinite wait buffer: bufobj=
: 0,
blkno: 118775, size: 8192
Jan 22 09:02:30 jdas-dev kernel: swap_pager: indefinite wait buffer: bufobj=
: 0,
blkno: 315370, size: 4096
Jan 22 09:02:30 jdas-dev kernel: swap_pager: indefinite wait buffer: bufobj=
: 0,
blkno: 312763, size: 4096
Jan 22 09:02:30 jdas-dev kernel: swap_pager: indefinite wait buffer: bufobj=
: 0,
blkno: 445, size: 12288
Jan 22 09:02:30 jdas-dev kernel: swap_pager: indefinite wait buffer: bufobj=
: 0,
blkno: 21337, size: 4096
Jan 22 09:02:30 jdas-dev kernel: ---<<BOOT>>---
Jan 22 09:02:30 jdas-dev kernel: Copyright (c) 1992-2019 The FreeBSD Project.
Jan 22 09:02:30 jdas-dev kernel: Copyright (c) 1979, 1980, 1983, 1986, 1988, 1989, 1991, 1992, 1993, 1994
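
To compare machines, it may help to tally how often each of these messages shows up per host. The following is a minimal Python sketch, not an existing tool: the `SIGNATURES` table and the `count_events` helper are names made up for illustration, and the message fragments are copied from the excerpts above. The sample lines are taken from the first excerpt.

```python
from collections import Counter

# Message fragments copied from the log excerpts in this report.
# The keys are illustrative labels, not driver or kernel identifiers.
SIGNATURES = {
    "nvme_timeout_reset": "Resetting controller due to a timeout",
    "nvme_missing_interrupt": "Missing interrupt",
    "nvme_unmapped_completion": "cpl does not map to outstanding cmd",
    "ena_lost_tx_completion": "lost tx completion",
    "ena_keepalive_timeout": "Keep alive watchdog timeout",
    "swap_pager_wait": "swap_pager: indefinite wait buffer",
}


def count_events(lines):
    """Count how many lines contain each known message fragment."""
    counts = Counter()
    for line in lines:
        for name, fragment in SIGNATURES.items():
            if fragment in line:
                counts[name] += 1
    return counts


# A few lines from the first excerpt, used here as sample input.
sample = [
    "Jan 22 00:17:05 jdas-dev kernel: nvme0: Resetting controller due to a timeout.",
    "Jan 22 00:17:21 jdas-dev kernel: ena0: The number of lost tx completion is "
    "above the threshold (129 > 128). Reset the device",
    "Jan 22 00:17:52 jdas-dev kernel: nvme0: Missing interrupt",
    "Jan 22 00:18:46 jdas-dev kernel: nvme0: Missing interrupt",
]

counts = count_events(sample)
for name, n in counts.most_common():
    print(f"{n:4d}  {name}")
```

To run it against a real host, pass an open log file instead of `sample` (an open file iterates line by line); on a default FreeBSD install the kernel messages normally land in /var/log/messages.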

--
You are receiving this mail because:
You are the assignee for the bug.