Date: Wed, 22 Jan 2020 23:28:22 +0000 From: bugzilla-noreply@freebsd.org To: bugs@FreeBSD.org Subject: [Bug 243531] Unstable ena and nvme on AWS Message-ID: <bug-243531-227@https.bugs.freebsd.org/bugzilla/>
index | next in thread | raw e-mail
https://bugs.freebsd.org/bugzilla/show_bug.cgi?id=243531 Bug ID: 243531 Summary: Unstable ena and nvme on AWS Product: Base System Version: 12.1-RELEASE Hardware: amd64 OS: Any Status: New Severity: Affects Some People Priority: --- Component: kern Assignee: bugs@FreeBSD.org Reporter: leif@ofWilsonCreek.com We just recently upgraded our systems on AWS to 12.1, and we're seeing errors with nvme0 and ena0. Typically, these errors manifest together. My sample size is 34 instances, all t3.medium or r4.large. I have others that are t2.small, and those are fine being that they don't use ena or nvme drivers. Of the 34, there's a group of 15 that are almost idle, and never throw these errors. Of the remaining that have load, only 5 ran without errors and 14 of them threw these errors scattered about various times in the last 4.5 days. 3 machines have crashed during this period, so the errors seem to often be nonfatal, but not always. So based on that it seems load related, and does not seem isolated to an occasional hardware problem. Here's a sample of the log from one of the machines that crashed: Jan 22 00:17:05 jdas-dev kernel: nvme0: cpl does not map to outstanding cmd Jan 22 00:17:05 jdas-dev kernel: cdw0:00000000 sqhd:000d sqid:0001 cid:0012 p:1 sc:00 sct:0 m:0 dnr:0 Jan 22 00:17:05 jdas-dev kernel: nvme0: Resetting controller due to a timeout. Jan 22 00:17:05 jdas-dev kernel: nvme0: resetting controller Jan 22 00:17:05 jdas-dev kernel: nvme0: temperature threshold not supported Jan 22 00:17:05 jdas-dev kernel: nvme0: aborting outstanding i/o Jan 22 00:17:05 jdas-dev kernel: nvme0: resubmitting queued i/o Jan 22 00:17:05 jdas-dev kernel: nvme0: WRITE sqid:2 cid:0 nsid:1 lba:691756520 len:8 Jan 22 00:17:21 jdas-dev kernel: ena0: The number of lost tx completion is above the threshold (129 > 128). Reset the device Jan 22 00:17:21 jdas-dev kernel: ena0: Trigger reset is on Jan 22 00:17:21 jdas-dev kernel: ena0: device is going DOWN Jan 22 00:17:21 jdas-dev kernel: ena0: free uncompleted tx mbuf qid 0 idx 0x154 Jan 22 00:17:22 jdas-dev kernel: ena0: device is going UP Jan 22 00:17:22 jdas-dev kernel: ena0: link is UP Jan 22 00:17:52 jdas-dev kernel: nvme0: Missing interrupt Jan 22 00:18:46 jdas-dev kernel: nvme0: Missing interrupt Jan 22 00:19:34 jdas-dev kernel: nvme0: cpl does not map to outstanding cmd Jan 22 00:19:34 jdas-dev kernel: cdw0:00000000 sqhd:001a sqid:0002 cid:001b p:0 sc:00 sct:0 m:0 dnr:0 Jan 22 00:19:34 jdas-dev kernel: nvme0: Resetting controller due to a timeout. Jan 22 00:19:34 jdas-dev kernel: nvme0: resetting controller Jan 22 00:19:35 jdas-dev kernel: nvme0: temperature threshold not supported Jan 22 00:19:35 jdas-dev kernel: nvme0: resubmitting queued i/o Jan 22 00:19:35 jdas-dev kernel: nvme0: WRITE sqid:1 cid:0 nsid:1 lba:405055248 len:8 Jan 22 00:19:35 jdas-dev kernel: nvme0: aborting outstanding i/o At this point, we rebooted the machine. Jan 22 09:02:30 jdas-dev kernel: nvme0: Resetting controller due to a timeout. Jan 22 09:02:30 jdas-dev kernel: nvme0: resetting controller Jan 22 09:02:30 jdas-dev kernel: nvme0: temperature threshold not supported Jan 22 09:02:30 jdas-dev kernel: nvme0: aborting outstanding i/o Jan 22 09:02:30 jdas-dev kernel: nvme0: DATASET MANAGEMENT sqid:2 cid:27 nsid:0 Jan 22 09:02:30 jdas-dev kernel: nvme0: INVALID OPCODE (00/01) sqid:2 cid:27 cdw0:0 Jan 22 09:02:30 jdas-dev kernel: ena0: Keep alive watchdog timeout. Jan 22 09:02:30 jdas-dev kernel: ena0: Trigger reset is on Jan 22 09:02:30 jdas-dev kernel: ena0: device is going DOWN Jan 22 09:02:30 jdas-dev kernel: ena0: ena0: device is going UP Jan 22 09:02:30 jdas-dev kernel: link is UP Jan 22 09:02:30 jdas-dev kernel: ena0: Keep alive watchdog timeout. Jan 22 09:02:30 jdas-dev kernel: ena0: Trigger reset is on Jan 22 09:02:30 jdas-dev kernel: ena0: device is going DOWN Jan 22 09:02:30 jdas-dev kernel: ena0: ena0: device is going UP Jan 22 09:02:30 jdas-dev kernel: link is UP Jan 22 09:02:30 jdas-dev kernel: swap_pager: indefinite wait buffer: bufobj: 0, blkno: 312763, size: 4096 Jan 22 09:02:30 jdas-dev kernel: 90 second watchdog timeout expired. Shutdown terminated. Jan 22 09:02:30 jdas-dev kernel: Wed Jan 22 08:58:58 CST 2020 Jan 22 09:02:30 jdas-dev kernel: 2020-01-22T08:59:01.658827-06:00 jdas-dev.aws0.pla-net.cc init 1 - - /etc/rc.shutdown terminated abnormally, going to single user mode Jan 22 09:02:30 jdas-dev kernel: swap_pager: indefinite wait buffer: bufobj: 0, blkno: 312763, size: 4096 Jan 22 09:02:30 jdas-dev kernel: 2020-01-22T08:59:23.108170-06:00 jdas-dev.aws0.pla-net.cc init 1 - - some processes would not die; ps axl advised Jan 22 09:02:30 jdas-dev kernel: swap_pager: indefinite wait buffer: bufobj: 0, blkno: 445, size: 12288 Jan 22 09:02:30 jdas-dev kernel: swap_pager: indefinite wait buffer: bufobj: 0, blkno: 21337, size: 4096 Jan 22 09:02:30 jdas-dev kernel: swap_pager: indefinite wait buffer: bufobj: 0, blkno: 372873, size: 4096 Jan 22 09:02:30 jdas-dev kernel: swap_pager: indefinite wait buffer: bufobj: 0, blkno: 21854, size: 57344 Jan 22 09:02:30 jdas-dev kernel: swap_pager: indefinite wait buffer: bufobj: 0, blkno: 118775, size: 8192 Jan 22 09:02:30 jdas-dev kernel: swap_pager: indefinite wait buffer: bufobj: 0, blkno: 315370, size: 4096 Jan 22 09:02:30 jdas-dev kernel: swap_pager: indefinite wait buffer: bufobj: 0, blkno: 312763, size: 4096 Jan 22 09:02:30 jdas-dev kernel: swap_pager: indefinite wait buffer: bufobj: 0, blkno: 445, size: 12288 Jan 22 09:02:30 jdas-dev kernel: swap_pager: indefinite wait buffer: bufobj: 0, blkno: 21337, size: 4096 Jan 22 09:02:30 jdas-dev kernel: swap_pager: indefinite wait buffer: bufobj: 0, blkno: 372873, size: 4096 Jan 22 09:02:30 jdas-dev kernel: swap_pager: indefinite wait buffer: bufobj: 0, blkno: 21854, size: 57344 Jan 22 09:02:30 jdas-dev kernel: swap_pager: indefinite wait buffer: bufobj: 0, blkno: 118775, size: 8192 Jan 22 09:02:30 jdas-dev kernel: swap_pager: indefinite wait buffer: bufobj: 0, blkno: 315370, size: 4096 Jan 22 09:02:30 jdas-dev kernel: swap_pager: indefinite wait buffer: bufobj: 0, blkno: 312763, size: 4096 Jan 22 09:02:30 jdas-dev kernel: swap_pager: indefinite wait buffer: bufobj: 0, blkno: 445, size: 12288 Jan 22 09:02:30 jdas-dev kernel: swap_pager: indefinite wait buffer: bufobj: 0, blkno: 21337, size: 4096 Jan 22 09:02:30 jdas-dev kernel: swap_pager: indefinite wait buffer: bufobj: 0, blkno: 21854, size: 57344 Jan 22 09:02:30 jdas-dev kernel: swap_pager: indefinite wait buffer: bufobj: 0, blkno: 372873, size: 4096 Jan 22 09:02:30 jdas-dev kernel: swap_pager: indefinite wait buffer: bufobj: 0, blkno: 118775, size: 8192 Jan 22 09:02:30 jdas-dev kernel: swap_pager: indefinite wait buffer: bufobj: 0, blkno: 315370, size: 4096 Jan 22 09:02:30 jdas-dev kernel: swap_pager: indefinite wait buffer: bufobj: 0, blkno: 312763, size: 4096 Jan 22 09:02:30 jdas-dev kernel: swap_pager: indefinite wait buffer: bufobj: 0, blkno: 445, size: 12288 Jan 22 09:02:30 jdas-dev kernel: swap_pager: indefinite wait buffer: bufobj: 0, blkno: 21337, size: 4096 Jan 22 09:02:30 jdas-dev kernel: swap_pager: indefinite wait buffer: bufobj: 0, blkno: 21854, size: 57344 Jan 22 09:02:30 jdas-dev kernel: swap_pager: indefinite wait buffer: bufobj: 0, blkno: 372873, size: 4096 Jan 22 09:02:30 jdas-dev kernel: swap_pager: indefinite wait buffer: bufobj: 0, blkno: 118775, size: 8192 Jan 22 09:02:30 jdas-dev kernel: swap_pager: indefinite wait buffer: bufobj: 0, blkno: 312763, size: 4096 Jan 22 09:02:30 jdas-dev kernel: swap_pager: indefinite wait buffer: bufobj: 0, blkno: 315370, size: 4096 Jan 22 09:02:30 jdas-dev kernel: swap_pager: indefinite wait buffer: bufobj: 0, blkno: 445, size: 12288 Jan 22 09:02:30 jdas-dev kernel: swap_pager: indefinite wait buffer: bufobj: 0, blkno: 21337, size: 4096 Jan 22 09:02:30 jdas-dev kernel: swap_pager: indefinite wait buffer: bufobj: 0, blkno: 21854, size: 57344 Jan 22 09:02:30 jdas-dev kernel: swap_pager: indefinite wait buffer: bufobj: 0, blkno: 372873, size: 4096 Jan 22 09:02:30 jdas-dev kernel: swap_pager: indefinite wait buffer: bufobj: 0, blkno: 118775, size: 8192 Jan 22 09:02:30 jdas-dev kernel: swap_pager: indefinite wait buffer: bufobj: 0, blkno: 315370, size: 4096 Jan 22 09:02:30 jdas-dev kernel: swap_pager: indefinite wait buffer: bufobj: 0, blkno: 312763, size: 4096 Jan 22 09:02:30 jdas-dev kernel: swap_pager: indefinite wait buffer: bufobj: 0, blkno: 445, size: 12288 Jan 22 09:02:30 jdas-dev kernel: swap_pager: indefinite wait buffer: bufobj: 0, blkno: 21337, size: 4096 Jan 22 09:02:30 jdas-dev kernel: swap_pager: indefinite wait buffer: bufobj: 0, blkno: 21854, size: 57344 Jan 22 09:02:30 jdas-dev kernel: swap_pager: indefinite wait buffer: bufobj: 0, blkno: 372873, size: 4096 Jan 22 09:02:30 jdas-dev kernel: swap_pager: indefinite wait buffer: bufobj: 0, blkno: 118775, size: 8192 Jan 22 09:02:30 jdas-dev kernel: swap_pager: indefinite wait buffer: bufobj: 0, blkno: 315370, size: 4096 Jan 22 09:02:30 jdas-dev kernel: swap_pager: indefinite wait buffer: bufobj: 0, blkno: 312763, size: 4096 Jan 22 09:02:30 jdas-dev kernel: swap_pager: indefinite wait buffer: bufobj: 0, blkno: 445, size: 12288 Jan 22 09:02:30 jdas-dev kernel: swap_pager: indefinite wait buffer: bufobj: 0, blkno: 21337, size: 4096 Jan 22 09:02:30 jdas-dev kernel: ---<<BOOT>>--- Jan 22 09:02:30 jdas-dev kernel: Copyright (c) 1992-2019 The FreeBSD Project. Jan 22 09:02:30 jdas-dev kernel: Copyright (c) 1979, 1980, 1983, 1986, 1988, 1989, 1991, 1992, 1993, 1994 -- You are receiving this mail because: You are the assignee for the bug.home | help
Want to link to this message? Use this
URL: <https://mail-archive.FreeBSD.org/cgi/mid.cgi?bug-243531-227>
