Date: Wed, 22 Jan 2020 23:28:22 +0000
From: bugzilla-noreply@freebsd.org
To: bugs@FreeBSD.org
Subject: [Bug 243531] Unstable ena and nvme on AWS
Message-ID: <bug-243531-227@https.bugs.freebsd.org/bugzilla/>

https://bugs.freebsd.org/bugzilla/show_bug.cgi?id=243531

            Bug ID: 243531
           Summary: Unstable ena and nvme on AWS
           Product: Base System
           Version: 12.1-RELEASE
          Hardware: amd64
                OS: Any
            Status: New
          Severity: Affects Some People
          Priority: ---
         Component: kern
          Assignee: bugs@FreeBSD.org
          Reporter: leif@ofWilsonCreek.com

We recently upgraded our systems on AWS to 12.1, and we're seeing errors with
nvme0 and ena0. Typically, these errors manifest together. My sample size is 34
instances, all t3.medium or r4.large. I have others that are t2.small, and
those are fine, since they don't use the ena or nvme drivers.

Of the 34, there's a group of 15 that are almost idle and never throw these
errors. Of the remaining 19 that have load, only 5 ran without errors, and 14
threw these errors at various times over the last 4.5 days. 3 machines have
crashed during this period, so the errors seem to be nonfatal most of the time,
but not always. Based on that, the problem seems load related, and does not
seem isolated to an occasional hardware problem.

Here's a sample of the log from one of the machines that crashed:

Jan 22 00:17:05 jdas-dev kernel: nvme0: cpl does not map to outstanding cmd
Jan 22 00:17:05 jdas-dev kernel: cdw0:00000000 sqhd:000d sqid:0001 cid:0012 p:1 sc:00 sct:0 m:0 dnr:0
Jan 22 00:17:05 jdas-dev kernel: nvme0: Resetting controller due to a timeout.
Jan 22 00:17:05 jdas-dev kernel: nvme0: resetting controller
Jan 22 00:17:05 jdas-dev kernel: nvme0: temperature threshold not supported
Jan 22 00:17:05 jdas-dev kernel: nvme0: aborting outstanding i/o
Jan 22 00:17:05 jdas-dev kernel: nvme0: resubmitting queued i/o
Jan 22 00:17:05 jdas-dev kernel: nvme0: WRITE sqid:2 cid:0 nsid:1 lba:691756520 len:8
Jan 22 00:17:21 jdas-dev kernel: ena0: The number of lost tx completion is above the threshold (129 > 128). Reset the device
Jan 22 00:17:21 jdas-dev kernel: ena0: Trigger reset is on
Jan 22 00:17:21 jdas-dev kernel: ena0: device is going DOWN
Jan 22 00:17:21 jdas-dev kernel: ena0: free uncompleted tx mbuf qid 0 idx 0x154
Jan 22 00:17:22 jdas-dev kernel: ena0: device is going UP
Jan 22 00:17:22 jdas-dev kernel: ena0: link is UP
Jan 22 00:17:52 jdas-dev kernel: nvme0: Missing interrupt
Jan 22 00:18:46 jdas-dev kernel: nvme0: Missing interrupt
Jan 22 00:19:34 jdas-dev kernel: nvme0: cpl does not map to outstanding cmd
Jan 22 00:19:34 jdas-dev kernel: cdw0:00000000 sqhd:001a sqid:0002 cid:001b p:0 sc:00 sct:0 m:0 dnr:0
Jan 22 00:19:34 jdas-dev kernel: nvme0: Resetting controller due to a timeout.
Jan 22 00:19:34 jdas-dev kernel: nvme0: resetting controller
Jan 22 00:19:35 jdas-dev kernel: nvme0: temperature threshold not supported
Jan 22 00:19:35 jdas-dev kernel: nvme0: resubmitting queued i/o
Jan 22 00:19:35 jdas-dev kernel: nvme0: WRITE sqid:1 cid:0 nsid:1 lba:405055248 len:8
Jan 22 00:19:35 jdas-dev kernel: nvme0: aborting outstanding i/o

At this point, we rebooted the machine.

Jan 22 09:02:30 jdas-dev kernel: nvme0: Resetting controller due to a timeout.
Jan 22 09:02:30 jdas-dev kernel: nvme0: resetting controller
Jan 22 09:02:30 jdas-dev kernel: nvme0: temperature threshold not supported
Jan 22 09:02:30 jdas-dev kernel: nvme0: aborting outstanding i/o
Jan 22 09:02:30 jdas-dev kernel: nvme0: DATASET MANAGEMENT sqid:2 cid:27 nsid:0
Jan 22 09:02:30 jdas-dev kernel: nvme0: INVALID OPCODE (00/01) sqid:2 cid:27 cdw0:0
Jan 22 09:02:30 jdas-dev kernel: ena0: Keep alive watchdog timeout.
Jan 22 09:02:30 jdas-dev kernel: ena0: Trigger reset is on
Jan 22 09:02:30 jdas-dev kernel: ena0: device is going DOWN
Jan 22 09:02:30 jdas-dev kernel: ena0: ena0: device is going UP
Jan 22 09:02:30 jdas-dev kernel: link is UP
Jan 22 09:02:30 jdas-dev kernel: ena0: Keep alive watchdog timeout.
Jan 22 09:02:30 jdas-dev kernel: ena0: Trigger reset is on
Jan 22 09:02:30 jdas-dev kernel: ena0: device is going DOWN
Jan 22 09:02:30 jdas-dev kernel: ena0: ena0: device is going UP
Jan 22 09:02:30 jdas-dev kernel: link is UP
Jan 22 09:02:30 jdas-dev kernel: swap_pager: indefinite wait buffer: bufobj: 0, blkno: 312763, size: 4096
Jan 22 09:02:30 jdas-dev kernel: 90 second watchdog timeout expired. Shutdown terminated.
Jan 22 09:02:30 jdas-dev kernel: Wed Jan 22 08:58:58 CST 2020
Jan 22 09:02:30 jdas-dev kernel: 2020-01-22T08:59:01.658827-06:00 jdas-dev.aws0.pla-net.cc init 1 - - /etc/rc.shutdown terminated abnormally, going to single user mode
Jan 22 09:02:30 jdas-dev kernel: swap_pager: indefinite wait buffer: bufobj: 0, blkno: 312763, size: 4096
Jan 22 09:02:30 jdas-dev kernel: 2020-01-22T08:59:23.108170-06:00 jdas-dev.aws0.pla-net.cc init 1 - - some processes would not die; ps axl advised
Jan 22 09:02:30 jdas-dev kernel: swap_pager: indefinite wait buffer: bufobj: 0, blkno: 445, size: 12288
Jan 22 09:02:30 jdas-dev kernel: swap_pager: indefinite wait buffer: bufobj: 0, blkno: 21337, size: 4096
Jan 22 09:02:30 jdas-dev kernel: swap_pager: indefinite wait buffer: bufobj: 0, blkno: 372873, size: 4096
Jan 22 09:02:30 jdas-dev kernel: swap_pager: indefinite wait buffer: bufobj: 0, blkno: 21854, size: 57344
Jan 22 09:02:30 jdas-dev kernel: swap_pager: indefinite wait buffer: bufobj: 0, blkno: 118775, size: 8192
Jan 22 09:02:30 jdas-dev kernel: swap_pager: indefinite wait buffer: bufobj: 0, blkno: 315370, size: 4096
Jan 22 09:02:30 jdas-dev kernel: swap_pager: indefinite wait buffer: bufobj: 0, blkno: 312763, size: 4096
Jan 22 09:02:30 jdas-dev kernel: swap_pager: indefinite wait buffer: bufobj: 0, blkno: 445, size: 12288
Jan 22 09:02:30 jdas-dev kernel: swap_pager: indefinite wait buffer: bufobj: 0, blkno: 21337, size: 4096
Jan 22 09:02:30 jdas-dev kernel: swap_pager: indefinite wait buffer: bufobj: 0, blkno: 372873, size: 4096
Jan 22 09:02:30 jdas-dev kernel: swap_pager: indefinite wait buffer: bufobj: 0, blkno: 21854, size: 57344
Jan 22 09:02:30 jdas-dev kernel: swap_pager: indefinite wait buffer: bufobj: 0, blkno: 118775, size: 8192
Jan 22 09:02:30 jdas-dev kernel: swap_pager: indefinite wait buffer: bufobj: 0, blkno: 315370, size: 4096
Jan 22 09:02:30 jdas-dev kernel: swap_pager: indefinite wait buffer: bufobj: 0, blkno: 312763, size: 4096
Jan 22 09:02:30 jdas-dev kernel: swap_pager: indefinite wait buffer: bufobj: 0, blkno: 445, size: 12288
Jan 22 09:02:30 jdas-dev kernel: swap_pager: indefinite wait buffer: bufobj: 0, blkno: 21337, size: 4096
Jan 22 09:02:30 jdas-dev kernel: swap_pager: indefinite wait buffer: bufobj: 0, blkno: 21854, size: 57344
Jan 22 09:02:30 jdas-dev kernel: swap_pager: indefinite wait buffer: bufobj: 0, blkno: 372873, size: 4096
Jan 22 09:02:30 jdas-dev kernel: swap_pager: indefinite wait buffer: bufobj: 0, blkno: 118775, size: 8192
Jan 22 09:02:30 jdas-dev kernel: swap_pager: indefinite wait buffer: bufobj: 0, blkno: 315370, size: 4096
Jan 22 09:02:30 jdas-dev kernel: swap_pager: indefinite wait buffer: bufobj: 0, blkno: 312763, size: 4096
Jan 22 09:02:30 jdas-dev kernel: swap_pager: indefinite wait buffer: bufobj: 0, blkno: 445, size: 12288
Jan 22 09:02:30 jdas-dev kernel: swap_pager: indefinite wait buffer: bufobj: 0, blkno: 21337, size: 4096
Jan 22 09:02:30 jdas-dev kernel: swap_pager: indefinite wait buffer: bufobj: 0, blkno: 21854, size: 57344
Jan 22 09:02:30 jdas-dev kernel: swap_pager: indefinite wait buffer: bufobj: 0, blkno: 372873, size: 4096
Jan 22 09:02:30 jdas-dev kernel: swap_pager: indefinite wait buffer: bufobj: 0, blkno: 118775, size: 8192
Jan 22 09:02:30 jdas-dev kernel: swap_pager: indefinite wait buffer: bufobj: 0, blkno: 312763, size: 4096
Jan 22 09:02:30 jdas-dev kernel: swap_pager: indefinite wait buffer: bufobj: 0, blkno: 315370, size: 4096
Jan 22 09:02:30 jdas-dev kernel: swap_pager: indefinite wait buffer: bufobj: 0, blkno: 445, size: 12288
Jan 22 09:02:30 jdas-dev kernel: swap_pager: indefinite wait buffer: bufobj: 0, blkno: 21337, size: 4096
Jan 22 09:02:30 jdas-dev kernel: swap_pager: indefinite wait buffer: bufobj: 0, blkno: 21854, size: 57344
Jan 22 09:02:30 jdas-dev kernel: swap_pager: indefinite wait buffer: bufobj: 0, blkno: 372873, size: 4096
Jan 22 09:02:30 jdas-dev kernel: swap_pager: indefinite wait buffer: bufobj: 0, blkno: 118775, size: 8192
Jan 22 09:02:30 jdas-dev kernel: swap_pager: indefinite wait buffer: bufobj: 0, blkno: 315370, size: 4096
Jan 22 09:02:30 jdas-dev kernel: swap_pager: indefinite wait buffer: bufobj: 0, blkno: 312763, size: 4096
Jan 22 09:02:30 jdas-dev kernel: swap_pager: indefinite wait buffer: bufobj: 0, blkno: 445, size: 12288
Jan 22 09:02:30 jdas-dev kernel: swap_pager: indefinite wait buffer: bufobj: 0, blkno: 21337, size: 4096
Jan 22 09:02:30 jdas-dev kernel: swap_pager: indefinite wait buffer: bufobj: 0, blkno: 21854, size: 57344
Jan 22 09:02:30 jdas-dev kernel: swap_pager: indefinite wait buffer: bufobj: 0, blkno: 372873, size: 4096
Jan 22 09:02:30 jdas-dev kernel: swap_pager: indefinite wait buffer: bufobj: 0, blkno: 118775, size: 8192
Jan 22 09:02:30 jdas-dev kernel: swap_pager: indefinite wait buffer: bufobj: 0, blkno: 315370, size: 4096
Jan 22 09:02:30 jdas-dev kernel: swap_pager: indefinite wait buffer: bufobj: 0, blkno: 312763, size: 4096
Jan 22 09:02:30 jdas-dev kernel: swap_pager: indefinite wait buffer: bufobj: 0, blkno: 445, size: 12288
Jan 22 09:02:30 jdas-dev kernel: swap_pager: indefinite wait buffer: bufobj: 0, blkno: 21337, size: 4096
Jan 22 09:02:30 jdas-dev kernel: ---<<BOOT>>---
Jan 22 09:02:30 jdas-dev kernel: Copyright (c) 1992-2019 The FreeBSD Project.
Jan 22 09:02:30 jdas-dev kernel: Copyright (c) 1979, 1980, 1983, 1986, 1988, 1989, 1991, 1992, 1993, 1994

-- 
You are receiving this mail because:
You are the assignee for the bug.