Skip site navigation (1)Skip section navigation (2)
Date:      Mon, 11 May 2020 20:56:09 +0000
From:      bugzilla-noreply@freebsd.org
To:        virtualization@FreeBSD.org
Subject:   [Bug 236989] AWS EC2 lockups "Missing interrupt"
Message-ID:  <bug-236989-27103-ron704LEMq@https.bugs.freebsd.org/bugzilla/>
In-Reply-To: <bug-236989-27103@https.bugs.freebsd.org/bugzilla/>
References:  <bug-236989-27103@https.bugs.freebsd.org/bugzilla/>

next in thread | previous in thread | raw e-mail | index | archive | help
https://bugs.freebsd.org/bugzilla/show_bug.cgi?id=3D236989

--- Comment #26 from Charles O'Donnell <cao@bus.net> ---
Not sure if this is expected behavior with the patched kernel.

Picking up from day 6 above:

=3D=3D=3D> Thu May  7 12:00:00 EDT 2020
mbuf_jumbo_page:       4096, 490945,     182,      31,45730287,   0,   0
mbuf_jumbo_9k:         9216, 145465,    7461,     259,66753693,1693,   0
mbuf_jumbo_16k:       16384,  81824,       0,       0,       0,   0,   0
dev.ena.0.queue7.rx_ring.mjum_alloc_fail: 0
dev.ena.0.queue6.rx_ring.mjum_alloc_fail: 0
dev.ena.0.queue5.rx_ring.mjum_alloc_fail: 0
dev.ena.0.queue4.rx_ring.mjum_alloc_fail: 0
dev.ena.0.queue3.rx_ring.mjum_alloc_fail: 0
dev.ena.0.queue2.rx_ring.mjum_alloc_fail: 0
dev.ena.0.queue1.rx_ring.mjum_alloc_fail: 0
dev.ena.0.queue0.rx_ring.mjum_alloc_fail: 0

=3D=3D=3D> Fri May  8 12:00:00 EDT 2020
mbuf_jumbo_page:       4096, 490945,      82,      71,52207112,   0,   0
mbuf_jumbo_9k:         9216, 145465,    7551,     351,75614231,2625,   0
mbuf_jumbo_16k:       16384,  81824,       0,       0,       0,   0,   0
dev.ena.0.queue7.rx_ring.mjum_alloc_fail: 0
dev.ena.0.queue6.rx_ring.mjum_alloc_fail: 0
dev.ena.0.queue5.rx_ring.mjum_alloc_fail: 0
dev.ena.0.queue4.rx_ring.mjum_alloc_fail: 0
dev.ena.0.queue3.rx_ring.mjum_alloc_fail: 0
dev.ena.0.queue2.rx_ring.mjum_alloc_fail: 0
dev.ena.0.queue1.rx_ring.mjum_alloc_fail: 0
dev.ena.0.queue0.rx_ring.mjum_alloc_fail: 0

=3D=3D=3D> Sat May  9 12:00:00 EDT 2020
mbuf_jumbo_page:       4096, 490945,     139,      69,56246475,   0,   0
mbuf_jumbo_9k:         9216, 145465,    7799,     339,83036796,2784,   0
mbuf_jumbo_16k:       16384,  81824,       0,       0,       0,   0,   0
dev.ena.0.queue7.rx_ring.mjum_alloc_fail: 0
dev.ena.0.queue6.rx_ring.mjum_alloc_fail: 0
dev.ena.0.queue5.rx_ring.mjum_alloc_fail: 0
dev.ena.0.queue4.rx_ring.mjum_alloc_fail: 0
dev.ena.0.queue3.rx_ring.mjum_alloc_fail: 0
dev.ena.0.queue2.rx_ring.mjum_alloc_fail: 0
dev.ena.0.queue1.rx_ring.mjum_alloc_fail: 0
dev.ena.0.queue0.rx_ring.mjum_alloc_fail: 0

=3D=3D=3D> Sun May 10 12:00:00 EDT 2020
mbuf_jumbo_page:       4096, 490945,      39,      51,59639876,   0,   0
mbuf_jumbo_9k:         9216, 145465,    7663,     222,85652082,2963,   0
mbuf_jumbo_16k:       16384,  81824,       0,       0,       0,   0,   0
dev.ena.0.queue7.rx_ring.mjum_alloc_fail: 0
dev.ena.0.queue6.rx_ring.mjum_alloc_fail: 0
dev.ena.0.queue5.rx_ring.mjum_alloc_fail: 0
dev.ena.0.queue4.rx_ring.mjum_alloc_fail: 0
dev.ena.0.queue3.rx_ring.mjum_alloc_fail: 0
dev.ena.0.queue2.rx_ring.mjum_alloc_fail: 0
dev.ena.0.queue1.rx_ring.mjum_alloc_fail: 0
dev.ena.0.queue0.rx_ring.mjum_alloc_fail: 0

=3D=3D=3D> Mon May 11 12:00:00 EDT 2020
mbuf_jumbo_page:       4096, 490945,    1598,     223,63502113,   0,   0
mbuf_jumbo_9k:         9216, 145465,    6230,      90,91087037,4248,   0
mbuf_jumbo_16k:       16384,  81824,       0,       0,       0,   0,   0
dev.ena.0.queue7.rx_ring.mjum_alloc_fail: 0
dev.ena.0.queue6.rx_ring.mjum_alloc_fail: 0
dev.ena.0.queue5.rx_ring.mjum_alloc_fail: 0
dev.ena.0.queue4.rx_ring.mjum_alloc_fail: 0
dev.ena.0.queue3.rx_ring.mjum_alloc_fail: 0
dev.ena.0.queue2.rx_ring.mjum_alloc_fail: 0
dev.ena.0.queue1.rx_ring.mjum_alloc_fail: 0
dev.ena.0.queue0.rx_ring.mjum_alloc_fail: 0


System spinning. Had to reboot.

Related /var/log/messages activity show below.


May 11 12:20:34 s4 kernel: ena0: The number of lost tx completion is above =
the
threshold (129 > 128). Reset the device
May 11 12:20:34 s4 kernel: ena0: Trigger reset is on
May 11 12:20:34 s4 kernel: ena0: device is going DOWN
May 11 12:20:39 s4 kernel: ena0: free uncompleted tx mbuf qid 1 idx 0x2e1
May 11 12:20:40 s4 kernel: ena0: device is going UP
May 11 12:20:40 s4 kernel: ena0: link is UP
May 11 12:21:55 s4 kernel: ena0: The number of lost tx completion is above =
the
threshold (129 > 128). Reset the device
May 11 12:21:55 s4 kernel: ena0: Trigger reset is on
May 11 12:21:55 s4 kernel: ena0: device is going DOWN
May 11 12:22:02 s4 kernel: ena0: free uncompleted tx mbuf qid 1 idx 0x172en=
a0:
free uncompleted tx mbuf qid 3 idx 0x66ena0: free>
May 11 12:22:03 s4 kernel:=20
May 11 12:22:03 s4 kernel: stray irq265
May 11 12:22:03 s4 kernel: ena0: ena0: device is going UP
May 11 12:22:03 s4 kernel: link is UP
May 11 12:22:29 s4 kernel: ena0: The number of lost tx completion is above =
the
threshold (129 > 128). Reset the device
May 11 12:22:29 s4 kernel: ena0: Trigger reset is on
May 11 12:22:29 s4 kernel: ena0: device is going DOWN
May 11 12:22:32 s4 kernel: ena0: free uncompleted tx mbuf qid 2 idx 0x32ben=
a0:
free uncompleted tx mbuf qid 4 idx 0x338ena0: fre>
May 11 12:22:33 s4 kernel:=20
May 11 12:22:33 s4 kernel: stray irq265
May 11 12:22:33 s4 kernel: ena0: ena0: device is going UP
May 11 12:22:33 s4 kernel: link is UP
May 11 12:23:04 s4 kernel: ena0: The number of lost tx completion is above =
the
threshold (129 > 128). Reset the device
May 11 12:23:04 s4 kernel: ena0: Trigger reset is on
May 11 12:23:04 s4 kernel: ena0: device is going DOWN
May 11 12:23:11 s4 kernel: ena0: free uncompleted tx mbuf qid 6 idx 0x10e
May 11 12:23:12 s4 kernel: ena0: ena0: device is going UP
May 11 12:23:12 s4 kernel: link is UP
May 11 12:25:33 s4 kernel: ena0: The number of lost tx completion is above =
the
threshold (129 > 128). Reset the device
May 11 12:25:33 s4 kernel: ena0: Trigger reset is on
May 11 12:25:33 s4 kernel: ena0: device is going DOWN
May 11 12:25:38 s4 kernel: ena0: free uncompleted tx mbuf qid 0 idx 0x0
May 11 12:25:39 s4 kernel: ena0: ena0: device is going UP
May 11 12:25:39 s4 kernel: link is UP
May 11 12:28:36 s4 kernel: ena0: Keep alive watchdog timeout.
May 11 12:28:36 s4 kernel: ena0: Trigger reset is on
May 11 12:28:36 s4 kernel: ena0: device is going DOWN
May 11 12:28:39 s4 kernel: ena0: device is going UP
May 11 12:28:39 s4 kernel: ena0: link is UP
May 11 12:30:04 s4 kernel: nvme4: cpl does not map to outstanding cmd
May 11 12:30:04 s4 kernel: cdw0:00000000 sqhd:0013 sqid:0002 cid:0012 p:0 s=
c:00
sct:0 m:0 dnr:0
May 11 12:30:04 s4 kernel: nvme4: Missing interrupt
May 11 12:30:04 s4 kernel: nvme1: Missing interrupt
May 11 12:30:34 s4 kernel: nvme4: Resetting controller due to a timeout.
May 11 12:30:34 s4 kernel: nvme4: resetting controller
May 11 12:30:47 s4 kernel: nvme4: temperature threshold not supported
May 11 12:30:47 s4 kernel: nvme4: aborting outstanding i/o
May 11 12:30:47 s4 syslogd: last message repeated 3 times
May 11 12:31:26 s4 kernel: ena0: The number of lost tx completion is above =
the
threshold (129 > 128). Reset the device
May 11 12:31:26 s4 kernel: ena0: Trigger reset is on
May 11 12:31:26 s4 kernel: ena0: device is going DOWN
May 11 12:31:27 s4 kernel: ena0: free uncompleted tx mbuf qid 2 idx 0x20fen=
a0:
free uncompleted tx mbuf qid 4 idx 0x8fena0: free>
May 11 12:31:28 s4 kernel:=20
May 11 12:31:28 s4 kernel: stray irq266
May 11 12:31:28 s4 kernel: stray irq272
May 11 12:31:28 s4 kernel: stray irq265
May 11 12:31:28 s4 kernel: ena0: device is going UP
May 11 12:31:28 s4 kernel: ena0: link is UP
May 11 12:32:46 s4 kernel: nvme1: Missing interrupt
May 11 12:32:46 s4 kernel: nvme4: cpl does not map to outstanding cmd
May 11 12:32:46 s4 kernel: cdw0:00000000 sqhd:0007 sqid:0001 cid:0015 p:1 s=
c:00
sct:0 m:0 dnr:0
May 11 12:32:46 s4 kernel: nvme4: Missing interrupt
May 11 12:32:46 s4 kernel: nvme2: Missing interrupt
May 11 12:33:16 s4 kernel: nvme4: Resetting controller due to a timeout.
May 11 12:33:16 s4 kernel: nvme4: resetting controller
May 11 12:33:17 s4 kernel: nvme4: temperature threshold not supported
May 11 12:33:17 s4 kernel: nvme4: aborting outstanding i/o
May 11 12:33:17 s4 syslogd: last message repeated 2 times
May 11 12:33:40 s4 kernel: ena0: The number of lost tx completion is above =
the
threshold (129 > 128). Reset the device
May 11 12:33:40 s4 kernel: ena0: Trigger reset is on
May 11 12:33:40 s4 kernel: ena0: device is going DOWN
May 11 12:33:43 s4 kernel: ena0: free uncompleted tx mbuf qid 3 idx 0x0
May 11 12:33:44 s4 kernel: ena0: ena0: device is going UP
May 11 12:33:44 s4 kernel: link is UP
May 11 12:34:15 s4 kernel: ena0: Keep alive watchdog timeout.
May 11 12:34:15 s4 kernel: ena0: Trigger reset is on
May 11 12:34:15 s4 kernel: ena0: device is going DOWN
May 11 12:34:21 s4 kernel: ena0: ena0: device is going UP
May 11 12:34:21 s4 kernel: link is UP
May 11 12:35:33 s4 kernel: ena0: The number of lost tx completion is above =
the
threshold (129 > 128). Reset the device
May 11 12:35:33 s4 kernel: ena0: Trigger reset is on
May 11 12:35:33 s4 kernel: ena0: device is going DOWN
May 11 12:35:42 s4 kernel: ena0: free uncompleted tx mbuf qid 3 idx 0x34
May 11 12:35:43 s4 kernel: ena0: ena0: device is going UP
May 11 12:35:43 s4 kernel: link is UP
May 11 12:36:16 s4 kernel: ena0: Keep alive watchdog timeout.
May 11 12:36:16 s4 kernel: ena0: Trigger reset is on
May 11 12:36:16 s4 kernel: ena0: device is going DOWN
May 11 12:36:21 s4 kernel: ena0: free uncompleted tx mbuf qid 7 idx 0x20a
May 11 12:36:22 s4 kernel: ena0: ena0: device is going UP
May 11 12:36:22 s4 kernel: link is UP
May 11 12:37:05 s4 kernel: ena0: The number of lost tx completion is above =
the
threshold (129 > 128). Reset the device
May 11 12:37:05 s4 kernel: ena0: Trigger reset is on
May 11 12:37:05 s4 kernel: ena0: device is going DOWN
May 11 12:37:09 s4 kernel: ena0: free uncompleted tx mbuf qid 4 idx 0x23f
May 11 12:37:10 s4 kernel: ena0: ena0: device is going UP
May 11 12:37:10 s4 kernel: link is UP
May 11 12:38:38 s4 kernel: ena0: The number of lost tx completion is above =
the
threshold (129 > 128). Reset the device
May 11 12:38:38 s4 kernel: ena0: Trigger reset is on
May 11 12:38:38 s4 kernel: ena0: device is going DOWN
May 11 12:38:43 s4 kernel: ena0: free uncompleted tx mbuf qid 5 idx 0x1b3
May 11 12:38:44 s4 kernel:=20
May 11 12:38:44 s4 kernel: too many stray irq 265's: not logging anymore
May 11 12:38:44 s4 kernel: ena0: ena0: device is going UP
May 11 12:38:44 s4 kernel: link is UP
May 11 12:39:17 s4 kernel: ena0: The number of lost tx completion is above =
the
threshold (129 > 128). Reset the device
May 11 12:39:17 s4 kernel: ena0: Trigger reset is on
May 11 12:39:17 s4 kernel: ena0: device is going DOWN
May 11 12:39:25 s4 kernel: ena0: free uncompleted tx mbuf qid 1 idx 0x238en=
a0:
free uncompleted tx mbuf qid 6 idx 0x249
May 11 12:39:26 s4 kernel: ena0: ena0: device is going UP
May 11 12:39:26 s4 kernel: link is UP
May 11 12:40:20 s4 kernel: ena0: The number of lost tx completion is above =
the
threshold (129 > 128). Reset the device
May 11 12:40:20 s4 kernel: ena0: Trigger reset is on
May 11 12:40:20 s4 kernel: ena0: device is going DOWN
May 11 12:40:24 s4 kernel: ena0: free uncompleted tx mbuf qid 0 idx 0x3c0
May 11 12:40:25 s4 kernel: ena0: device is going UP
May 11 12:40:25 s4 kernel: ena0: link is UP
May 11 12:41:17 s4 kernel: ena0: Keep alive watchdog timeout.
May 11 12:41:17 s4 kernel: ena0: Trigger reset is on
May 11 12:41:17 s4 kernel: ena0: device is going DOWN
May 11 12:41:22 s4 kernel: ena0: device is going UP
May 11 12:41:22 s4 kernel: ena0: link is UP
May 11 12:42:17 s4 kernel: ena0: Keep alive watchdog timeout.
May 11 12:42:17 s4 kernel: ena0: Trigger reset is on
May 11 12:42:17 s4 kernel: ena0: device is going DOWN
May 11 12:42:17 s4 sm-msp-queue[70209]: 04BGPgiQ066459: SYSERR(root): timeo=
ut
writing message to [127.0.0.1]
May 11 12:42:21 s4 kernel: ena0: ena0: device is going UP
May 11 12:42:21 s4 kernel: link is UP
May 11 12:43:01 s4 kernel: ena0: The number of lost tx completion is above =
the
threshold (129 > 128). Reset the device
May 11 12:43:01 s4 kernel: ena0: Trigger reset is on
May 11 12:43:01 s4 kernel: ena0: device is going DOWN
May 11 12:43:06 s4 kernel: ena0: free uncompleted tx mbuf qid 2 idx 0x359
May 11 12:43:07 s4 kernel:=20
May 11 12:43:07 s4 kernel: stray irq266
May 11 12:43:07 s4 kernel: ena0: ena0: device is going UP
May 11 12:43:07 s4 kernel: link is UP
May 11 12:45:23 s4 kernel: ena0: Keep alive watchdog timeout.
May 11 12:45:23 s4 kernel: ena0: Trigger reset is on
May 11 12:45:23 s4 kernel: ena0: device is going DOWN
May 11 12:45:33 s4 kernel: ena0: ena0: device is going UP
May 11 12:45:33 s4 kernel: link is UP
May 11 12:48:08 s4 kernel: ena0: Keep alive watchdog timeout.
May 11 12:48:08 s4 kernel: ena0: Trigger reset is on
May 11 12:48:08 s4 kernel: ena0: device is going DOWN
May 11 12:48:11 s4 kernel: ena0: ena0: device is going UP
May 11 12:48:11 s4 kernel: link is UP
May 11 12:48:51 s4 kernel: ena0: The number of lost tx completion is above =
the
threshold (129 > 128). Reset the device
May 11 12:48:51 s4 kernel: ena0: Trigger reset is on
May 11 12:48:51 s4 kernel: ena0: device is going DOWN
May 11 12:48:55 s4 kernel: ena0: free uncompleted tx mbuf qid 7 idx 0x266
May 11 12:48:55 s4 kernel: ena0: ena0: device is going UP
May 11 12:48:55 s4 kernel: link is UP
May 11 12:50:38 s4 kernel: ena0: Keep alive watchdog timeout.
May 11 12:50:38 s4 kernel: ena0: Trigger reset is on
May 11 12:50:38 s4 kernel: ena0: device is going DOWN
May 11 12:50:41 s4 kernel: ena0: ena0: device is going UP
May 11 12:50:41 s4 kernel: link is UP
May 11 12:51:43 s4 kernel: ena0: The number of lost tx completion is above =
the
threshold (129 > 128). Reset the device
May 11 12:51:43 s4 kernel: ena0: Trigger reset is on
May 11 12:51:43 s4 kernel: ena0: device is going DOWN
May 11 12:51:50 s4 kernel: ena0: free uncompleted tx mbuf qid 5 idx 0x358
May 11 12:51:51 s4 kernel: ena0: ena0: device is going UP
May 11 12:51:51 s4 kernel: link is UP
May 11 12:53:24 s4 kernel: ena0: The number of lost tx completion is above =
the
threshold (129 > 128). Reset the device
May 11 12:53:24 s4 kernel: ena0: Trigger reset is on
May 11 12:53:24 s4 kernel: ena0: device is going DOWN
May 11 12:53:28 s4 kernel: ena0: free uncompleted tx mbuf qid 5 idx 0x100
May 11 12:53:29 s4 kernel: ena0: ena0: device is going UP
May 11 12:53:29 s4 kernel: link is UP
May 11 12:56:19 s4 kernel: ena0: The number of lost tx completion is above =
the
threshold (129 > 128). Reset the device
May 11 12:56:19 s4 kernel: ena0: Trigger reset is on
May 11 12:56:19 s4 kernel: ena0: device is going DOWN
May 11 12:56:22 s4 kernel: ena0: free uncompleted tx mbuf qid 4 idx 0x3d8
May 11 12:56:23 s4 kernel: ena0: ena0: device is going UP
May 11 12:56:23 s4 kernel: link is UP
May 11 12:57:36 s4 kernel: .

--=20
You are receiving this mail because:
You are the assignee for the bug.=



Want to link to this message? Use this URL: <https://mail-archive.FreeBSD.org/cgi/mid.cgi?bug-236989-27103-ron704LEMq>