Date: Thu, 07 May 2020 17:31:00 +0000
From: bugzilla-noreply@freebsd.org
To: virtualization@FreeBSD.org
Subject: [Bug 236989] AWS EC2 lockups "Missing interrupt"
Message-ID: <bug-236989-27103-YC7WldXEuT@https.bugs.freebsd.org/bugzilla/>
In-Reply-To: <bug-236989-27103@https.bugs.freebsd.org/bugzilla/>
References: <bug-236989-27103@https.bugs.freebsd.org/bugzilla/>

https://bugs.freebsd.org/bugzilla/show_bug.cgi?id=236989

--- Comment #24 from Charles O'Donnell <cao@bus.net> ---
New development. See three notes below.

N.B. the system appears to have fully recovered. Normally I would have expected
a freeze.


1. CPU alarm from a custom AWS monitor at 16:43 UTC (12:43 PM ET):

Alarm Details:
- Name: Starch CPU
- Description:
- State Change: OK -> ALARM
- Reason for State Change: Threshold Crossed: 1 datapoint [31.4 (07/05/20 16:38:00)] was greater than or equal to the threshold (30.0).
- Timestamp: Thursday 07 May, 2020 16:43:35 UTC
- AWS Account: 539612714288
- Alarm Arn: arn:aws:cloudwatch:us-east-1:539612714288:alarm:Starch CPU

Threshold:
- The alarm is in the ALARM state when the metric is GreaterThanOrEqualToThreshold 30.0 for 300 seconds.


2. Sudden jump in failed 9k mbufs between 12:00 and 13:00 ET:

===> Thu May 7 10:00:00 EDT 2020
mbuf_jumbo_page: 4096, 490945, 0, 56,45464111, 0, 0
mbuf_jumbo_9k: 9216, 145465, 7538, 450,66361278,1640, 0
mbuf_jumbo_16k: 16384, 81824, 0, 0, 0, 0, 0
dev.ena.0.queue7.rx_ring.mjum_alloc_fail: 0
dev.ena.0.queue6.rx_ring.mjum_alloc_fail: 0
dev.ena.0.queue5.rx_ring.mjum_alloc_fail: 0
dev.ena.0.queue4.rx_ring.mjum_alloc_fail: 0
dev.ena.0.queue3.rx_ring.mjum_alloc_fail: 0
dev.ena.0.queue2.rx_ring.mjum_alloc_fail: 0
dev.ena.0.queue1.rx_ring.mjum_alloc_fail: 0
dev.ena.0.queue0.rx_ring.mjum_alloc_fail: 0

===> Thu May 7 11:00:00 EDT 2020
mbuf_jumbo_page: 4096, 490945, 16, 113,45658689, 0, 0
mbuf_jumbo_9k: 9216, 145465, 7592, 397,66645310,1642, 0
mbuf_jumbo_16k: 16384, 81824, 0, 0, 0, 0, 0
dev.ena.0.queue7.rx_ring.mjum_alloc_fail: 0
dev.ena.0.queue6.rx_ring.mjum_alloc_fail: 0
dev.ena.0.queue5.rx_ring.mjum_alloc_fail: 0
dev.ena.0.queue4.rx_ring.mjum_alloc_fail: 0
dev.ena.0.queue3.rx_ring.mjum_alloc_fail: 0
dev.ena.0.queue2.rx_ring.mjum_alloc_fail: 0
dev.ena.0.queue1.rx_ring.mjum_alloc_fail: 0
dev.ena.0.queue0.rx_ring.mjum_alloc_fail: 0

===> Thu May 7 12:00:00 EDT 2020
mbuf_jumbo_page: 4096, 490945, 182, 31,45730287, 0, 0
mbuf_jumbo_9k: 9216, 145465, 7461, 259,66753693,1693, 0
mbuf_jumbo_16k: 16384, 81824, 0, 0, 0, 0, 0
dev.ena.0.queue7.rx_ring.mjum_alloc_fail: 0
dev.ena.0.queue6.rx_ring.mjum_alloc_fail: 0
dev.ena.0.queue5.rx_ring.mjum_alloc_fail: 0
dev.ena.0.queue4.rx_ring.mjum_alloc_fail: 0
dev.ena.0.queue3.rx_ring.mjum_alloc_fail: 0
dev.ena.0.queue2.rx_ring.mjum_alloc_fail: 0
dev.ena.0.queue1.rx_ring.mjum_alloc_fail: 0
dev.ena.0.queue0.rx_ring.mjum_alloc_fail: 0

===> Thu May 7 13:00:00 EDT 2020
mbuf_jumbo_page: 4096, 490945, 119, 109,46249719, 0, 0
mbuf_jumbo_9k: 9216, 145465, 7863, 207,67594999,2577, 0
mbuf_jumbo_16k: 16384, 81824, 0, 0, 0, 0, 0
dev.ena.0.queue7.rx_ring.mjum_alloc_fail: 0
dev.ena.0.queue6.rx_ring.mjum_alloc_fail: 0
dev.ena.0.queue5.rx_ring.mjum_alloc_fail: 0
dev.ena.0.queue4.rx_ring.mjum_alloc_fail: 0
dev.ena.0.queue3.rx_ring.mjum_alloc_fail: 0
dev.ena.0.queue2.rx_ring.mjum_alloc_fail: 0
dev.ena.0.queue1.rx_ring.mjum_alloc_fail: 0
dev.ena.0.queue0.rx_ring.mjum_alloc_fail: 0


3. ena0 reset at 12:43 ET:

May 7 12:43:19 s4 kernel: ena0: The number of lost tx completion is above the threshold (129 > 128). Reset the device
May 7 12:43:19 s4 kernel: ena0: Trigger reset is on
May 7 12:43:19 s4 kernel: ena0: device is going DOWN
May 7 12:43:22 s4 kernel: ena0: free uncompleted tx mbuf qid 3 idx 0x319ena0: free uncompleted tx mbuf qid 7 idx 0x173
May 7 12:43:23 s4 kernel: ena0: ena0: device is going UP
May 7 12:43:23 s4 kernel: link is UP
May 7 12:45:00 s4 kernel: ena0: The number of lost tx completion is above the threshold (129 > 128). Reset the device
May 7 12:45:00 s4 kernel: ena0: Trigger reset is on
May 7 12:45:00 s4 kernel: ena0: device is going DOWN
May 7 12:45:04 s4 kernel: ena0: free uncompleted tx mbuf qid 3 idx 0x102
May 7 12:45:04 s4 kernel: ena0: ena0: device is going UP
May 7 12:45:04 s4 kernel: link is UP
May 7 12:45:26 s4 kernel: ena0: The number of lost tx completion is above the threshold (129 > 128). Reset the device
May 7 12:45:26 s4 kernel: ena0: Trigger reset is on
May 7 12:45:26 s4 kernel: ena0: device is going DOWN
May 7 12:45:29 s4 kernel: ena0: free uncompleted tx mbuf qid 1 idx 0x3c7ena0: free uncompleted tx mbuf qid 2 idx 0x2c5ena0: free uncompleted tx mbuf qid 6 idx 0x2abena0: free uncompleted tx mbuf qid 7 idx 0x241
May 7 12:45:30 s4 kernel:
May 7 12:45:30 s4 kernel: stray irq265
May 7 12:45:30 s4 kernel: ena0: ena0: device is going UP
May 7 12:45:30 s4 kernel: link is UP
May 7 12:46:05 s4 kernel: ena0: Keep alive watchdog timeout.
May 7 12:46:05 s4 kernel: ena0: Trigger reset is on
May 7 12:46:05 s4 kernel: ena0: device is going DOWN
May 7 12:46:07 s4 kernel: ena0: free uncompleted tx mbuf qid 1 idx 0x123ena0: free uncompleted tx mbuf qid 3 idx 0xeeena0: free uncompleted tx mbuf qid 6 idx 0x208
May 7 12:46:08 s4 kernel: ena0: ena0: device is going UP
May 7 12:46:08 s4 kernel: link is UP
May 7 12:46:36 s4 kernel: ena0: The number of lost tx completion is above the threshold (129 > 128). Reset the device
May 7 12:46:36 s4 kernel: ena0: Trigger reset is on
May 7 12:46:36 s4 kernel: ena0: device is going DOWN
May 7 12:46:37 s4 kernel: ena0: free uncompleted tx mbuf qid 0 idx 0x2c2ena0: free uncompleted tx mbuf qid 1 idx 0x135ena0: free uncompleted tx mbuf qid 2 idx 0xeeena0: free uncompleted tx mbuf qid 3 idx 0x373ena0: free uncompleted tx mbuf qid 4 idx 0x88ena0: free uncompleted t>
May 7 12:46:38 s4 kernel: ena0: ena0: device is going UP
May 7 12:46:38 s4 kernel: link is UP

-- 
You are receiving this mail because:
You are the assignee for the bug.
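
The jump described in note 2 can be put in numbers by differencing the mbuf_jumbo_9k failure counter between consecutive hourly snapshots. The following is a minimal sketch, not part of the original report. It assumes the zone lines use the column order size, limit, used, free, requests, failures, sleeps (the layout `vmstat -z` prints on FreeBSD; the report does not name the command), and the function names `parse_fail_counts` and `hourly_deltas` are hypothetical helpers introduced only for illustration.

```python
# Hourly mbuf_jumbo_9k snapshots copied from note 2 of the report.
SNAPSHOTS = """\
===> Thu May 7 10:00:00 EDT 2020
mbuf_jumbo_9k: 9216, 145465, 7538, 450,66361278,1640, 0
===> Thu May 7 11:00:00 EDT 2020
mbuf_jumbo_9k: 9216, 145465, 7592, 397,66645310,1642, 0
===> Thu May 7 12:00:00 EDT 2020
mbuf_jumbo_9k: 9216, 145465, 7461, 259,66753693,1693, 0
===> Thu May 7 13:00:00 EDT 2020
mbuf_jumbo_9k: 9216, 145465, 7863, 207,67594999,2577, 0
"""


def parse_fail_counts(text, zone="mbuf_jumbo_9k"):
    """Return [(timestamp, cumulative_fail_count), ...] for one zone."""
    samples = []
    stamp = None
    for line in text.splitlines():
        line = line.strip()
        if line.startswith("===>"):
            # Snapshot header: remember the timestamp for the lines below it.
            stamp = line[len("===>"):].strip()
        elif line.startswith(zone + ":"):
            fields = [f.strip() for f in line.split(":", 1)[1].split(",")]
            # ASSUMPTION: the sixth field (index 5) is the failure counter.
            samples.append((stamp, int(fields[5])))
    return samples


def hourly_deltas(samples):
    """Difference each cumulative count against the previous snapshot."""
    return [
        (later[0], later[1] - earlier[1])
        for earlier, later in zip(samples, samples[1:])
    ]


if __name__ == "__main__":
    for stamp, delta in hourly_deltas(parse_fail_counts(SNAPSHOTS)):
        print(f"{stamp}: +{delta} failed allocations since previous snapshot")
```

With the figures above, the counter rises by 2 and then 51 in the first two intervals and by 884 between 12:00 and 13:00 ET, the hour that contains the ena0 resets in note 3.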