Date:      Thu, 07 May 2020 17:31:00 +0000
From:      bugzilla-noreply@freebsd.org
To:        virtualization@FreeBSD.org
Subject:   [Bug 236989] AWS EC2 lockups "Missing interrupt"
Message-ID:  <bug-236989-27103-YC7WldXEuT@https.bugs.freebsd.org/bugzilla/>
In-Reply-To: <bug-236989-27103@https.bugs.freebsd.org/bugzilla/>
References:  <bug-236989-27103@https.bugs.freebsd.org/bugzilla/>

https://bugs.freebsd.org/bugzilla/show_bug.cgi?id=236989

--- Comment #24 from Charles O'Donnell <cao@bus.net> ---
New development. See three notes below.

N.B. the system appears to have fully recovered. Normally I would have
expected a freeze.



1. CPU alarm from a custom AWS monitor at 16:43 UTC (12:43 PM ET):

Alarm Details:
- Name:                       Starch CPU
- Description:
- State Change:               OK -> ALARM
- Reason for State Change:    Threshold Crossed: 1 datapoint [31.4 (07/05/20
16:38:00)] was greater than or equal to the threshold (30.0).
- Timestamp:                  Thursday 07 May, 2020 16:43:35 UTC
- AWS Account:                539612714288
- Alarm Arn:                  arn:aws:cloudwatch:us-east-1:539612714288:alarm:Starch CPU

Threshold:
- The alarm is in the ALARM state when the metric is
GreaterThanOrEqualToThreshold 30.0 for 300 seconds.
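
To line the alarm up with the kernel log in note 3, the UTC alarm timestamp can be converted to host-local time. The sketch below is only illustrative: it assumes the host logs in EDT (UTC-4), as the "EDT" snapshot headers in note 2 suggest, and the variable names are made up for the example.

```python
from datetime import datetime, timedelta, timezone

# Assumption: host-local "ET" timestamps are EDT (UTC-4) on this date.
EDT = timezone(timedelta(hours=-4), "EDT")

# Alarm timestamp from the alarm details above (UTC).
alarm_utc = datetime(2020, 5, 7, 16, 43, 35, tzinfo=timezone.utc)
# First "Reset the device" kernel message from note 3 (host-local time).
first_reset = datetime(2020, 5, 7, 12, 43, 19, tzinfo=EDT)

alarm_local = alarm_utc.astimezone(EDT)
gap = alarm_utc - first_reset  # aware datetimes subtract correctly across zones

print(alarm_local.strftime("%H:%M:%S %Z"))
print(f"alarm fired {gap.total_seconds():.0f}s after the first reset")
```

Under that assumption the alarm notification lands at 12:43:35 local time, 16 seconds after the first ena0 reset message.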



2. Sudden jump in failed 9k mbuf allocations between 12:00 and 13:00 ET
(failure count 1693 -> 2577):

===> Thu May  7 10:00:00 EDT 2020
mbuf_jumbo_page:       4096, 490945,       0,      56,45464111,   0,   0
mbuf_jumbo_9k:         9216, 145465,    7538,     450,66361278,1640,   0
mbuf_jumbo_16k:       16384,  81824,       0,       0,       0,   0,   0
dev.ena.0.queue7.rx_ring.mjum_alloc_fail: 0
dev.ena.0.queue6.rx_ring.mjum_alloc_fail: 0
dev.ena.0.queue5.rx_ring.mjum_alloc_fail: 0
dev.ena.0.queue4.rx_ring.mjum_alloc_fail: 0
dev.ena.0.queue3.rx_ring.mjum_alloc_fail: 0
dev.ena.0.queue2.rx_ring.mjum_alloc_fail: 0
dev.ena.0.queue1.rx_ring.mjum_alloc_fail: 0
dev.ena.0.queue0.rx_ring.mjum_alloc_fail: 0
===> Thu May  7 11:00:00 EDT 2020
mbuf_jumbo_page:       4096, 490945,      16,     113,45658689,   0,   0
mbuf_jumbo_9k:         9216, 145465,    7592,     397,66645310,1642,   0
mbuf_jumbo_16k:       16384,  81824,       0,       0,       0,   0,   0
dev.ena.0.queue7.rx_ring.mjum_alloc_fail: 0
dev.ena.0.queue6.rx_ring.mjum_alloc_fail: 0
dev.ena.0.queue5.rx_ring.mjum_alloc_fail: 0
dev.ena.0.queue4.rx_ring.mjum_alloc_fail: 0
dev.ena.0.queue3.rx_ring.mjum_alloc_fail: 0
dev.ena.0.queue2.rx_ring.mjum_alloc_fail: 0
dev.ena.0.queue1.rx_ring.mjum_alloc_fail: 0
dev.ena.0.queue0.rx_ring.mjum_alloc_fail: 0
===> Thu May  7 12:00:00 EDT 2020
mbuf_jumbo_page:       4096, 490945,     182,      31,45730287,   0,   0
mbuf_jumbo_9k:         9216, 145465,    7461,     259,66753693,1693,   0
mbuf_jumbo_16k:       16384,  81824,       0,       0,       0,   0,   0
dev.ena.0.queue7.rx_ring.mjum_alloc_fail: 0
dev.ena.0.queue6.rx_ring.mjum_alloc_fail: 0
dev.ena.0.queue5.rx_ring.mjum_alloc_fail: 0
dev.ena.0.queue4.rx_ring.mjum_alloc_fail: 0
dev.ena.0.queue3.rx_ring.mjum_alloc_fail: 0
dev.ena.0.queue2.rx_ring.mjum_alloc_fail: 0
dev.ena.0.queue1.rx_ring.mjum_alloc_fail: 0
dev.ena.0.queue0.rx_ring.mjum_alloc_fail: 0
===> Thu May  7 13:00:00 EDT 2020
mbuf_jumbo_page:       4096, 490945,     119,     109,46249719,   0,   0
mbuf_jumbo_9k:         9216, 145465,    7863,     207,67594999,2577,   0
mbuf_jumbo_16k:       16384,  81824,       0,       0,       0,   0,   0
dev.ena.0.queue7.rx_ring.mjum_alloc_fail: 0
dev.ena.0.queue6.rx_ring.mjum_alloc_fail: 0
dev.ena.0.queue5.rx_ring.mjum_alloc_fail: 0
dev.ena.0.queue4.rx_ring.mjum_alloc_fail: 0
dev.ena.0.queue3.rx_ring.mjum_alloc_fail: 0
dev.ena.0.queue2.rx_ring.mjum_alloc_fail: 0
dev.ena.0.queue1.rx_ring.mjum_alloc_fail: 0
dev.ena.0.queue0.rx_ring.mjum_alloc_fail: 0
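
The hourly change in the failure counter is easier to see as a delta than as a running total. A minimal sketch follows, assuming the fields after the zone name are SIZE, LIMIT, USED, FREE, REQ, FAIL, SLEEP (the usual `vmstat -z` column order); the helper names are hypothetical.

```python
# FAIL-column deltas for the mbuf_jumbo_9k zone between hourly snapshots.
# Assumption: fields after the zone name are
# SIZE, LIMIT, USED, FREE, REQ, FAIL, SLEEP.
FIELDS = ("size", "limit", "used", "free", "req", "fail", "sleep")

# mbuf_jumbo_9k lines copied from the snapshots above, keyed by local hour.
SNAPSHOTS = [
    ("10:00", "mbuf_jumbo_9k:         9216, 145465,    7538,     450,66361278,1640,   0"),
    ("11:00", "mbuf_jumbo_9k:         9216, 145465,    7592,     397,66645310,1642,   0"),
    ("12:00", "mbuf_jumbo_9k:         9216, 145465,    7461,     259,66753693,1693,   0"),
    ("13:00", "mbuf_jumbo_9k:         9216, 145465,    7863,     207,67594999,2577,   0"),
]


def parse_zone_line(line):
    """Split 'name: v1, v2, ...' into (name, {field: int})."""
    name, _, rest = line.partition(":")
    values = [int(v) for v in rest.split(",")]
    return name.strip(), dict(zip(FIELDS, values))


def fail_deltas(snapshots):
    """Return [(hour, new failures since the previous snapshot), ...]."""
    parsed = [(hour, parse_zone_line(line)[1]) for hour, line in snapshots]
    return [
        (hour, cur["fail"] - prev["fail"])
        for (_, prev), (hour, cur) in zip(parsed, parsed[1:])
    ]


for hour, delta in fail_deltas(SNAPSHOTS):
    print(f"by {hour}: +{delta} failed 9k allocations")
```

For the four snapshots above this gives 2, 51 and 884 new failures in the hours ending 11:00, 12:00 and 13:00, while every per-queue mjum_alloc_fail counter stays at 0.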



3. ena0 reset five times between 12:43 and 12:46 ET:

May  7 12:43:19 s4 kernel: ena0: The number of lost tx completion is above the threshold (129 > 128). Reset the device
May  7 12:43:19 s4 kernel: ena0: Trigger reset is on
May  7 12:43:19 s4 kernel: ena0: device is going DOWN
May  7 12:43:22 s4 kernel: ena0: free uncompleted tx mbuf qid 3 idx 0x319ena0: free uncompleted tx mbuf qid 7 idx 0x173
May  7 12:43:23 s4 kernel: ena0: ena0: device is going UP
May  7 12:43:23 s4 kernel: link is UP
May  7 12:45:00 s4 kernel: ena0: The number of lost tx completion is above the threshold (129 > 128). Reset the device
May  7 12:45:00 s4 kernel: ena0: Trigger reset is on
May  7 12:45:00 s4 kernel: ena0: device is going DOWN
May  7 12:45:04 s4 kernel: ena0: free uncompleted tx mbuf qid 3 idx 0x102
May  7 12:45:04 s4 kernel: ena0: ena0: device is going UP
May  7 12:45:04 s4 kernel: link is UP
May  7 12:45:26 s4 kernel: ena0: The number of lost tx completion is above the threshold (129 > 128). Reset the device
May  7 12:45:26 s4 kernel: ena0: Trigger reset is on
May  7 12:45:26 s4 kernel: ena0: device is going DOWN
May  7 12:45:29 s4 kernel: ena0: free uncompleted tx mbuf qid 1 idx 0x3c7ena0: free uncompleted tx mbuf qid 2 idx 0x2c5ena0: free uncompleted tx mbuf qid 6 idx 0x2abena0: free uncompleted tx mbuf qid 7 idx 0x241
May  7 12:45:30 s4 kernel:
May  7 12:45:30 s4 kernel: stray irq265
May  7 12:45:30 s4 kernel: ena0: ena0: device is going UP
May  7 12:45:30 s4 kernel: link is UP
May  7 12:46:05 s4 kernel: ena0: Keep alive watchdog timeout.
May  7 12:46:05 s4 kernel: ena0: Trigger reset is on
May  7 12:46:05 s4 kernel: ena0: device is going DOWN
May  7 12:46:07 s4 kernel: ena0: free uncompleted tx mbuf qid 1 idx 0x123ena0: free uncompleted tx mbuf qid 3 idx 0xeeena0: free uncompleted tx mbuf qid 6 idx 0x208
May  7 12:46:08 s4 kernel: ena0: ena0: device is going UP
May  7 12:46:08 s4 kernel: link is UP
May  7 12:46:36 s4 kernel: ena0: The number of lost tx completion is above the threshold (129 > 128). Reset the device
May  7 12:46:36 s4 kernel: ena0: Trigger reset is on
May  7 12:46:36 s4 kernel: ena0: device is going DOWN
May  7 12:46:37 s4 kernel: ena0: free uncompleted tx mbuf qid 0 idx 0x2c2ena0: free uncompleted tx mbuf qid 1 idx 0x135ena0: free uncompleted tx mbuf qid 2 idx 0xeeena0: free uncompleted tx mbuf qid 3 idx 0x373ena0: free uncompleted tx mbuf qid 4 idx 0x88ena0: free uncompleted t>
May  7 12:46:38 s4 kernel: ena0: ena0: device is going UP
May  7 12:46:38 s4 kernel: link is UP
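
The reset sequence above can be condensed into a count per trigger and a DOWN-to-UP time per reset. This is a rough sketch rather than a proper log parser: the message substrings are taken from the lines above, the function and pattern names are hypothetical, and it assumes all lines fall on the same day so that only the HH:MM:SS part needs parsing.

```python
import re
from collections import Counter
from datetime import datetime

# Trigger, DOWN and UP lines copied from the kernel log above.
LOG = """\
May  7 12:43:19 s4 kernel: ena0: The number of lost tx completion is above the threshold (129 > 128). Reset the device
May  7 12:43:19 s4 kernel: ena0: device is going DOWN
May  7 12:43:23 s4 kernel: ena0: ena0: device is going UP
May  7 12:45:00 s4 kernel: ena0: The number of lost tx completion is above the threshold (129 > 128). Reset the device
May  7 12:45:00 s4 kernel: ena0: device is going DOWN
May  7 12:45:04 s4 kernel: ena0: ena0: device is going UP
May  7 12:45:26 s4 kernel: ena0: The number of lost tx completion is above the threshold (129 > 128). Reset the device
May  7 12:45:26 s4 kernel: ena0: device is going DOWN
May  7 12:45:30 s4 kernel: ena0: ena0: device is going UP
May  7 12:46:05 s4 kernel: ena0: Keep alive watchdog timeout.
May  7 12:46:05 s4 kernel: ena0: device is going DOWN
May  7 12:46:08 s4 kernel: ena0: ena0: device is going UP
May  7 12:46:36 s4 kernel: ena0: The number of lost tx completion is above the threshold (129 > 128). Reset the device
May  7 12:46:36 s4 kernel: ena0: device is going DOWN
May  7 12:46:38 s4 kernel: ena0: ena0: device is going UP
"""

# Captures the HH:MM:SS field and the message text after "kernel:".
LINE_RE = re.compile(r"^\w{3}\s+\d+\s+(\d\d:\d\d:\d\d)\s+\S+\s+kernel:\s*(.*)$")


def summarize(log):
    """Return (Counter of reset triggers, list of DOWN->UP durations in seconds)."""
    causes = Counter()
    outages = []
    down_at = None
    for line in log.splitlines():
        match = LINE_RE.match(line)
        if not match:
            continue
        when = datetime.strptime(match.group(1), "%H:%M:%S")
        msg = match.group(2)
        if "lost tx completion" in msg:
            causes["lost tx completion"] += 1
        elif "Keep alive watchdog timeout" in msg:
            causes["keep-alive watchdog"] += 1
        elif "device is going DOWN" in msg:
            down_at = when
        elif "device is going UP" in msg and down_at is not None:
            outages.append(int((when - down_at).total_seconds()))
            down_at = None
    return causes, outages


causes, outages = summarize(LOG)
print(dict(causes))
print(outages)
```

On these lines it counts four resets triggered by lost tx completions and one by the keep-alive watchdog, with the interface down for 4, 4, 4, 3 and 2 seconds respectively.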

-- 
You are receiving this mail because:
You are the assignee for the bug.


