From: bugzilla-noreply@freebsd.org
To: virtualization@FreeBSD.org
Subject: [Bug 236989] AWS EC2 lockups "Missing interrupt"
Date: Thu, 07 May 2020 17:31:00 +0000
X-Bugzilla-Product: Base System
X-Bugzilla-Component: kern
X-Bugzilla-Version: 12.0-RELEASE
X-Bugzilla-Severity: Affects Only Me
X-Bugzilla-Who: cao@bus.net
X-Bugzilla-Status: New
X-Bugzilla-Assigned-To: virtualization@FreeBSD.org
List-Id: "Discussion of various virtualization techniques FreeBSD supports."

https://bugs.freebsd.org/bugzilla/show_bug.cgi?id=236989

--- Comment #24 from Charles O'Donnell ---
New development. See three notes below.

N.B. the system appears to have fully recovered. Normally I would have expected
a freeze.

1. CPU alarm from a custom AWS monitor at 16:43 UTC (12:43 PM ET):

Alarm Details:
- Name: Starch CPU
- Description:
- State Change: OK -> ALARM
- Reason for State Change: Threshold Crossed: 1 datapoint [31.4 (07/05/20 16:38:00)] was greater than or equal to the threshold (30.0).
- Timestamp: Thursday 07 May, 2020 16:43:35 UTC
- AWS Account: 539612714288
- Alarm Arn: arn:aws:cloudwatch:us-east-1:539612714288:alarm:Starch CPU

Threshold:
- The alarm is in the ALARM state when the metric is GreaterThanOrEqualToThreshold 30.0 for 300 seconds.

2. Sudden jump in failed 9k mbufs between 12:00 and 13:00 ET:

===> Thu May 7 10:00:00 EDT 2020
mbuf_jumbo_page: 4096, 490945, 0, 56,45464111, 0, 0
mbuf_jumbo_9k: 9216, 145465, 7538, 450,66361278,1640, 0
mbuf_jumbo_16k: 16384, 81824, 0, 0, 0, 0, 0
dev.ena.0.queue7.rx_ring.mjum_alloc_fail: 0
dev.ena.0.queue6.rx_ring.mjum_alloc_fail: 0
dev.ena.0.queue5.rx_ring.mjum_alloc_fail: 0
dev.ena.0.queue4.rx_ring.mjum_alloc_fail: 0
dev.ena.0.queue3.rx_ring.mjum_alloc_fail: 0
dev.ena.0.queue2.rx_ring.mjum_alloc_fail: 0
dev.ena.0.queue1.rx_ring.mjum_alloc_fail: 0
dev.ena.0.queue0.rx_ring.mjum_alloc_fail: 0

===> Thu May 7 11:00:00 EDT 2020
mbuf_jumbo_page: 4096, 490945, 16, 113,45658689, 0, 0
mbuf_jumbo_9k: 9216, 145465, 7592, 397,66645310,1642, 0
mbuf_jumbo_16k: 16384, 81824, 0, 0, 0, 0, 0
dev.ena.0.queue7.rx_ring.mjum_alloc_fail: 0
dev.ena.0.queue6.rx_ring.mjum_alloc_fail: 0
dev.ena.0.queue5.rx_ring.mjum_alloc_fail: 0
dev.ena.0.queue4.rx_ring.mjum_alloc_fail: 0
dev.ena.0.queue3.rx_ring.mjum_alloc_fail: 0
dev.ena.0.queue2.rx_ring.mjum_alloc_fail: 0
dev.ena.0.queue1.rx_ring.mjum_alloc_fail: 0
dev.ena.0.queue0.rx_ring.mjum_alloc_fail: 0

===> Thu May 7 12:00:00 EDT 2020
mbuf_jumbo_page: 4096, 490945, 182, 31,45730287, 0, 0
mbuf_jumbo_9k: 9216, 145465, 7461, 259,66753693,1693, 0
mbuf_jumbo_16k: 16384, 81824, 0, 0, 0, 0, 0
dev.ena.0.queue7.rx_ring.mjum_alloc_fail: 0
dev.ena.0.queue6.rx_ring.mjum_alloc_fail: 0
dev.ena.0.queue5.rx_ring.mjum_alloc_fail: 0
dev.ena.0.queue4.rx_ring.mjum_alloc_fail: 0
dev.ena.0.queue3.rx_ring.mjum_alloc_fail: 0
dev.ena.0.queue2.rx_ring.mjum_alloc_fail: 0
dev.ena.0.queue1.rx_ring.mjum_alloc_fail: 0
dev.ena.0.queue0.rx_ring.mjum_alloc_fail: 0

===> Thu May 7 13:00:00 EDT 2020
mbuf_jumbo_page: 4096, 490945, 119, 109,46249719, 0, 0
mbuf_jumbo_9k: 9216, 145465, 7863, 207,67594999,2577, 0
mbuf_jumbo_16k: 16384, 81824, 0, 0, 0, 0, 0
dev.ena.0.queue7.rx_ring.mjum_alloc_fail: 0
dev.ena.0.queue6.rx_ring.mjum_alloc_fail: 0
dev.ena.0.queue5.rx_ring.mjum_alloc_fail: 0
dev.ena.0.queue4.rx_ring.mjum_alloc_fail: 0
dev.ena.0.queue3.rx_ring.mjum_alloc_fail: 0
dev.ena.0.queue2.rx_ring.mjum_alloc_fail: 0
dev.ena.0.queue1.rx_ring.mjum_alloc_fail: 0
dev.ena.0.queue0.rx_ring.mjum_alloc_fail: 0

3. ena0 reset at 12:43 ET:

May 7 12:43:19 s4 kernel: ena0: The number of lost tx completion is above the threshold (129 > 128). Reset the device
May 7 12:43:19 s4 kernel: ena0: Trigger reset is on
May 7 12:43:19 s4 kernel: ena0: device is going DOWN
May 7 12:43:22 s4 kernel: ena0: free uncompleted tx mbuf qid 3 idx 0x319ena0: free uncompleted tx mbuf qid 7 idx 0x173
May 7 12:43:23 s4 kernel: ena0: ena0: device is going UP
May 7 12:43:23 s4 kernel: link is UP
May 7 12:45:00 s4 kernel: ena0: The number of lost tx completion is above the threshold (129 > 128). Reset the device
May 7 12:45:00 s4 kernel: ena0: Trigger reset is on
May 7 12:45:00 s4 kernel: ena0: device is going DOWN
May 7 12:45:04 s4 kernel: ena0: free uncompleted tx mbuf qid 3 idx 0x102
May 7 12:45:04 s4 kernel: ena0: ena0: device is going UP
May 7 12:45:04 s4 kernel: link is UP
May 7 12:45:26 s4 kernel: ena0: The number of lost tx completion is above the threshold (129 > 128). Reset the device
May 7 12:45:26 s4 kernel: ena0: Trigger reset is on
May 7 12:45:26 s4 kernel: ena0: device is going DOWN
May 7 12:45:29 s4 kernel: ena0: free uncompleted tx mbuf qid 1 idx 0x3c7ena0: free uncompleted tx mbuf qid 2 idx 0x2c5ena0: free uncompleted tx mbuf qid 6 idx 0x2abena0: free uncompleted tx mbuf qid 7 idx 0x241
May 7 12:45:30 s4 kernel:
May 7 12:45:30 s4 kernel: stray irq265
May 7 12:45:30 s4 kernel: ena0: ena0: device is going UP
May 7 12:45:30 s4 kernel: link is UP
May 7 12:46:05 s4 kernel: ena0: Keep alive watchdog timeout.
May 7 12:46:05 s4 kernel: ena0: Trigger reset is on
May 7 12:46:05 s4 kernel: ena0: device is going DOWN
May 7 12:46:07 s4 kernel: ena0: free uncompleted tx mbuf qid 1 idx 0x123ena0: free uncompleted tx mbuf qid 3 idx 0xeeena0: free uncompleted tx mbuf qid 6 idx 0x208
May 7 12:46:08 s4 kernel: ena0: ena0: device is going UP
May 7 12:46:08 s4 kernel: link is UP
May 7 12:46:36 s4 kernel: ena0: The number of lost tx completion is above the threshold (129 > 128). Reset the device
May 7 12:46:36 s4 kernel: ena0: Trigger reset is on
May 7 12:46:36 s4 kernel: ena0: device is going DOWN
May 7 12:46:37 s4 kernel: ena0: free uncompleted tx mbuf qid 0 idx 0x2c2ena0: free uncompleted tx mbuf qid 1 idx 0x135ena0: free uncompleted tx mbuf qid 2 idx 0xeeena0: free uncompleted tx mbuf qid 3 idx 0x373ena0: free uncompleted tx mbuf qid 4 idx 0x88ena0: free uncompleted t>
May 7 12:46:38 s4 kernel: ena0: ena0: device is going UP
May 7 12:46:38 s4 kernel: link is UP

-- 
You are receiving this mail because:
You are the assignee for the bug.
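
For reference, the jump in failed 9k mbuf allocations described in note 2 can be
quantified from the hourly snapshots. The sketch below is not part of the
original report. It assumes the mbuf_jumbo_9k rows follow the `vmstat -z` column
order (SIZE, LIMIT, USED, FREE, REQ, FAIL, SLEEP), which the comment does not
state explicitly; the helper names fail_count and hourly_fail_deltas are
hypothetical.

```python
# mbuf_jumbo_9k rows copied verbatim from the four hourly snapshots (EDT).
# Assumption: columns are SIZE, LIMIT, USED, FREE, REQ, FAIL, SLEEP
# (the `vmstat -z` layout), so FAIL is the 6th numeric field.
SNAPSHOTS = {
    "10:00": "mbuf_jumbo_9k: 9216, 145465, 7538, 450,66361278,1640, 0",
    "11:00": "mbuf_jumbo_9k: 9216, 145465, 7592, 397,66645310,1642, 0",
    "12:00": "mbuf_jumbo_9k: 9216, 145465, 7461, 259,66753693,1693, 0",
    "13:00": "mbuf_jumbo_9k: 9216, 145465, 7863, 207,67594999,2577, 0",
}


def fail_count(row):
    """Return the cumulative FAIL counter (6th numeric field) of one zone row."""
    _, values = row.split(":", 1)
    fields = [int(v) for v in values.split(",")]
    return fields[5]


def hourly_fail_deltas(snapshots):
    """Return (start, end, new_failures) for each pair of consecutive snapshots."""
    times = sorted(snapshots)  # zero-padded HH:MM keys sort chronologically
    return [
        (a, b, fail_count(snapshots[b]) - fail_count(snapshots[a]))
        for a, b in zip(times, times[1:])
    ]


for start, end, delta in hourly_fail_deltas(SNAPSHOTS):
    print(f"{start}-{end} EDT: {delta} new failed 9k allocations")
```

Under that column assumption, the counter grows by 2 and 51 in the first two
intervals and by 884 between 12:00 and 13:00, the hour that contains the ena0
resets in note 3.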