From owner-freebsd-virtualization@freebsd.org Mon May 11 20:56:09 2020 Return-Path: Delivered-To: freebsd-virtualization@mailman.nyi.freebsd.org Received: from mx1.freebsd.org (mx1.freebsd.org [IPv6:2610:1c1:1:606c::19:1]) by mailman.nyi.freebsd.org (Postfix) with ESMTP id F1FAA2EC0DA for ; Mon, 11 May 2020 20:56:09 +0000 (UTC) (envelope-from bugzilla-noreply@freebsd.org) Received: from mailman.nyi.freebsd.org (mailman.nyi.freebsd.org [IPv6:2610:1c1:1:606c::50:13]) by mx1.freebsd.org (Postfix) with ESMTP id 49LYBK6HjNz48gr for ; Mon, 11 May 2020 20:56:09 +0000 (UTC) (envelope-from bugzilla-noreply@freebsd.org) Received: by mailman.nyi.freebsd.org (Postfix) id D7BC22EC0D8; Mon, 11 May 2020 20:56:09 +0000 (UTC) Delivered-To: virtualization@mailman.nyi.freebsd.org Received: from mx1.freebsd.org (mx1.freebsd.org [IPv6:2610:1c1:1:606c::19:1]) by mailman.nyi.freebsd.org (Postfix) with ESMTP id D773C2EC0D7 for ; Mon, 11 May 2020 20:56:09 +0000 (UTC) (envelope-from bugzilla-noreply@freebsd.org) Received: from mxrelay.nyi.freebsd.org (mxrelay.nyi.freebsd.org [IPv6:2610:1c1:1:606c::19:3]) (using TLSv1.3 with cipher TLS_AES_256_GCM_SHA384 (256/256 bits) server-signature RSA-PSS (4096 bits) client-signature RSA-PSS (4096 bits) client-digest SHA256) (Client CN "mxrelay.nyi.freebsd.org", Issuer "Let's Encrypt Authority X3" (verified OK)) by mx1.freebsd.org (Postfix) with ESMTPS id 49LYBK5fFqz48gp for ; Mon, 11 May 2020 20:56:09 +0000 (UTC) (envelope-from bugzilla-noreply@freebsd.org) Received: from kenobi.freebsd.org (kenobi.freebsd.org [IPv6:2610:1c1:1:606c::50:1d]) (using TLSv1.3 with cipher TLS_AES_256_GCM_SHA384 (256/256 bits) server-signature RSA-PSS (4096 bits)) (Client did not present a certificate) by mxrelay.nyi.freebsd.org (Postfix) with ESMTPS id BC8F424801 for ; Mon, 11 May 2020 20:56:09 +0000 (UTC) (envelope-from bugzilla-noreply@freebsd.org) Received: from kenobi.freebsd.org ([127.0.1.5]) by kenobi.freebsd.org (8.15.2/8.15.2) with ESMTP id 04BKu9AE057813 for ; Mon, 11 May 2020 20:56:09 GMT (envelope-from bugzilla-noreply@freebsd.org) Received: (from www@localhost) by kenobi.freebsd.org (8.15.2/8.15.2/Submit) id 04BKu9Au057812 for virtualization@FreeBSD.org; Mon, 11 May 2020 20:56:09 GMT (envelope-from bugzilla-noreply@freebsd.org) X-Authentication-Warning: kenobi.freebsd.org: www set sender to bugzilla-noreply@freebsd.org using -f From: bugzilla-noreply@freebsd.org To: virtualization@FreeBSD.org Subject: [Bug 236989] AWS EC2 lockups "Missing interrupt" Date: Mon, 11 May 2020 20:56:09 +0000 X-Bugzilla-Reason: AssignedTo X-Bugzilla-Type: changed X-Bugzilla-Watch-Reason: None X-Bugzilla-Product: Base System X-Bugzilla-Component: kern X-Bugzilla-Version: 12.0-RELEASE X-Bugzilla-Keywords: X-Bugzilla-Severity: Affects Only Me X-Bugzilla-Who: cao@bus.net X-Bugzilla-Status: New X-Bugzilla-Resolution: X-Bugzilla-Priority: --- X-Bugzilla-Assigned-To: virtualization@FreeBSD.org X-Bugzilla-Flags: X-Bugzilla-Changed-Fields: Message-ID: In-Reply-To: References: Content-Type: text/plain; charset="UTF-8" Content-Transfer-Encoding: quoted-printable X-Bugzilla-URL: https://bugs.freebsd.org/bugzilla/ Auto-Submitted: auto-generated MIME-Version: 1.0 X-BeenThere: freebsd-virtualization@freebsd.org X-Mailman-Version: 2.1.33 Precedence: list List-Id: "Discussion of various virtualization techniques FreeBSD supports." List-Unsubscribe: , List-Archive: List-Post: List-Help: List-Subscribe: , X-List-Received-Date: Mon, 11 May 2020 20:56:10 -0000 https://bugs.freebsd.org/bugzilla/show_bug.cgi?id=3D236989 --- Comment #26 from Charles O'Donnell --- Not sure if this is expected behavior with the patched kernel. Picking up from day 6 above: =3D=3D=3D> Thu May 7 12:00:00 EDT 2020 mbuf_jumbo_page: 4096, 490945, 182, 31,45730287, 0, 0 mbuf_jumbo_9k: 9216, 145465, 7461, 259,66753693,1693, 0 mbuf_jumbo_16k: 16384, 81824, 0, 0, 0, 0, 0 dev.ena.0.queue7.rx_ring.mjum_alloc_fail: 0 dev.ena.0.queue6.rx_ring.mjum_alloc_fail: 0 dev.ena.0.queue5.rx_ring.mjum_alloc_fail: 0 dev.ena.0.queue4.rx_ring.mjum_alloc_fail: 0 dev.ena.0.queue3.rx_ring.mjum_alloc_fail: 0 dev.ena.0.queue2.rx_ring.mjum_alloc_fail: 0 dev.ena.0.queue1.rx_ring.mjum_alloc_fail: 0 dev.ena.0.queue0.rx_ring.mjum_alloc_fail: 0 =3D=3D=3D> Fri May 8 12:00:00 EDT 2020 mbuf_jumbo_page: 4096, 490945, 82, 71,52207112, 0, 0 mbuf_jumbo_9k: 9216, 145465, 7551, 351,75614231,2625, 0 mbuf_jumbo_16k: 16384, 81824, 0, 0, 0, 0, 0 dev.ena.0.queue7.rx_ring.mjum_alloc_fail: 0 dev.ena.0.queue6.rx_ring.mjum_alloc_fail: 0 dev.ena.0.queue5.rx_ring.mjum_alloc_fail: 0 dev.ena.0.queue4.rx_ring.mjum_alloc_fail: 0 dev.ena.0.queue3.rx_ring.mjum_alloc_fail: 0 dev.ena.0.queue2.rx_ring.mjum_alloc_fail: 0 dev.ena.0.queue1.rx_ring.mjum_alloc_fail: 0 dev.ena.0.queue0.rx_ring.mjum_alloc_fail: 0 =3D=3D=3D> Sat May 9 12:00:00 EDT 2020 mbuf_jumbo_page: 4096, 490945, 139, 69,56246475, 0, 0 mbuf_jumbo_9k: 9216, 145465, 7799, 339,83036796,2784, 0 mbuf_jumbo_16k: 16384, 81824, 0, 0, 0, 0, 0 dev.ena.0.queue7.rx_ring.mjum_alloc_fail: 0 dev.ena.0.queue6.rx_ring.mjum_alloc_fail: 0 dev.ena.0.queue5.rx_ring.mjum_alloc_fail: 0 dev.ena.0.queue4.rx_ring.mjum_alloc_fail: 0 dev.ena.0.queue3.rx_ring.mjum_alloc_fail: 0 dev.ena.0.queue2.rx_ring.mjum_alloc_fail: 0 dev.ena.0.queue1.rx_ring.mjum_alloc_fail: 0 dev.ena.0.queue0.rx_ring.mjum_alloc_fail: 0 =3D=3D=3D> Sun May 10 12:00:00 EDT 2020 mbuf_jumbo_page: 4096, 490945, 39, 51,59639876, 0, 0 mbuf_jumbo_9k: 9216, 145465, 7663, 222,85652082,2963, 0 mbuf_jumbo_16k: 16384, 81824, 0, 0, 0, 0, 0 dev.ena.0.queue7.rx_ring.mjum_alloc_fail: 0 dev.ena.0.queue6.rx_ring.mjum_alloc_fail: 0 dev.ena.0.queue5.rx_ring.mjum_alloc_fail: 0 dev.ena.0.queue4.rx_ring.mjum_alloc_fail: 0 dev.ena.0.queue3.rx_ring.mjum_alloc_fail: 0 dev.ena.0.queue2.rx_ring.mjum_alloc_fail: 0 dev.ena.0.queue1.rx_ring.mjum_alloc_fail: 0 dev.ena.0.queue0.rx_ring.mjum_alloc_fail: 0 =3D=3D=3D> Mon May 11 12:00:00 EDT 2020 mbuf_jumbo_page: 4096, 490945, 1598, 223,63502113, 0, 0 mbuf_jumbo_9k: 9216, 145465, 6230, 90,91087037,4248, 0 mbuf_jumbo_16k: 16384, 81824, 0, 0, 0, 0, 0 dev.ena.0.queue7.rx_ring.mjum_alloc_fail: 0 dev.ena.0.queue6.rx_ring.mjum_alloc_fail: 0 dev.ena.0.queue5.rx_ring.mjum_alloc_fail: 0 dev.ena.0.queue4.rx_ring.mjum_alloc_fail: 0 dev.ena.0.queue3.rx_ring.mjum_alloc_fail: 0 dev.ena.0.queue2.rx_ring.mjum_alloc_fail: 0 dev.ena.0.queue1.rx_ring.mjum_alloc_fail: 0 dev.ena.0.queue0.rx_ring.mjum_alloc_fail: 0 System spinning. Had to reboot. Related /var/log/messages activity show below. May 11 12:20:34 s4 kernel: ena0: The number of lost tx completion is above = the threshold (129 > 128). Reset the device May 11 12:20:34 s4 kernel: ena0: Trigger reset is on May 11 12:20:34 s4 kernel: ena0: device is going DOWN May 11 12:20:39 s4 kernel: ena0: free uncompleted tx mbuf qid 1 idx 0x2e1 May 11 12:20:40 s4 kernel: ena0: device is going UP May 11 12:20:40 s4 kernel: ena0: link is UP May 11 12:21:55 s4 kernel: ena0: The number of lost tx completion is above = the threshold (129 > 128). Reset the device May 11 12:21:55 s4 kernel: ena0: Trigger reset is on May 11 12:21:55 s4 kernel: ena0: device is going DOWN May 11 12:22:02 s4 kernel: ena0: free uncompleted tx mbuf qid 1 idx 0x172en= a0: free uncompleted tx mbuf qid 3 idx 0x66ena0: free> May 11 12:22:03 s4 kernel:=20 May 11 12:22:03 s4 kernel: stray irq265 May 11 12:22:03 s4 kernel: ena0: ena0: device is going UP May 11 12:22:03 s4 kernel: link is UP May 11 12:22:29 s4 kernel: ena0: The number of lost tx completion is above = the threshold (129 > 128). Reset the device May 11 12:22:29 s4 kernel: ena0: Trigger reset is on May 11 12:22:29 s4 kernel: ena0: device is going DOWN May 11 12:22:32 s4 kernel: ena0: free uncompleted tx mbuf qid 2 idx 0x32ben= a0: free uncompleted tx mbuf qid 4 idx 0x338ena0: fre> May 11 12:22:33 s4 kernel:=20 May 11 12:22:33 s4 kernel: stray irq265 May 11 12:22:33 s4 kernel: ena0: ena0: device is going UP May 11 12:22:33 s4 kernel: link is UP May 11 12:23:04 s4 kernel: ena0: The number of lost tx completion is above = the threshold (129 > 128). Reset the device May 11 12:23:04 s4 kernel: ena0: Trigger reset is on May 11 12:23:04 s4 kernel: ena0: device is going DOWN May 11 12:23:11 s4 kernel: ena0: free uncompleted tx mbuf qid 6 idx 0x10e May 11 12:23:12 s4 kernel: ena0: ena0: device is going UP May 11 12:23:12 s4 kernel: link is UP May 11 12:25:33 s4 kernel: ena0: The number of lost tx completion is above = the threshold (129 > 128). Reset the device May 11 12:25:33 s4 kernel: ena0: Trigger reset is on May 11 12:25:33 s4 kernel: ena0: device is going DOWN May 11 12:25:38 s4 kernel: ena0: free uncompleted tx mbuf qid 0 idx 0x0 May 11 12:25:39 s4 kernel: ena0: ena0: device is going UP May 11 12:25:39 s4 kernel: link is UP May 11 12:28:36 s4 kernel: ena0: Keep alive watchdog timeout. May 11 12:28:36 s4 kernel: ena0: Trigger reset is on May 11 12:28:36 s4 kernel: ena0: device is going DOWN May 11 12:28:39 s4 kernel: ena0: device is going UP May 11 12:28:39 s4 kernel: ena0: link is UP May 11 12:30:04 s4 kernel: nvme4: cpl does not map to outstanding cmd May 11 12:30:04 s4 kernel: cdw0:00000000 sqhd:0013 sqid:0002 cid:0012 p:0 s= c:00 sct:0 m:0 dnr:0 May 11 12:30:04 s4 kernel: nvme4: Missing interrupt May 11 12:30:04 s4 kernel: nvme1: Missing interrupt May 11 12:30:34 s4 kernel: nvme4: Resetting controller due to a timeout. May 11 12:30:34 s4 kernel: nvme4: resetting controller May 11 12:30:47 s4 kernel: nvme4: temperature threshold not supported May 11 12:30:47 s4 kernel: nvme4: aborting outstanding i/o May 11 12:30:47 s4 syslogd: last message repeated 3 times May 11 12:31:26 s4 kernel: ena0: The number of lost tx completion is above = the threshold (129 > 128). Reset the device May 11 12:31:26 s4 kernel: ena0: Trigger reset is on May 11 12:31:26 s4 kernel: ena0: device is going DOWN May 11 12:31:27 s4 kernel: ena0: free uncompleted tx mbuf qid 2 idx 0x20fen= a0: free uncompleted tx mbuf qid 4 idx 0x8fena0: free> May 11 12:31:28 s4 kernel:=20 May 11 12:31:28 s4 kernel: stray irq266 May 11 12:31:28 s4 kernel: stray irq272 May 11 12:31:28 s4 kernel: stray irq265 May 11 12:31:28 s4 kernel: ena0: device is going UP May 11 12:31:28 s4 kernel: ena0: link is UP May 11 12:32:46 s4 kernel: nvme1: Missing interrupt May 11 12:32:46 s4 kernel: nvme4: cpl does not map to outstanding cmd May 11 12:32:46 s4 kernel: cdw0:00000000 sqhd:0007 sqid:0001 cid:0015 p:1 s= c:00 sct:0 m:0 dnr:0 May 11 12:32:46 s4 kernel: nvme4: Missing interrupt May 11 12:32:46 s4 kernel: nvme2: Missing interrupt May 11 12:33:16 s4 kernel: nvme4: Resetting controller due to a timeout. May 11 12:33:16 s4 kernel: nvme4: resetting controller May 11 12:33:17 s4 kernel: nvme4: temperature threshold not supported May 11 12:33:17 s4 kernel: nvme4: aborting outstanding i/o May 11 12:33:17 s4 syslogd: last message repeated 2 times May 11 12:33:40 s4 kernel: ena0: The number of lost tx completion is above = the threshold (129 > 128). Reset the device May 11 12:33:40 s4 kernel: ena0: Trigger reset is on May 11 12:33:40 s4 kernel: ena0: device is going DOWN May 11 12:33:43 s4 kernel: ena0: free uncompleted tx mbuf qid 3 idx 0x0 May 11 12:33:44 s4 kernel: ena0: ena0: device is going UP May 11 12:33:44 s4 kernel: link is UP May 11 12:34:15 s4 kernel: ena0: Keep alive watchdog timeout. May 11 12:34:15 s4 kernel: ena0: Trigger reset is on May 11 12:34:15 s4 kernel: ena0: device is going DOWN May 11 12:34:21 s4 kernel: ena0: ena0: device is going UP May 11 12:34:21 s4 kernel: link is UP May 11 12:35:33 s4 kernel: ena0: The number of lost tx completion is above = the threshold (129 > 128). Reset the device May 11 12:35:33 s4 kernel: ena0: Trigger reset is on May 11 12:35:33 s4 kernel: ena0: device is going DOWN May 11 12:35:42 s4 kernel: ena0: free uncompleted tx mbuf qid 3 idx 0x34 May 11 12:35:43 s4 kernel: ena0: ena0: device is going UP May 11 12:35:43 s4 kernel: link is UP May 11 12:36:16 s4 kernel: ena0: Keep alive watchdog timeout. May 11 12:36:16 s4 kernel: ena0: Trigger reset is on May 11 12:36:16 s4 kernel: ena0: device is going DOWN May 11 12:36:21 s4 kernel: ena0: free uncompleted tx mbuf qid 7 idx 0x20a May 11 12:36:22 s4 kernel: ena0: ena0: device is going UP May 11 12:36:22 s4 kernel: link is UP May 11 12:37:05 s4 kernel: ena0: The number of lost tx completion is above = the threshold (129 > 128). Reset the device May 11 12:37:05 s4 kernel: ena0: Trigger reset is on May 11 12:37:05 s4 kernel: ena0: device is going DOWN May 11 12:37:09 s4 kernel: ena0: free uncompleted tx mbuf qid 4 idx 0x23f May 11 12:37:10 s4 kernel: ena0: ena0: device is going UP May 11 12:37:10 s4 kernel: link is UP May 11 12:38:38 s4 kernel: ena0: The number of lost tx completion is above = the threshold (129 > 128). Reset the device May 11 12:38:38 s4 kernel: ena0: Trigger reset is on May 11 12:38:38 s4 kernel: ena0: device is going DOWN May 11 12:38:43 s4 kernel: ena0: free uncompleted tx mbuf qid 5 idx 0x1b3 May 11 12:38:44 s4 kernel:=20 May 11 12:38:44 s4 kernel: too many stray irq 265's: not logging anymore May 11 12:38:44 s4 kernel: ena0: ena0: device is going UP May 11 12:38:44 s4 kernel: link is UP May 11 12:39:17 s4 kernel: ena0: The number of lost tx completion is above = the threshold (129 > 128). Reset the device May 11 12:39:17 s4 kernel: ena0: Trigger reset is on May 11 12:39:17 s4 kernel: ena0: device is going DOWN May 11 12:39:25 s4 kernel: ena0: free uncompleted tx mbuf qid 1 idx 0x238en= a0: free uncompleted tx mbuf qid 6 idx 0x249 May 11 12:39:26 s4 kernel: ena0: ena0: device is going UP May 11 12:39:26 s4 kernel: link is UP May 11 12:40:20 s4 kernel: ena0: The number of lost tx completion is above = the threshold (129 > 128). Reset the device May 11 12:40:20 s4 kernel: ena0: Trigger reset is on May 11 12:40:20 s4 kernel: ena0: device is going DOWN May 11 12:40:24 s4 kernel: ena0: free uncompleted tx mbuf qid 0 idx 0x3c0 May 11 12:40:25 s4 kernel: ena0: device is going UP May 11 12:40:25 s4 kernel: ena0: link is UP May 11 12:41:17 s4 kernel: ena0: Keep alive watchdog timeout. May 11 12:41:17 s4 kernel: ena0: Trigger reset is on May 11 12:41:17 s4 kernel: ena0: device is going DOWN May 11 12:41:22 s4 kernel: ena0: device is going UP May 11 12:41:22 s4 kernel: ena0: link is UP May 11 12:42:17 s4 kernel: ena0: Keep alive watchdog timeout. May 11 12:42:17 s4 kernel: ena0: Trigger reset is on May 11 12:42:17 s4 kernel: ena0: device is going DOWN May 11 12:42:17 s4 sm-msp-queue[70209]: 04BGPgiQ066459: SYSERR(root): timeo= ut writing message to [127.0.0.1] May 11 12:42:21 s4 kernel: ena0: ena0: device is going UP May 11 12:42:21 s4 kernel: link is UP May 11 12:43:01 s4 kernel: ena0: The number of lost tx completion is above = the threshold (129 > 128). Reset the device May 11 12:43:01 s4 kernel: ena0: Trigger reset is on May 11 12:43:01 s4 kernel: ena0: device is going DOWN May 11 12:43:06 s4 kernel: ena0: free uncompleted tx mbuf qid 2 idx 0x359 May 11 12:43:07 s4 kernel:=20 May 11 12:43:07 s4 kernel: stray irq266 May 11 12:43:07 s4 kernel: ena0: ena0: device is going UP May 11 12:43:07 s4 kernel: link is UP May 11 12:45:23 s4 kernel: ena0: Keep alive watchdog timeout. May 11 12:45:23 s4 kernel: ena0: Trigger reset is on May 11 12:45:23 s4 kernel: ena0: device is going DOWN May 11 12:45:33 s4 kernel: ena0: ena0: device is going UP May 11 12:45:33 s4 kernel: link is UP May 11 12:48:08 s4 kernel: ena0: Keep alive watchdog timeout. May 11 12:48:08 s4 kernel: ena0: Trigger reset is on May 11 12:48:08 s4 kernel: ena0: device is going DOWN May 11 12:48:11 s4 kernel: ena0: ena0: device is going UP May 11 12:48:11 s4 kernel: link is UP May 11 12:48:51 s4 kernel: ena0: The number of lost tx completion is above = the threshold (129 > 128). Reset the device May 11 12:48:51 s4 kernel: ena0: Trigger reset is on May 11 12:48:51 s4 kernel: ena0: device is going DOWN May 11 12:48:55 s4 kernel: ena0: free uncompleted tx mbuf qid 7 idx 0x266 May 11 12:48:55 s4 kernel: ena0: ena0: device is going UP May 11 12:48:55 s4 kernel: link is UP May 11 12:50:38 s4 kernel: ena0: Keep alive watchdog timeout. May 11 12:50:38 s4 kernel: ena0: Trigger reset is on May 11 12:50:38 s4 kernel: ena0: device is going DOWN May 11 12:50:41 s4 kernel: ena0: ena0: device is going UP May 11 12:50:41 s4 kernel: link is UP May 11 12:51:43 s4 kernel: ena0: The number of lost tx completion is above = the threshold (129 > 128). Reset the device May 11 12:51:43 s4 kernel: ena0: Trigger reset is on May 11 12:51:43 s4 kernel: ena0: device is going DOWN May 11 12:51:50 s4 kernel: ena0: free uncompleted tx mbuf qid 5 idx 0x358 May 11 12:51:51 s4 kernel: ena0: ena0: device is going UP May 11 12:51:51 s4 kernel: link is UP May 11 12:53:24 s4 kernel: ena0: The number of lost tx completion is above = the threshold (129 > 128). Reset the device May 11 12:53:24 s4 kernel: ena0: Trigger reset is on May 11 12:53:24 s4 kernel: ena0: device is going DOWN May 11 12:53:28 s4 kernel: ena0: free uncompleted tx mbuf qid 5 idx 0x100 May 11 12:53:29 s4 kernel: ena0: ena0: device is going UP May 11 12:53:29 s4 kernel: link is UP May 11 12:56:19 s4 kernel: ena0: The number of lost tx completion is above = the threshold (129 > 128). Reset the device May 11 12:56:19 s4 kernel: ena0: Trigger reset is on May 11 12:56:19 s4 kernel: ena0: device is going DOWN May 11 12:56:22 s4 kernel: ena0: free uncompleted tx mbuf qid 4 idx 0x3d8 May 11 12:56:23 s4 kernel: ena0: ena0: device is going UP May 11 12:56:23 s4 kernel: link is UP May 11 12:57:36 s4 kernel: . --=20 You are receiving this mail because: You are the assignee for the bug.=