From: bugzilla-noreply@freebsd.org
To: virtualization@FreeBSD.org
Subject: [Bug 243531] Unstable ena and nvme on AWS
Date: Fri, 21 Feb 2020 18:44:43 +0000
X-Bugzilla-Product: Base System
X-Bugzilla-Component: kern
X-Bugzilla-Version: 12.1-RELEASE
X-Bugzilla-Severity: Affects Some People
X-Bugzilla-Who: leif@ofWilsonCreek.com
X-Bugzilla-Status: New
X-Bugzilla-Assigned-To: virtualization@FreeBSD.org

https://bugs.freebsd.org/bugzilla/show_bug.cgi?id=243531

--- Comment #2 from Leif Pedersen ---
I'm at a bit of a loss to come up with anything particularly helpful. A few
thoughts, although mostly naive observations and wild speculation -

It kind of seems like when one machine has a problem, several others do also.
This suggests that it could be triggered by a shared event in the host's
networking or EBS. (None of our instances have local storage.)
I don't have enough machines or samples to show that it's not just a
coincidence, though.

The nvme errors are always (or almost always?) accompanied by ena errors, but
ena errors sometimes happen without nvme errors. That suggests it might be
triggered by a network event in the AWS hosting infrastructure, like a network
topology change or something.

I'll attach a /var/log/all.log and the screenshot from a crash that happened
today. Probably nothing new there. This time, the machine did not panic, but
rather wedged after Nagios reported its CPU load at 9. There's nothing running
on this one besides the hourly zfs snapshot transfers, so I think the load
came from processes piling up waiting for IO.

The timing of error messages stretches out over many minutes, starting with ena
errors at 02:20:16, with nvme errors finally appearing at 02:28:06. Seems odd,
like a problem that ramps up slowly rather than an abrupt crash.

It's also interesting that these messages on the console screenshot made it
into syslog, so IO must have recovered, if only briefly.
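To compare the ena-to-nvme delay across incidents, something like the following might help. It is only a rough sketch: it assumes classic BSD syslog lines ("Mon DD HH:MM:SS host ...") and that the driver messages carry device names such as ena0 or nvme0. The function names and the year default are made up for illustration and nothing here is taken from the attached log.

```python
import re
from datetime import datetime

# Assumed line shape: "Feb 21 02:20:16 host kernel: ena0: ..." (classic BSD syslog).
STAMP = re.compile(r"^(\w{3}\s+\d{1,2} \d{2}:\d{2}:\d{2}) ")


def first_seen(lines, device_pattern, year=2020):
    """Return the timestamp of the first line mentioning a matching device, or None."""
    dev = re.compile(device_pattern)
    for line in lines:
        m = STAMP.match(line)
        if m and dev.search(line):
            # Collapse the double space syslog uses for single-digit days.
            stamp = " ".join(m.group(1).split())
            # Syslog timestamps carry no year, so one has to be supplied.
            return datetime.strptime(f"{year} {stamp}", "%Y %b %d %H:%M:%S")
    return None


def gap_seconds(lines):
    """Seconds from the first ena message to the first nvme message, or None."""
    lines = list(lines)  # allow a file object; we scan it twice
    ena = first_seen(lines, r"\bena\d+\b")    # assumed device naming
    nvme = first_seen(lines, r"\bnvme\d+\b")  # assumed device naming
    if ena is None or nvme is None:
        return None
    return (nvme - ena).total_seconds()
```

For the times quoted above (02:20:16 and 02:28:06) the gap works out to 470 seconds, i.e. 7 min 50 s. Running gap_seconds(open("/var/log/all.log")) on each affected machine would show whether that delay is consistent between incidents.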