Date:      Fri, 21 Feb 2020 18:44:43 +0000
From:      bugzilla-noreply@freebsd.org
To:        virtualization@FreeBSD.org
Subject:   [Bug 243531] Unstable ena and nvme on AWS
Message-ID:  <bug-243531-27103-UZkEaRTrUn@https.bugs.freebsd.org/bugzilla/>
In-Reply-To: <bug-243531-27103@https.bugs.freebsd.org/bugzilla/>
References:  <bug-243531-27103@https.bugs.freebsd.org/bugzilla/>

https://bugs.freebsd.org/bugzilla/show_bug.cgi?id=243531

--- Comment #2 from Leif Pedersen <leif@ofWilsonCreek.com> ---
I'm at a bit of a loss to come up with anything particularly helpful. A few
thoughts, although mostly naive observations and wild speculation -

It kind of seems like when one machine has a problem, several others do also.
This suggests that it could be triggered by a shared event in the host's
networking or EBS. (None of our instances have local storage.) I don't have
enough machines or samples to show that it's not just a coincidence though.

The nvme errors are always (or almost always?) accompanied by ena errors, but
ena errors happen without nvme errors sometimes. That suggests it might be
triggered by a network event in the AWS hosting infrastructure, like a network
topology change or something.

I'll attach a /var/log/all.log and the screenshot from a crash that happened
today. Probably nothing new there. This time, the machine did not panic, but
rather wedged after Nagios reported its CPU load at 9. There's nothing running
on this one besides the hourly zfs snapshot transfers, so I think the load came
from processes piled up waiting for IO.

The timing of error messages stretches out over many minutes, starting with ena
errors at 02:20:16, and nvme errors finally happen at 02:28:06. Seems odd, like
a problem that ramps up rather slowly rather than an abrupt crash.

It's also interesting that these messages on the console screenshot made it
into syslog, so IO must have recovered, if only briefly.

-- 
You are receiving this mail because:
You are the assignee for the bug.


