Date: Sun, 09 Sep 2018 00:27:18 +0000
From: bugzilla-noreply@freebsd.org
To: virtualization@FreeBSD.org
Subject: [Bug 225791] ena driver causing kernel panics on AWS EC2
Message-ID: <bug-225791-27103-0P0UYppwQY@https.bugs.freebsd.org/bugzilla/>
In-Reply-To: <bug-225791-27103@https.bugs.freebsd.org/bugzilla/>
References: <bug-225791-27103@https.bugs.freebsd.org/bugzilla/>

https://bugs.freebsd.org/bugzilla/show_bug.cgi?id=225791

Leif Pedersen <leif@ofWilsonCreek.com> changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
                 CC|                            |leif@ofWilsonCreek.com

--- Comment #18 from Leif Pedersen <leif@ofWilsonCreek.com> ---
(In reply to pete from comment #16)

I've been able to reproduce this repeatedly (but not predictably) on 11.2 on an r4.large. Not to state the blindingly obvious, but smaller instances such as t2.* aren't affected since they use xn instead of ena. It seems to be most likely at times of high network IO, which again risks stating the forehead-slappingly obvious. :)

Multiple times, the crash included the same back-trace shown in this bug. However, at least once it panicked on a double-fault, which, if related, suggests that the bug in ena could be incurring memory corruption. Now granted, I only know of one incidence of a double-fault, so it could've been running on a host with faulty RAM or something at the time. However, after each panic, I'd stop/start the instance rather than reboot, to provoke it to move to new hardware, so I'm not suggesting that the whole bug is merely from faulty host hardware.

I might beg that the fix could be patched in 11.2, or at least included in 11.3 so it won't have to wait for 12. Otherwise, AWS users will find themselves stuck on 11.1, and the approaching EOL of 11.1 will leave them without security updates, which in turn makes this an indirect security issue. However, I understand there are other considerations at play, and very much appreciate the relentless work of the security team (not to mention the work on AWS support and FreeBSD in general).

Probably too much detail:

The particular case was our standby MySQL database on an r4.large. It was stable on 11.1, and problematic after I upgraded it to 11.2 (with `freebsd-update upgrade`); after five or so crashes in a month, I downgraded it back to 11.1 (again with `freebsd-update upgrade`), after which it has been perfectly stable for a couple of weeks now.

It's in master-master replication with our production replica, and normally gets a fairly low but steady stream of activity from the replication. However, we have several nightly jobs that crank away on updating a model and cause a large volume of traffic in the replication stream. I don't have proper metrics on bytes/sec, so I don't have any idea whether it saturates the interface. It's enough that replication falls behind for up to a few hours, but I wouldn't call our system "huge" in terms of network traffic by any means.

The reason I included all that detail is to point out:
(1) it seems to be a regression between 11.1 and 11.2,
(2) r4.* are for sure affected, and
(3) it may be that the problem is more likely to be triggered on moderate or bursty network traffic with much task-switching between MySQL threads, compared to a simple stream of a high speed file transfer, for example.

-Leif

-- 
You are receiving this mail because:
You are the assignee for the bug.
