Subject: Re: jenkins bhyve vms crashing and burning after several days of use
From: Sean Bruno
Reply-To: sbruno@freebsd.org
To: Neel Natu
Cc: "freebsd-virtualization@freebsd.org"
Date: Fri, 27 Jun 2014 16:46:51 -0700
Message-ID: <1403912811.5727.7.camel@bruno>
References: <1403818926.2417.6.camel@bruno> <1403819194.2417.8.camel@bruno> <1403821402.2417.12.camel@bruno>
List-Id: "Discussion of various virtualization techniques FreeBSD supports."

On Fri, 2014-06-27 at 14:23 -0700, Neel Natu wrote:
> Hi,
>
> On Thu, Jun 26, 2014 at 3:43 PM, Neel Natu wrote:
> > Hi Sean,
> >
> > On Thu, Jun 26, 2014 at 3:23 PM, Sean Bruno wrote:
> >> On Thu, 2014-06-26 at 15:00 -0700, Neel Natu wrote:
> >>> Hi Sean,
> >>>
> >>> On Thu, Jun 26, 2014 at 2:46 PM, Sean Bruno wrote:
> >>> > On Thu, 2014-06-26 at 14:42 -0700, Sean Bruno wrote:
> >>> >> so, we're seeing the bhyve vms running in the freebsd cluster for
> >>> >> jenkins crashing and burning after a couple of days of use.
> >>> >>
> >>> >> vm exit[9]
> >>> >> reason VMX
> >>> >> rip 0x0000000029286336
> >>> >> inst_length 3
> >>> >> status 0
> >>> >> exit_reason 49
> >>> >> qualification 0x0000000000000000
> >>> >> inst_type 0
> >>> >> inst_error 0
> >>> >>
> >>> >>
> >>> >> It looks like we have an active core file on havoc.ysv if you have a
> >>> >> moment to look at it:
> >>> >>
> >>> >> http://people.freebsd.org/~sbruno/bhyve.core
> >>> >>
> >>> >> FreeBSD havoc.ysv.freebsd.org 11.0-CURRENT FreeBSD 11.0-CURRENT #2
> >>> >> r267362: Wed Jun 11 14:56:34 UTC 2014
> >>> >> sbruno@havoc.freebsd.org:/usr/obj/usr/src/sys/HAVOC amd64
> >>> >>
> >>> >
> >>> > Also, from chaos.ysv
> >>> >
> >>> > http://people.freebsd.org/~sbruno/bhyve.core.chaos
> >>> >
> >>> > FreeBSD chaos.ysv.freebsd.org 11.0-CURRENT FreeBSD 11.0-CURRENT #1
> >>> > r267362: Wed Jun 11 15:50:24 UTC 2014
> >>> > sbruno@chaos.ysv.freebsd.org:/usr/obj/usr/src/sys/CHAOS amd64
> >>> >
> >>>
> >>> Can you tell us the processor and memory configuration on havoc and chaos?
> >>>
> >>> Also, could you execute the following commands on havoc:
> >>>
> >>> # bhyvectl --vm=vmname --cpu=9 --get-vmcs-guest-physical-address
> >>> -- this will output the offending guest physical address that
> >>> triggered the EPT misconfiguration
> >>>
> >>> # bhyvectl --vm=vmname --get-gpa-pmap=
> >>> -- this will output the page table entries in the EPT that map to the
> >>> offending GPA
> >>>
> >>> Hopefully that provides us with something to work with.
> >>>
> >>> best
> >>> Neel
> >>>
> >>> >
> >>
> >> chaos:
> >> CPU: Intel(R) Xeon(R) CPU E5-2660 0 @ 2.20GHz (2200.05-MHz K8-class CPU)
> >> Origin="GenuineIntel" Id=0x206d6 Family=0x6 Model=0x2d Stepping=6
> >> Features=0xbfebfbff
> >> Features2=0x1fbee3ff
> >> AMD Features=0x2c100800
> >> AMD Features2=0x1
> >> TSC: P-state invariant, performance statistics
> >> avail memory = 66298322944 (63227 MB)
> >>
> >> havoc:
> >> FreeBSD clang version 3.4.1 (tags/RELEASE_34/dot1-final 208032) 20140512
> >> CPU: Intel(R) Xeon(R) CPU E5620 @ 2.40GHz (2400.14-MHz K8-class CPU)
> >> Origin="GenuineIntel" Id=0x206c2 Family=0x6 Model=0x2c Stepping=2
> >> Features=0xbfebfbff
> >> Features2=0x29ee3ff
> >> AMD Features=0x2c100800
> >> AMD Features2=0x1
> >> TSC: P-state invariant, performance statistics
> >> avail memory = 16571621376 (15803 MB)
> >>
> >
> > Thanks, we'll see if there are relevant errata for these processors.
> >
>
> Actually these processors have entirely different microarchitectures
> (Nehalem and Sandy Bridge) so it's unlikely that this is due to
> processor errata.
>
> >>
> >> There appear to be three vms running on havoc:
> >> root@havoc.ysv:/home/sbruno # bhyvectl --vm=vm1 --cpu=9 --get-vmcs-guest-physical-address
> >> gpa[9] 0x0000000000000000
> >> root@havoc.ysv:/home/sbruno # bhyvectl --vm=vm2 --cpu=9 --get-vmcs-guest-physical-address
> >> gpa[9] 0x0000000000000000
> >> root@havoc.ysv:/home/sbruno # bhyvectl --vm=vm3 --cpu=9 --get-vmcs-guest-physical-address
> >> gpa[9] 0x0000000000000000
> >>
> >> root@havoc.ysv:/home/sbruno # bhyvectl --vm=vm1 --cpu=9 --get-gpa-pmap=0x0000000000000000
> >> gpa 0: 0x300002c936e007 0x300002c9353007 0x300002c9352007 0
> >>
> >> root@havoc.ysv:/home/sbruno # bhyvectl --vm=vm2 --cpu=9 --get-gpa-pmap=0x0000000000000000
> >> gpa 0: 0x30000286cb0007 0x300003ad105007 0x3000019b1fd007 0
> >>
> >> root@havoc.ysv:/home/sbruno # bhyvectl --vm=vm3 --cpu=9 --get-gpa-pmap=0x0000000000000000
> >> gpa 0: 0x300002c9348007 0x300002c9339007 0
> >>
> >>
> >> But there's no information available on chaos at the moment as there are
> >> no active vms running.
> >>
> >
> > Sorry, I should have explained a bit more.
> >
> > After bhyve(8) exits because of the EPT misconfiguration error there
> > are breadcrumbs left over in the VMCS as well as the nested page
> > tables. We can use them to diagnose what happened.
> >
> > The bhyvectl commands above should be executed after the VM exits but
> > before it is restarted again. Once it restarts, the breadcrumbs get
> > written over and are of no use.
> >
> > The "--vm=" passed to the bhyvectl command should be that of the
> > virtual machine that crashed.
> > The "--cpu=" passed to the bhyvectl command should be the
> > vcpuid that detected the EPT misconfiguration. The reason I used '9'
> > as an example above was because you saw this on the console:
> >
> > vm exit[9]
> > reason VMX
> > rip 0x0000000029286336
> >
> > Hope that helps.
> >
>
> I submitted a change in r267966 to dump this information to the
> console.
> It is also stashed in the process memory so we can inspect it
> in a coredump.
>
> Would it be possible to upgrade chaos and/or havoc to r267966 so we
> can make progress on debugging this issue?
>
> best
> Neel
>
> > best
> > Neel
> >
> >> sean
> >>

Yeah, I'll see if I can get that done this weekend. Waiting for build
breakages to be resolved. :-)

sean
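
For anyone reading the `--get-gpa-pmap` output quoted above, here is a rough sketch of how the printed EPT entries could be picked apart. It is not part of bhyve or bhyvectl. The bit positions (read, write, execute in bits 0 to 2, page-frame address in bits 51:12) are taken from the Intel SDM description of EPT paging-structure entries. The helper names `parse_pmap_line` and `decode_ept_entry` are made up for illustration, and the 52-bit physical address width is an assumption. Bits 52 and above are only reported, not interpreted. The one check included, write permission without read permission, is one of the conditions the SDM lists as an EPT misconfiguration.

```python
# Hypothetical helpers for reading "gpa 0: 0x... 0x... 0" lines as quoted
# in this thread. Bit layout per the Intel SDM; MAXPHYADDR=52 is assumed.

EPT_READ = 1 << 0
EPT_WRITE = 1 << 1
EPT_EXEC = 1 << 2
ADDR_MASK = ((1 << 52) - 1) & ~0xFFF  # bits 51:12 of the entry


def parse_pmap_line(line):
    """Return the entry values that follow the 'gpa N:' prefix."""
    _, _, values = line.partition(":")
    return [int(tok, 16) for tok in values.split()]


def decode_ept_entry(entry):
    """Split one EPT entry into permissions, address and high bits."""
    perms = "".join(
        flag if entry & bit else "-"
        for flag, bit in (("r", EPT_READ), ("w", EPT_WRITE), ("x", EPT_EXEC))
    )
    return {
        "perms": perms,
        "phys_addr": entry & ADDR_MASK,
        "high_bits": entry >> 52,  # reported as-is, not interpreted
        # Writable but not readable is an architectural misconfiguration.
        "write_without_read": bool(entry & EPT_WRITE) and not entry & EPT_READ,
    }


if __name__ == "__main__":
    line = "gpa 0: 0x300002c936e007 0x300002c9353007 0x300002c9352007 0"
    for value in parse_pmap_line(line):
        info = decode_ept_entry(value)
        print(hex(value), info["perms"], hex(info["phys_addr"]),
              hex(info["high_bits"]), info["write_without_read"])
```

Run against the vm1 line from havoc, the three non-zero entries all decode as `rwx` with the write-without-read check false, and the trailing 0 decodes as not present (`---`). Note that those entries were read for GPA 0 while the VMs were still running, not after a crash.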