Date: Fri, 27 Jun 2014 16:46:51 -0700
From: Sean Bruno <sbruno@ignoranthack.me>
To: Neel Natu <neelnatu@gmail.com>
Cc: "freebsd-virtualization@freebsd.org" <freebsd-virtualization@freebsd.org>
Subject: Re: jenkins bhyve vms crashing and burning after several days of use
Message-ID: <1403912811.5727.7.camel@bruno>
In-Reply-To: <CAFgRE9H2QLzQ3mKp1a4zfNBinhVu60F0MMovuSk4sEO0y20FeQ@mail.gmail.com>
References: <1403818926.2417.6.camel@bruno> <1403819194.2417.8.camel@bruno> <CAFgRE9GYHzenX7px6-Sp6BfeTVA0-jcwg=JgcGXKuBeFJXUoog@mail.gmail.com> <1403821402.2417.12.camel@bruno> <CAFgRE9HpA_LQStzPYpDUU0erqNp%2BKOrjwK%2B7A7RGfD7XTCi1Hg@mail.gmail.com> <CAFgRE9H2QLzQ3mKp1a4zfNBinhVu60F0MMovuSk4sEO0y20FeQ@mail.gmail.com>

On Fri, 2014-06-27 at 14:23 -0700, Neel Natu wrote:
> Hi,
>
> On Thu, Jun 26, 2014 at 3:43 PM, Neel Natu <neelnatu@gmail.com> wrote:
> > Hi Sean,
> >
> > On Thu, Jun 26, 2014 at 3:23 PM, Sean Bruno <sbruno@ignoranthack.me> wrote:
> >> On Thu, 2014-06-26 at 15:00 -0700, Neel Natu wrote:
> >>> Hi Sean,
> >>>
> >>> On Thu, Jun 26, 2014 at 2:46 PM, Sean Bruno <sbruno@ignoranthack.me> wrote:
> >>> > On Thu, 2014-06-26 at 14:42 -0700, Sean Bruno wrote:
> >>> >> so, we're seeing the bhyve vms running in the freebsd cluster for
> >>> >> jenkins crashing and burning after a couple of days of use.
> >>> >>
> >>> >> vm exit[9]
> >>> >> reason VMX
> >>> >> rip 0x0000000029286336
> >>> >> inst_length 3
> >>> >> status 0
> >>> >> exit_reason 49
> >>> >> qualification 0x0000000000000000
> >>> >> inst_type 0
> >>> >> inst_error 0
> >>> >>
> >>> >>
> >>> >> It looks like we have an active core file on havoc.ysv if you have a
> >>> >> moment to look at it:
> >>> >>
> >>> >> http://people.freebsd.org/~sbruno/bhyve.core
> >>> >>
> >>> >> FreeBSD havoc.ysv.freebsd.org 11.0-CURRENT FreeBSD 11.0-CURRENT #2
> >>> >> r267362: Wed Jun 11 14:56:34 UTC 2014
> >>> >> sbruno@havoc.freebsd.org:/usr/obj/usr/src/sys/HAVOC amd64
> >>> >>
> >>> >
> >>> > Also, from chaos.ysv
> >>> >
> >>> > http://people.freebsd.org/~sbruno/bhyve.core.chaos
> >>> >
> >>> > FreeBSD chaos.ysv.freebsd.org 11.0-CURRENT FreeBSD 11.0-CURRENT #1
> >>> > r267362: Wed Jun 11 15:50:24 UTC 2014
> >>> > sbruno@chaos.ysv.freebsd.org:/usr/obj/usr/src/sys/CHAOS amd64
> >>> >
> >>>
> >>> Can you tell us the processor and memory configuration on havoc and chaos?
> >>>
> >>> Also, could you execute the following commands on havoc:
> >>>
> >>> # bhyvectl --vm=vmname --cpu=9 --get-vmcs-guest-physical-address
> >>> -- this will output the offending guest physical address that
> >>> triggered the EPT misconfiguration
> >>>
> >>> # bhyvectl --vm=vmname --get-gpa-pmap=<gpa_from_above>
> >>> -- this will output the page table entries in the EPT that map to the
> >>> offending GPA
> >>>
> >>> Hopefully that provides us with something to work with.
> >>>
> >>> best
> >>> Neel
> >>>
> >>> >
> >>
> >> chaos:
> >> CPU: Intel(R) Xeon(R) CPU E5-2660 0 @ 2.20GHz (2200.05-MHz K8-class CPU)
> >> Origin="GenuineIntel" Id=0x206d6 Family=0x6 Model=0x2d Stepping=6
> >> Features=0xbfebfbff<FPU,VME,DE,PSE,TSC,MSR,PAE,MCE,CX8,APIC,SEP,MTRR,PGE,MCA,CMOV,PAT,PSE36,CLFLUSH,DTS,ACPI,MMX,FXSR,SSE,SSE2,SS,HTT,TM,PBE>
> >> Features2=0x1fbee3ff<SSE3,PCLMULQDQ,DTES64,MON,DS_CPL,VMX,SMX,EST,TM2,SSSE3,CX16,xTPR,PDCM,PCID,DCA,SSE4.1,SSE4.2,x2APIC,POPCNT,TSCDLT,AESNI,XSAVE,OSXSAVE,AVX>
> >> AMD Features=0x2c100800<SYSCALL,NX,Page1GB,RDTSCP,LM>
> >> AMD Features2=0x1<LAHF>
> >> TSC: P-state invariant, performance statistics
> >> avail memory = 66298322944 (63227 MB)
> >>
> >> havoc:
> >> FreeBSD clang version 3.4.1 (tags/RELEASE_34/dot1-final 208032) 20140512
> >> CPU: Intel(R) Xeon(R) CPU E5620 @ 2.40GHz (2400.14-MHz
> >> K8-class CPU)
> >> Origin="GenuineIntel" Id=0x206c2 Family=0x6 Model=0x2c Stepping=2
> >> Features=0xbfebfbff<FPU,VME,DE,PSE,TSC,MSR,PAE,MCE,CX8,APIC,SEP,MTRR,PGE,MCA,CMOV,PAT,PSE36,CLFLUSH,DTS,ACPI,MMX,FXSR,SSE,SSE2,SS,HTT,TM,PBE>
> >> Features2=0x29ee3ff<SSE3,PCLMULQDQ,DTES64,MON,DS_CPL,VMX,SMX,EST,TM2,SSSE3,CX16,xTPR,PDCM,PCID,DCA,SSE4.1,SSE4.2,POPCNT,AESNI>
> >> AMD Features=0x2c100800<SYSCALL,NX,Page1GB,RDTSCP,LM>
> >> AMD Features2=0x1<LAHF>
> >> TSC: P-state invariant, performance statistics
> >> avail memory = 16571621376 (15803 MB)
> >>
> >
> > Thanks, we'll see if there are relevant errata for these processors.
> >
>
> Actually these processors have entirely different microarchitectures
> (Nehalem and Sandy Bridge) so it's unlikely that this is due to
> processor errata.
>
> >>
> >> There appear to be three vms running on havoc:
> >> root@havoc.ysv:/home/sbruno # bhyvectl --vm=vm1 --cpu=9
> >> --get-vmcs-guest-physical-address
> >> gpa[9] 0x0000000000000000
> >> root@havoc.ysv:/home/sbruno # bhyvectl --vm=vm2 --cpu=9
> >> --get-vmcs-guest-physical-address
> >> gpa[9] 0x0000000000000000
> >> root@havoc.ysv:/home/sbruno # bhyvectl --vm=vm3 --cpu=9
> >> --get-vmcs-guest-physical-address
> >> gpa[9] 0x0000000000000000
> >>
> >> root@havoc.ysv:/home/sbruno # bhyvectl --vm=vm1 --cpu=9
> >> --get-gpa-pmap=0x0000000000000000
> >> gpa 0: 0x300002c936e007 0x300002c9353007 0x300002c9352007 0
> >>
> >> root@havoc.ysv:/home/sbruno # bhyvectl --vm=vm2 --cpu=9
> >> --get-gpa-pmap=0x0000000000000000
> >> gpa 0: 0x30000286cb0007 0x300003ad105007 0x3000019b1fd007 0
> >>
> >> root@havoc.ysv:/home/sbruno # bhyvectl --vm=vm3 --cpu=9
> >> --get-gpa-pmap=0x0000000000000000
> >> gpa 0: 0x300002c9348007 0x300002c9339007 0
> >>
> >>
> >> But there's no information available on chaos at the moment as there are
> >> no active vms running.
> >>
> >
> > Sorry, I should have explained a bit more.
> >
> > After a bhyve(8) exits because of the EPT misconfiguration error there
> > are breadcrumbs left over in the VMCS as well as the nested page
> > tables. We can use them to diagnose what happened.
> >
> > The bhyvectl commands above should be executed after the VM exits but
> > before it is restarted again. Once it restarts, the breadcrumbs get
> > written over and are of no use.
> >
> > The "--vm=<vmname>" passed to the bhyvectl command should be of the
> > virtual machine that crashed.
> > The "--cpu=<vcpuid>" passed to the bhyvectl command should be the
> > vcpuid that detected the EPT misconfiguration. The reason I used '9'
> > as an example above was because you saw this on the console:
> >
> > vm exit[9]
> > reason VMX
> > rip 0x0000000029286336
> >
> > Hope that helps.
> >
>
> I submitted a change in r267966 to dump this information to the
> console. It is also stashed in the process memory so we can inspect it
> in a coredump.
>
> Would it be possible to upgrade chaos and/or havoc to r267966 so we
> can make progress on debugging this issue?
>
> best
> Neel
>
> > best
> > Neel
> >
> >> sean
> >>

Yeah, I'll see if I can get that done this weekend. Waiting for build
breakages to be resolved. :-)

sean
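
For anyone reading the --get-gpa-pmap output quoted in this thread, here is a rough sketch of how the raw EPT entries could be picked apart. It is not taken from bhyve or bhyvectl: the helper names decode_ept_entry and parse_pmap_line are made up for illustration, and the bit layout is the generic one from the Intel SDM (bits 2:0 are read/write/execute, bits 5:3 are the memory type in a leaf entry, bits 51:12 hold the physical address). The "suspect" flag covers only two of the conditions the SDM lists for an EPT misconfiguration (VMX exit reason 49, as seen in the console dump): a writable but unreadable entry, and a reserved memory type in a leaf. The script does not know which entry on a line is the leaf, so that would have to be passed in by hand.

```python
# Illustrative sketch only; names are hypothetical and not part of bhyvectl.
# Bit positions follow the generic EPT entry layout in the Intel SDM.

EPT_READ, EPT_WRITE, EPT_EXEC = 0x1, 0x2, 0x4
ADDR_MASK = 0x000FFFFFFFFFF000   # bits 51:12, physical address of next level or page
RESERVED_MEMTYPES = {2, 3, 7}    # memory type encodings the SDM marks as reserved


def decode_ept_entry(entry, leaf=False):
    """Split one raw 64-bit EPT entry into its basic fields."""
    perms = entry & 0x7
    memtype = (entry >> 3) & 0x7
    # Writable without readable is one documented misconfiguration.
    write_without_read = bool(perms & EPT_WRITE) and not (perms & EPT_READ)
    # A reserved memory type only matters in a present leaf entry.
    bad_memtype = leaf and perms != 0 and memtype in RESERVED_MEMTYPES
    flags = (("r", EPT_READ), ("w", EPT_WRITE), ("x", EPT_EXEC))
    return {
        "present": perms != 0,
        "perms": "".join(c if perms & bit else "-" for c, bit in flags),
        "addr": entry & ADDR_MASK,
        "memtype": memtype if leaf else None,
        "suspect": write_without_read or bad_memtype,
    }


def parse_pmap_line(line):
    """Turn a 'gpa 0: 0x... 0x... 0' line into a list of integers."""
    _, _, values = line.partition(":")
    return [int(tok, 16) for tok in values.split()]


if __name__ == "__main__":
    sample = "gpa 0: 0x300002c936e007 0x300002c9353007 0x300002c9352007 0"
    for level, raw in enumerate(parse_pmap_line(sample)):
        print(level, hex(raw), decode_ept_entry(raw))
```

Run against the vm1 line from havoc, the three non-zero entries all decode as rwx and the trailing 0 as not present. As Neel points out, those lines were captured for GPA 0 on running VMs, so they say nothing about the access that actually failed.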