From: Neel Natu
To: Sean Bruno
Cc: "freebsd-virtualization@freebsd.org"
Date: Fri, 27 Jun 2014 14:23:12 -0700
Subject: Re: jenkins bhyve vms crashing and burning after several days of use
References: <1403818926.2417.6.camel@bruno> <1403819194.2417.8.camel@bruno> <1403821402.2417.12.camel@bruno>
List-Id: "Discussion of various virtualization techniques FreeBSD supports."

Hi,

On Thu, Jun 26, 2014 at 3:43 PM, Neel Natu wrote:
> Hi Sean,
>
> On Thu, Jun 26, 2014 at 3:23 PM, Sean Bruno wrote:
>> On Thu, 2014-06-26 at 15:00 -0700, Neel Natu wrote:
>>> Hi Sean,
>>>
>>> On Thu, Jun 26, 2014 at 2:46 PM, Sean Bruno wrote:
>>> > On Thu, 2014-06-26 at 14:42 -0700, Sean Bruno wrote:
>>> >> so, we're seeing the bhyve vms running in the freebsd cluster for
>>> >> jenkins crashing and burning after a couple of days of use.
>>> >>
>>> >> vm exit[9]
>>> >> reason VMX
>>> >> rip 0x0000000029286336
>>> >> inst_length 3
>>> >> status 0
>>> >> exit_reason 49
>>> >> qualification 0x0000000000000000
>>> >> inst_type 0
>>> >> inst_error 0
>>> >>
>>> >>
>>> >> It looks like we have an active core file on havoc.ysv if you have a
>>> >> moment to look at it:
>>> >>
>>> >> http://people.freebsd.org/~sbruno/bhyve.core
>>> >>
>>> >> FreeBSD havoc.ysv.freebsd.org 11.0-CURRENT FreeBSD 11.0-CURRENT #2
>>> >> r267362: Wed Jun 11 14:56:34 UTC 2014
>>> >> sbruno@havoc.freebsd.org:/usr/obj/usr/src/sys/HAVOC amd64
>>> >>
>>> >
>>> > Also, from chaos.ysv
>>> >
>>> > http://people.freebsd.org/~sbruno/bhyve.core.chaos
>>> >
>>> > FreeBSD chaos.ysv.freebsd.org 11.0-CURRENT FreeBSD 11.0-CURRENT #1
>>> > r267362: Wed Jun 11 15:50:24 UTC 2014
>>> > sbruno@chaos.ysv.freebsd.org:/usr/obj/usr/src/sys/CHAOS amd64
>>> >
>>>
>>> Can you tell us the processor and memory configuration on havoc and chaos?
>>>
>>> Also, could you execute the following commands on havoc:
>>>
>>> # bhyvectl --vm=vmname --cpu=9 --get-vmcs-guest-physical-address
>>> -- this will output the offending guest physical address that
>>> triggered the EPT misconfiguration
>>>
>>> # bhyvectl --vm=vmname --get-gpa-pmap=
>>> -- this will output the page table entries in the EPT that map to the
>>> offending GPA
>>>
>>> Hopefully that provides us with something to work with.
>>>
>>> best
>>> Neel
>>>
>>> >
>>
>> chaos:
>> CPU: Intel(R) Xeon(R) CPU E5-2660 0 @ 2.20GHz (2200.05-MHz K8-class CPU)
>> Origin="GenuineIntel" Id=0x206d6 Family=0x6 Model=0x2d Stepping=6
>> Features=0xbfebfbff
>> Features2=0x1fbee3ff
>> AMD Features=0x2c100800
>> AMD Features2=0x1
>> TSC: P-state invariant, performance statistics
>> avail memory = 66298322944 (63227 MB)
>>
>> havoc:
>> FreeBSD clang version 3.4.1 (tags/RELEASE_34/dot1-final 208032) 20140512
>> CPU: Intel(R) Xeon(R) CPU E5620 @ 2.40GHz (2400.14-MHz K8-class CPU)
>> Origin="GenuineIntel" Id=0x206c2 Family=0x6 Model=0x2c Stepping=2
>> Features=0xbfebfbff
>> Features2=0x29ee3ff
>> AMD Features=0x2c100800
>> AMD Features2=0x1
>> TSC: P-state invariant, performance statistics
>> avail memory = 16571621376 (15803 MB)
>>
>
> Thanks, we'll see if there are relevant errata for these processors.
>

Actually, these processors have entirely different microarchitectures
(Nehalem and Sandy Bridge), so it's unlikely that this is due to
processor errata.
>>
>> There appear to be three vms running on havoc:
>> root@havoc.ysv:/home/sbruno # bhyvectl --vm=vm1 --cpu=9 --get-vmcs-guest-physical-address
>> gpa[9] 0x0000000000000000
>> root@havoc.ysv:/home/sbruno # bhyvectl --vm=vm2 --cpu=9 --get-vmcs-guest-physical-address
>> gpa[9] 0x0000000000000000
>> root@havoc.ysv:/home/sbruno # bhyvectl --vm=vm3 --cpu=9 --get-vmcs-guest-physical-address
>> gpa[9] 0x0000000000000000
>>
>> root@havoc.ysv:/home/sbruno # bhyvectl --vm=vm1 --cpu=9 --get-gpa-pmap=0x0000000000000000
>> gpa 0: 0x300002c936e007 0x300002c9353007 0x300002c9352007 0
>>
>> root@havoc.ysv:/home/sbruno # bhyvectl --vm=vm2 --cpu=9 --get-gpa-pmap=0x0000000000000000
>> gpa 0: 0x30000286cb0007 0x300003ad105007 0x3000019b1fd007 0
>>
>> root@havoc.ysv:/home/sbruno # bhyvectl --vm=vm3 --cpu=9 --get-gpa-pmap=0x0000000000000000
>> gpa 0: 0x300002c9348007 0x300002c9339007 0
>>
>>
>> But there's no information available on chaos at the moment as there are
>> no active vms running.
>>
>
> Sorry, I should have explained a bit more.
>
> After bhyve(8) exits because of the EPT misconfiguration error, there
> are breadcrumbs left over in the VMCS as well as the nested page
> tables. We can use them to diagnose what happened.
>
> The bhyvectl commands above should be executed after the VM exits but
> before it is restarted again. Once it restarts, the breadcrumbs get
> written over and are of no use.
>
> The "--vm=" passed to the bhyvectl command should be of the
> virtual machine that crashed.
> The "--cpu=" passed to the bhyvectl command should be the
> vcpuid that detected the EPT misconfiguration. The reason I used '9'
> as an example above was because you saw this on the console:
>
> vm exit[9]
> reason VMX
> rip 0x0000000029286336
>
> Hope that helps.
>

I submitted a change in r267966 to dump this information to the console.
It is also stashed in the process memory so we can inspect it in a
coredump.
Would it be possible to upgrade chaos and/or havoc to r267966 so we can
make progress on debugging this issue?

best
Neel

> best
> Neel
>
>> sean
>>
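
For reference, the post-crash procedure described in this thread could be sketched roughly as below. This is only an illustration, not part of r267966: the function names (vcpu_from_console, print_gpa_cmd, print_pmap_cmd) are hypothetical, and the script only prints the bhyvectl command lines rather than running them. It assumes the console output contains a "vm exit[N]" line as in the report above, and it uses only the bhyvectl options already quoted in this thread.

```sh
#!/bin/sh
# Sketch only: helper names are hypothetical, and nothing here runs bhyvectl.
# Run the printed commands after the VM exits but before it is restarted.

# Read console output on stdin and print N from the first "vm exit[N]" line.
vcpu_from_console() {
    sed -n 's/^.*vm exit\[\([0-9][0-9]*\)].*$/\1/p' | head -n 1
}

# $1 = name of the VM that crashed, $2 = vcpuid that reported the exit.
# Prints the command that reports the offending guest physical address.
print_gpa_cmd() {
    printf 'bhyvectl --vm=%s --cpu=%s --get-vmcs-guest-physical-address\n' "$1" "$2"
}

# $1 = name of the VM that crashed, $2 = GPA reported by the command above.
# Prints the command that dumps the EPT entries mapping that GPA.
print_pmap_cmd() {
    printf 'bhyvectl --vm=%s --get-gpa-pmap=%s\n' "$1" "$2"
}

# Example (console.log is a hypothetical capture of the bhyve console):
#   vcpu=$(vcpu_from_console < console.log)
#   print_gpa_cmd vm1 "$vcpu"
#   print_pmap_cmd vm1 0x0000000000000000
```

Printing rather than executing keeps the sketch safe to try on a host without a crashed VM; the second command needs the GPA that the first one reports, so the two steps cannot be fully automated without parsing bhyvectl output.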