From: Andrew Duane
To: John Nielsen, John Baldwin
Subject: RE: consistent VM hang during reboot
Date: Thu, 8 May 2014 18:42:24 +0000
References: <201405081303.17079.jhb@freebsd.org>
Cc: freebsd-hackers@freebsd.org, freebsd-virtualization@freebsd.org

When I was doing some early work on some of the Octeon multi-core chips, I encountered something similar. If I remember correctly, there was an issue in the shutdown sequence: it did not properly halt the cores and set up the "start jump" vector. So the first core would start, and when it tried to start the next ones it would hang waiting for the ACK that they were running (since they didn't have a start vector and hence never started). I know MIPS, not AMD, so I can't say what the equivalent would be, but I'm sure there is one. Check that part, the setup of the early state.

If Juli and/or Adrian are reading this: do you remember anything about that, from something like 2 years ago?

....................................
Andrew L. Duane
AT&T Technical Lead
JNCIA - JUNOS
m +1 603.770.7088
o +1 408.933.6944 (2-6944)
skype: andrewlduane
aduane@juniper.net

-----Original Message-----
From: owner-freebsd-hackers@freebsd.org [mailto:owner-freebsd-hackers@freebsd.org] On Behalf Of John Nielsen
Sent: Thursday, May 08, 2014 1:56 PM
To: John Baldwin
Cc: freebsd-hackers@freebsd.org; freebsd-virtualization@freebsd.org
Subject: Re: consistent VM hang during reboot

On May 8, 2014, at 11:03 AM, John Baldwin wrote:

> On Wednesday, May 07, 2014 7:15:43 pm John Nielsen wrote:
>> I am trying to solve a problem with amd64 FreeBSD virtual machines running on a Linux+KVM hypervisor. To be honest I'm not sure if the problem is in FreeBSD or the hypervisor, but I'm trying to rule out the OS first.
>>
>> The _second_ time FreeBSD boots in a virtual machine with more than one core, the boot hangs just before the kernel would normally print e.g. "SMP: AP CPU #1 Launched!" (The last line on the console is "usbus0: 12Mbps Full Speed USB v1.0", but the problem persists even without USB). The VM will boot fine a first time, but running either "shutdown -r now" OR "reboot" will lead to a hung second boot. Stopping and starting the host qemu-kvm process is the only way to continue.
>>
>> The problem seems to be triggered by something in the SMP portion of cpu_reset() (from sys/amd64/amd64/vm_machdep.c). If I hit the virtual "reset" button the next boot is fine. If I have 'kern.smp.disabled="1"' set for the initial boot then subsequent boots are fine (but I can only use one CPU core, of course). However, if I boot normally the first time then set 'kern.smp.disabled="1"' for the second (re)boot, the problem is triggered. Apparently something in the shutdown code is "poisoning the well" for the next boot.
>>
>> The problem is present in FreeBSD 8.4, 9.2, 10.0 and 11-CURRENT as of yesterday.
>>
>> This (heavy-handed and wrong) patch (to HEAD) lets me avoid the issue:
>>
>> --- sys/amd64/amd64/vm_machdep.c.orig 2014-05-07 13:19:07.400981580 -0600
>> +++ sys/amd64/amd64/vm_machdep.c 2014-05-07 17:02:52.416783795 -0600
>> @@ -593,7 +593,7 @@
>>  void
>>  cpu_reset()
>>  {
>> -#ifdef SMP
>> +#if 0
>>  	cpuset_t map;
>>  	u_int cnt;
>>
>>
>> I've tried skipping or disabling smaller chunks of code within the #if block but haven't found a consistent winner yet.
>>
>> I'm hoping the list will have suggestions on how I can further narrow down the problem, or theories on what might be going on.
>
> Can you try forcing the reboot to occur on the BSP (via 'cpuset -l 0 reboot')
> or a non-BSP ('cpuset -l 1 reboot') to see if that has any effect? It might
> not, but if it does it would help narrow down the code to consider.

Hello jhb, thanks for responding.

I tried your suggestion but unfortunately it does not make any difference. The reboot hangs regardless of which CPU I assign the command to.

Any other suggestions?

JN
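
For anyone trying to picture the Octeon failure mode described at the top of this message (a boot core waiting for an ACK from cores that never received a start vector), here is a rough user-space model in C. It is only a sketch under stated assumptions: struct sim_cpu, sim_ap_tick() and sim_start_aps() are made-up names for illustration, not FreeBSD, Octeon or KVM code, and whether the amd64 hang in this thread has the same cause is not established. The one point it tries to show is that bounding the wait turns a silent hang into an answer to "which AP never came up?".

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

#define MAX_CPUS 8

/* Hypothetical per-CPU boot state; not a real kernel structure. */
struct sim_cpu {
	bool has_start_vector;	/* was a start address installed before reset? */
	bool running;		/* set by the AP once it starts (the "ACK") */
};

/* Model one poll interval: an AP can only start if it has a start vector. */
static void
sim_ap_tick(struct sim_cpu *ap)
{
	if (ap->has_start_vector)
		ap->running = true;
}

/*
 * Boot-CPU side: poll each AP for its ACK, but give up after max_polls
 * polls instead of spinning forever.  Index 0 is the boot CPU itself.
 * Returns the index of the first AP that never acknowledged, or -1 if
 * every AP started.
 */
static int
sim_start_aps(struct sim_cpu *cpus, size_t ncpus, unsigned max_polls)
{
	assert(ncpus <= MAX_CPUS);
	for (size_t i = 1; i < ncpus; i++) {
		unsigned polls = 0;

		while (!cpus[i].running && polls < max_polls) {
			sim_ap_tick(&cpus[i]);
			polls++;
		}
		if (!cpus[i].running)
			return ((int)i);	/* this AP would have hung the boot */
	}
	return (-1);
}
```

In this model, calling sim_start_aps() on a three-entry array whose last entry has has_start_vector set to false returns 2, while an unbounded version of the same loop would never return. That is the shape of hang the Octeon anecdote describes.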