From: Kostik Belousov
To: Kevin Day
Cc: freebsd-hackers@freebsd.org
Subject: Re: Extremely slow boot on VMWare with Opteron 2352 (acpi?)
Date: Wed, 10 Mar 2010 13:27:53 +0200
Message-ID: <20100310112753.GW2489@deviant.kiev.zoral.com.ua>

On Tue, Mar 09, 2010 at 06:42:02PM -0600, Kevin Day wrote:
> 
> On Mar 9, 2010, at 4:27 PM, John Baldwin wrote:
> 
> > On Tuesday 09 March 2010 3:40:26 pm Kevin Day wrote:
> >> 
> >> 
> >> If I boot up on an Opteron 2218 system, it boots normally. If I boot the exact same VM moved to a 2352, I get:
> >> 
> >> acpi0: on motherboard
> >> PCIe: Memory Mapped configuration base @ 0xe0000000
> >> (very long pause)
> >> ioapic0: routing intpin 9 (ISA IRQ 9) to lapic 0 vector 48
> >> acpi0: [MPSAFE]
> >> acpi0: [ITHREAD]
> >> 
> >> then booting normally.
> > 
> > It's probably worth adding some printfs to narrow down where the pause is happening. This looks to be all during the acpi_attach() routine, so maybe you can start there.
> 
> Okay, good pointer. This is what I've narrowed down:
> 
> acpi_enable_pcie() calls pcie_cfgregopen(). It's called here with pcie_cfgregopen(0xe0000000, 0, 255).
> inside pcie_cfgregopen, the pause starts here:
> 
>         /* XXX: We should make sure this really fits into the direct map. */
>         pcie_base = (vm_offset_t)pmap_mapdev(base, (maxbus + 1) << 20);
> 
> pmap_mapdev calls pmap_mapdev_attr, and in there this evaluates to true:
> 
>         /*
>          * If the specified range of physical addresses fits within the direct
>          * map window, use the direct map.
>          */
>         if (pa < dmaplimit && pa + size < dmaplimit) {
> 
> so we call pmap_change_attr which called pmap_change_attr_locked. It's changing 0x10000000 bytes starting at 0xffffff00e0000000. The very last line before returning from pmap_change_attr_locked is:
> 
>                 pmap_invalidate_cache_range(base, tmpva);
> 
> And this is where the delay is. This is calling MFENCE/CLFLUSH in a loop 8 million times. We actually had a problem with CLFLUSH causing panics on these same CPUs under Xen, which is partially why we're looking at VMware now. (see kern/138863). I'm wondering if VMware didn't encounter the same problem and replace CLFLUSH with a software emulated version that is far slower... based on the speed is probably invalidating the entire cache. A quick change to pmap_invalidate_cache_range to just clear the entire cache if the area being cleared is over 8MB seems to have fixed it. i.e.:
> 
>         else if (cpu_feature & CPUID_CLFSH) {
> 
> to
> 
>         else if ((cpu_feature & CPUID_CLFSH) && ((eva-sva) < (2<<22))) {
> 
> 
> However, I'm a little blurry on if everything leading to this point is correct. It's ending up with 256MB of memory for the pci area, which seems really excessive. Is the problem just that it wants room for 256 busses, or...? Anyone know this code path well enough to know if this is deviating from the norm?

I think the idea of not doing CLFLUSH in a loop for large regions
is good. We do not extract the L2/L3 cache size now; I suppose that a 2MB
estimate is good for most situations.
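To put numbers on the sizes discussed above, here is a small userland sketch in C. It is only an illustration: the helper names are made up, and the 64-byte cache line used in the note below is an assumption, since the thread does not say which line size the kernel picked up from CPUID.

```c
#include <assert.h>
#include <stdint.h>

/*
 * Size of the memory-mapped PCIe config window, following the
 * (maxbus + 1) << 20 expression quoted above: 1MB per bus.
 * Hypothetical helper, not a kernel function.
 */
static uint64_t
ecam_window_size(unsigned maxbus)
{
	return ((uint64_t)(maxbus + 1) << 20);
}

/*
 * Number of per-line flushes needed to cover `size` bytes when
 * stepping by `line_size` bytes (rounded up).  Hypothetical helper.
 */
static uint64_t
flush_count(uint64_t size, uint64_t line_size)
{
	assert(line_size != 0);
	return ((size + line_size - 1) / line_size);
}
```

With maxbus = 255, ecam_window_size() gives 0x10000000 bytes (256MB), matching the range reported above. Assuming 64-byte lines, flush_count() for that range is 4194304 iterations (a 32-byte stride would give the 8 million quoted), while a 2MB range needs only 32768. For comparison, (2<<22) in the proposed change is 8MB.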
commit bbac1632d349d68b905df644656ce9a8e4aed094
Author: Konstantin Belousov
Date:   Wed Mar 10 13:07:51 2010 +0200

    Fall back to wbinvd when region for CLFLUSH is >= 2MB.
    
    Submitted by:	Kevin Day

diff --git a/sys/amd64/amd64/pmap.c b/sys/amd64/amd64/pmap.c
index 07db5d1..4361be0 100644
--- a/sys/amd64/amd64/pmap.c
+++ b/sys/amd64/amd64/pmap.c
@@ -994,7 +994,8 @@ pmap_invalidate_cache_range(vm_offset_t sva, vm_offset_t eva)
 
 	if (cpu_feature & CPUID_SS)
 		; /* If "Self Snoop" is supported, do nothing. */
-	else if (cpu_feature & CPUID_CLFSH) {
+	else if ((cpu_feature & CPUID_CLFSH) != 0 &&
+	    eva - sva < 2 * 1024 * 1024) {
 
 		/*
 		 * Otherwise, do per-cache line flush.  Use the mfence
@@ -1011,7 +1012,8 @@ pmap_invalidate_cache_range(vm_offset_t sva, vm_offset_t eva)
 
 		/*
 		 * No targeted cache flush methods are supported by CPU,
-		 * globally invalidate cache as a last resort.
+		 * or the supplied range is bigger than 2MB.
+		 * Globally invalidate cache.
 		 */
 		pmap_invalidate_cache();
 	}
diff --git a/sys/i386/i386/pmap.c b/sys/i386/i386/pmap.c
index 4b2e34f..f448071 100644
--- a/sys/i386/i386/pmap.c
+++ b/sys/i386/i386/pmap.c
@@ -996,7 +996,8 @@ pmap_invalidate_cache_range(vm_offset_t sva, vm_offset_t eva)
 
 	if (cpu_feature & CPUID_SS)
 		; /* If "Self Snoop" is supported, do nothing. */
-	else if (cpu_feature & CPUID_CLFSH) {
+	else if ((cpu_feature & CPUID_CLFSH) != 0 &&
+	    eva - sva < 2 * 1024 * 1024) {
 
 		/*
 		 * Otherwise, do per-cache line flush.  Use the mfence
@@ -1013,7 +1014,8 @@ pmap_invalidate_cache_range(vm_offset_t sva, vm_offset_t eva)
 
 		/*
 		 * No targeted cache flush methods are supported by CPU,
-		 * or the supplied range is bigger than 2MB.
+		 * globally invalidate cache as a last resort.
 		 */
 		pmap_invalidate_cache();
 	}
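The decision the patch makes can be exercised outside the kernel with a small model. This is a sketch only: the SK_* names, the feature bit values, and sk_pick_flush() are hypothetical stand-ins, not the kernel's CPUID_* constants or pmap code, and no cache instructions are executed.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical feature bits for this sketch; not the kernel's CPUID_* values. */
#define	SK_FEAT_SS	0x1u	/* CPU snoops its own caches */
#define	SK_FEAT_CLFSH	0x2u	/* CLFLUSH instruction available */

/* Threshold from the patch: ranges of 2MB or more get a global flush. */
#define	SK_CLFLUSH_LIMIT	(2ULL * 1024 * 1024)

enum sk_flush {
	SK_FLUSH_NONE,		/* self snoop, nothing to do */
	SK_FLUSH_LINES,		/* per-cache-line CLFLUSH loop */
	SK_FLUSH_GLOBAL		/* whole-cache invalidate (wbinvd) */
};

/* Model of the branch order in the patched pmap_invalidate_cache_range(). */
static enum sk_flush
sk_pick_flush(unsigned features, uint64_t sva, uint64_t eva)
{
	assert(eva >= sva);
	if ((features & SK_FEAT_SS) != 0)
		return (SK_FLUSH_NONE);
	if ((features & SK_FEAT_CLFSH) != 0 && eva - sva < SK_CLFLUSH_LIMIT)
		return (SK_FLUSH_LINES);
	return (SK_FLUSH_GLOBAL);
}
```

In this model, the 256MB range from the report (0x10000000 bytes starting at 0xffffff00e0000000) on a CPU with CLFLUSH but without self snoop selects SK_FLUSH_GLOBAL, while a single 4KB page still selects SK_FLUSH_LINES. A range of exactly 2MB falls on the global side, consistent with the ">= 2MB" wording of the commit message.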