From: John Baldwin
To: freebsd-arch@freebsd.org
Cc: Adrian Chadd, Davide Italiano
Subject: Re: RFC: setting performance_cx_lowest=C2 in -HEAD to avoid lock contention on many-CPU boxes
Date: Tue, 28 Apr 2015 09:35:10 -0400
Message-ID: <1832557.zVusTDjZUx@ralph.baldwin.cx>

On Saturday, April 25, 2015 11:45:10 AM Adrian Chadd wrote:
> On 25 April 2015 at 11:18, Davide Italiano wrote:
> > On Sat, Apr 25, 2015 at 9:31 AM, Adrian Chadd wrote:
> >> Hi!
> >>
> >> I've been doing some NUMA testing on large boxes and I've found that
> >> there's lock contention in the ACPI path. It's due to my change a
> >> while ago to start using sleep states above ACPI C1 by default. The
> >> ACPI C3 state involves a bunch of register fiddling in the ACPI sleep
> >> path that grabs a serialiser lock, and on an 80 thread box this is
> >> costly.
> >>
> >> I'd like to drop performance_cx_lowest to C2 in -HEAD. ACPI C2 state
> >> doesn't require the same register fiddling (to disable bus mastering,
> >> if I'm reading it right) and so it doesn't enter that particular
> >> serialised path. I've verified on Westmere-EX, Sandybridge, Ivybridge
> >> and Haswell boxes that ACPI C2 does let one drop down into a deeper
> >> CPU sleep state (C6 on each of these). I think is still a good default
> >> for both servers and desktops.
> >>
> >> If no-one has a problem with this then I'll do it after the weekend.
> >>
> >
> > This sounds to me just a way to hide a problem.
> > Very few people nowaday run on NUMA and they can tune the machine as
> > they like when they do testing.
> > If there's a lock contention problem, it needs to be fixed and not
> > hidden under another default.
>
> The lock contention problem is inside ACPI and how it's designed/implemented.
> We're not going to easily be able to make ACPI lock "better" as we're
> constrained by how ACPI implements things in the shared ACPICA code.

Is the contention actually harmful?  Note that this only happens when
the CPUs are idle, not when they are doing actual work.  In addition,
IIRC, the ACPI idle code uses heuristics to drop into deeper sleep
states only if the CPU has recently been idle "more", so that if you
are relatively busy you will only go into C1 instead.  (I think this
latter part might have changed since eventtimers came in; it looks like
we now choose the idle state based on how long it is until the next
timer interrupt?)
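
To make that selection rule concrete, here is a rough sketch of the idea.
It is not the actual ACPI idle code: the struct, the function name
pick_idle_state(), and the factor of 3 are all assumptions made up for
illustration.  It takes the shorter of the previous sleep length and the
time until the next timer event as the idle budget, then picks the
deepest permitted state whose transition latency fits within it.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical description of one idle state; not the real ACPI structures. */
struct idle_state {
	const char *name;		/* e.g. "C1" */
	uint32_t    trans_lat_us;	/* entry + exit latency, microseconds */
};

/*
 * Return the index of the deepest usable state.  States are assumed to be
 * ordered from shallow to deep, and "lowest" is the deepest index the
 * administrator permits (the role cx_lowest plays).  Index 0 is the
 * fallback when nothing deeper fits.
 */
size_t
pick_idle_state(const struct idle_state *states, size_t nstates,
    size_t lowest, uint32_t prev_sleep_us, uint32_t next_event_us)
{
	/* Expected idle time: the more pessimistic of the two estimates. */
	uint32_t budget = prev_sleep_us < next_event_us ?
	    prev_sleep_us : next_event_us;
	size_t i;

	if (nstates == 0)
		return (0);
	if (lowest >= nstates)
		lowest = nstates - 1;

	/* Walk from deep to shallow; take the first state that fits. */
	for (i = lowest; i > 0; i--) {
		/* Assumed safety factor: latency must be <= 1/3 of budget. */
		if ((uint64_t)states[i].trans_lat_us * 3 <= budget)
			return (i);
	}
	return (0);
}
```

With a rule like this, a busy CPU (short budget) stays in the shallowest
state and never reaches the C3 entry path where the lock is taken.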

If the only consequence of this is that it adds noise to profiling,
then hack your profiling results to ignore this lock.  I think that is
a better tradeoff than sacrificing power gains to reduce noise in
profiling output.

Alternatively, your machine may be better off using cpu_idle_mwait.
There are already CPUs that advertise their deeper sleep states only
for use with mwait and not via ACPI, so we may well end up defaulting
to mwait instead of ACPI for certain CPUs anyway.

-- 
John Baldwin