Date: Tue, 28 Apr 2015 19:55:03 +0300
From: Konstantin Belousov
To: John Baldwin
Cc: freebsd-arch@freebsd.org, Davide Italiano, Adrian Chadd
Subject: Re: RFC: setting performance_cx_lowest=C2 in -HEAD to avoid lock contention on many-CPU boxes
Message-ID: <20150428165503.GK2390@kib.kiev.ua>
References: <1832557.zVusTDjZUx@ralph.baldwin.cx> <20150428141302.GH2390@kib.kiev.ua> <3094092.O50xjOxef9@ralph.baldwin.cx>
In-Reply-To: <3094092.O50xjOxef9@ralph.baldwin.cx>

On Tue, Apr 28, 2015 at 10:23:33AM -0400, John Baldwin wrote:
> On Tuesday, April 28, 2015 05:13:02 PM Konstantin Belousov wrote:
> > On Tue, Apr 28, 2015 at 09:35:10AM -0400, John Baldwin wrote:
> > > On Saturday, April 25, 2015 11:45:10 AM Adrian Chadd wrote:
> > > > On 25 April 2015 at 11:18, Davide Italiano wrote:
> > > > > On Sat, Apr 25, 2015 at 9:31 AM, Adrian Chadd wrote:
> > > > >> Hi!
> > > > >>
> > > > >> I've been doing some NUMA testing on large boxes and I've found that
> > > > >> there's lock contention in the ACPI path. It's due to my change a
> > > > >> while ago to start using sleep states above ACPI C1 by default. The
> > > > >> ACPI C3 state involves a bunch of register fiddling in the ACPI sleep
> > > > >> path that grabs a serialiser lock, and on an 80 thread box this is
> > > > >> costly.
> > > > >>
> > > > >> I'd like to drop performance_cx_lowest to C2 in -HEAD. ACPI C2 state
> > > > >> doesn't require the same register fiddling (to disable bus mastering,
> > > > >> if I'm reading it right) and so it doesn't enter that particular
> > > > >> serialised path. I've verified on Westmere-EX, Sandybridge, Ivybridge
> > > > >> and Haswell boxes that ACPI C2 does let one drop down into a deeper
> > > > >> CPU sleep state (C6 on each of these). I think it is still a good default
> > > > >> for both servers and desktops.
> > > > >>
> > > > >> If no-one has a problem with this then I'll do it after the weekend.
> > > > >>
> > > > >
> > > > > This sounds to me just a way to hide a problem.
> > > > > Very few people nowadays run on NUMA and they can tune the machine as
> > > > > they like when they do testing.
> > > > > If there's a lock contention problem, it needs to be fixed and not
> > > > > hidden under another default.
> > > >
> > > > The lock contention problem is inside ACPI and how it's designed/implemented.
> > > > We're not going to easily be able to make ACPI lock "better" as we're
> > > > constrained by how ACPI implements things in the shared ACPICA code.
> > >
> > > Is the contention actually harmful? Note that this only happens when the
> > > CPUs are idle, not when doing actual work. In addition, IIRC, the ACPI idle
> > > stuff uses heuristics to only drop into deeper sleep states if the CPU has
> > > recently been idle "more" so that if you are relatively busy you will only go
> > > into C1 instead. (I think this latter might have changed since eventtimers
> > > came in, it looks like we now choose the idle state based on how long until
> > > the next timer interrupt?)
> > You have to spin, waiting for other cores, to get the right to reduce the
> > power state.
>
> Yes, normally spinning wouldn't do that, but the cpu idle hooks run with
> interrupts disabled. We could fix that perhaps though ACPI doesn't quite
> have what we would want (a single op that would disable interrupts after
> grabbing the lock, do the test and set of the bit in question and return
> its old value leaving interrupts disabled after dropping the lock).
>
> However, I would still like to know if the contention here is actually
> harmful in some measurable way aside from showing up in profiling output.

I think Adrian could run intel pmc on his box with C2 and C3 and compare
the power reports.

>
> > > Alternatively, your machine may be better off using cpu_idle_mwait. There
> > > are already CPUs now that only advertise deeper sleep states for use with
> > > mwait but not ACPI, so we may certainly end up with defaulting to mwait
> > > instead of ACPI for certain CPUs anyway.
> >
> > cpu_idle_mwait is quite useless, it only enters C1, which should be
> > almost the same as hlt.
> > mwait for C1 might reduce latency of waking up,
> > but definitely would not reduce power consumption on par with higher Cx.
>
> Mmm, it was your pending patch I was thinking of. Don't you use mwait with
> the hints to use deeper sleep states in your change?

Only in the acpi idle method. It is not safe to blindly enter states
higher than C1 with mwait. Intel wrote a driver for Linux which does
not rely on ACPI _CST tables for this. The driver has hard-coded tables
for cores >= Nehalem which specify supported states, latency and cache
behaviour. This is what I tried to mention in the original mail.

If we write such a driver (and rip the tables from Linux), we could allow
deeper states in cpu_idle_mwait. But I remember that avg did not
like the approach, and I agree that this is not maintainable if you
are not Intel.

>
> > That said, I think that for non-laptop usage, limiting lowest state to C2
> > is fine. For Haswells, Intel's recommendation for BIOS writers is to
> > limit the announced states to C2 (eliminating the BM avoidance at all).
> > Internally ACPI C2 is mapped to CPU C6 or might be even C7.
>
> The problem of course is detecting non-laptops. :-/ In my own crude
> measurements based on the power draw numbers in the BMC on recent
> SuperMicro X9 boards for SandyBridge servers, most of the gain you get is
> from C2; C3 doesn't add much difference once you are able to do C2. Also of
> note is the comment above the busmaster register in question about USB. I'm
> not sure if that is still true anymore. If it were, systems would never go
> into C3 in which case this would be a moot point and there would be no need to
> enable C3.

I remember turbo boost requires C3, and non-trivially deep package C
states on older CPUs also require C3. This is an argument against
Adrian's change, but I think it is not applicable on newer processors.
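
To make the table-driven idea concrete, here is a minimal sketch of how
a hard-coded per-model table of supported states, latency and cache
behaviour could drive the choice of an mwait hint. It is not code from
FreeBSD or from the Linux driver: the structure, the function
pick_idle_state(), the field names and every number in example_table
are hypothetical placeholders, and the table is assumed to be ordered
from the shallowest to the deepest state.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* One entry of a hypothetical per-CPU-model idle-state table. */
struct idle_state {
	const char *name;             /* CPU C-state name, e.g. "C1" */
	uint32_t    mwait_hint;       /* hint value handed to MWAIT */
	uint32_t    exit_latency_us;  /* wakeup latency of the state */
	uint32_t    min_residency_us; /* idle time needed to break even */
	int         flushes_cache;    /* nonzero if cache contents are lost */
};

/*
 * Placeholder values for illustration only; they are assumptions and
 * are not taken from any datasheet or existing driver.
 * Entries must be ordered from shallowest to deepest.
 */
const struct idle_state example_table[] = {
	{ "C1", 0x00,   3,   6, 0 },
	{ "C3", 0x10,  20,  80, 1 },
	{ "C6", 0x20, 200, 800, 1 },
};

/*
 * Return the deepest state whose exit latency fits the caller's limit
 * and whose break-even residency fits the predicted idle time.
 * The first (shallowest) entry is always acceptable.
 */
const struct idle_state *
pick_idle_state(const struct idle_state *table, size_t n,
    uint32_t predicted_idle_us, uint32_t latency_limit_us)
{
	const struct idle_state *best;
	size_t i;

	assert(table != NULL && n > 0);
	best = &table[0];
	for (i = 1; i < n; i++) {
		if (table[i].exit_latency_us > latency_limit_us)
			break;
		if (table[i].min_residency_us > predicted_idle_us)
			break;
		best = &table[i];
	}
	return (best);
}
```

With a table like this, the idle loop would call pick_idle_state() with
the time until the next timer event and pass the returned mwait_hint to
mwait. The maintenance concern is visible in the sketch itself: every
new core generation needs its own vendor-supplied table.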