From: Jim Harris <jim.harris@gmail.com>
Date: Thu, 1 Nov 2012 11:36:13 -0700
Subject: Re: CACHE_LINE_SIZE on x86
To: Andre Oppermann
Cc: Attilio Rao, freebsd-arch@freebsd.org

On Thu, Nov 1, 2012 at 7:44 AM, Andre Oppermann wrote:
> On 01.11.2012 01:50, Jim Harris wrote:
>>
>> On Thu, Oct 25, 2012 at 2:40 PM, Jim Harris <jim.harris@gmail.com> wrote:
>>
>>     On Thu, Oct 25, 2012 at 2:32 PM, John Baldwin <jhb@freebsd.org> wrote:
>>     >
>>     > It would be good to know though if there are performance benefits
>>     > from avoiding sharing across paired lines in this manner. Even if
>>     > it has its own MOESI state, there might still be negative effects
>>     > from sharing the pair.
>>
>>     On 2S, I do see further benefits by using 128 byte padding instead
>>     of 64. On 1S, I see no difference. I've been meaning to turn off
>>     prefetching on my system to see if it has any effect in the 2S case -
>>     I can give that a shot tomorrow.
>>
>> So tomorrow turned into next week, but I have some data finally.
>>
>> I've updated to HEAD from today, including all of the mtx_padalign
>> changes. I tested 64 v. 128 byte alignment on 2S amd64 (SNB Xeon). My
>> BIOS also has a knob to disable the adjacent line prefetching (MLC
>> spatial prefetcher), so I ran both 64b and 128b against this specific
>> prefetcher both enabled and disabled.
>>
>> MLC prefetcher enabled: 3-6% performance improvement, 1-5% decrease in
>> CPU utilization by using 128b padding instead of 64b.
>
> Just to be sure. The numbers you show are just for the one location
> you've converted to the new padded mutex and a particular test case?
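For anyone following along, the padded mutex being measured boils down to
roughly the following. This is just a generic illustration of the padding
idea, not the literal struct mtx_padalign definition in sys/sys/mutex.h:

/*
 * Generic illustration of the padded-lock idea: align (and therefore
 * pad) each lock out to CACHE_LINE_SIZE so that two hot locks can
 * never share a cache line, or an adjacent-line prefetch pair when
 * CACHE_LINE_SIZE is set to 128 on x86.
 */
#include <sys/param.h>
#include <sys/lock.h>
#include <sys/mutex.h>

struct example_pad_mtx {
	struct mtx	lock;
} __aligned(CACHE_LINE_SIZE);

/*
 * With the alignment above, sizeof(struct example_pad_mtx) is rounded
 * up to a multiple of CACHE_LINE_SIZE, so adjacent array entries (for
 * example one per CPU) land on distinct 128-byte line pairs instead of
 * ping-ponging a shared line between sockets.
 */
static struct example_pad_mtx example_locks[MAXCPU];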
There are two locations actually - the struct tdq lock in the ULE
scheduler, and the callout_cpu lock in kern_timeout.c.

And yes, I've only been running a custom benchmark I developed here to
help uncover some of these areas of spinlock contention. It was
originally used for NVMe driver performance testing, but has been
helpful in uncovering some other issues outside of the NVMe driver
itself (such as these contended spinlocks). It spawns a large number of
kernel threads, each of which submits an I/O and then sleeps until it is
woken by the interrupt thread when the I/O completes. It stresses the
scheduler and also callout, since I start and stop a timer for each I/O.

I think the only thing this proves is that there is benefit to having
x86 CACHE_LINE_SIZE still set to 128.

Thanks,

-Jim
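P.S. In case the shape of the benchmark matters for interpreting the
numbers, the per-thread loop is roughly the following. This is only a
sketch of what I described above; the io_* helpers are placeholders
standing in for the real NVMe submission path, not actual driver code:

/*
 * Rough sketch of the benchmark's per-thread loop.  Setup (mtx_init,
 * callout_init, kthread_add) is omitted; the point here is only the
 * scheduler and callout traffic generated for every I/O.
 */
#include <sys/param.h>
#include <sys/systm.h>
#include <sys/kernel.h>
#include <sys/lock.h>
#include <sys/mutex.h>
#include <sys/callout.h>

struct io_worker {
	struct mtx	lock;	/* protects 'done' */
	struct callout	timer;	/* armed and disarmed once per I/O */
	int		done;	/* set by the interrupt thread */
};

static void
io_timeout(void *arg)
{
	/* Placeholder I/O timeout handler; not expected to fire. */
}

static void
io_submit_one(struct io_worker *w)
{
	/* Placeholder: the real benchmark queues one NVMe I/O here. */
}

/* Called from the driver's interrupt thread when the I/O completes. */
static void
io_complete(struct io_worker *w)
{
	mtx_lock(&w->lock);
	w->done = 1;
	wakeup(w);			/* wakes the worker: hits the tdq lock */
	mtx_unlock(&w->lock);
}

static void
io_worker_thread(void *arg)
{
	struct io_worker *w = arg;

	for (;;) {
		w->done = 0;		/* no I/O outstanding yet, so no race */
		io_submit_one(w);
		callout_reset(&w->timer, 10 * hz, io_timeout, w);
		mtx_lock(&w->lock);
		while (w->done == 0)
			msleep(w, &w->lock, 0, "iowait", 0);
		mtx_unlock(&w->lock);
		callout_stop(&w->timer); /* each completion also touches callout_cpu */
	}
}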