From owner-cvs-src@FreeBSD.ORG  Thu Jun 19 15:45:58 2008
Return-Path: <owner-cvs-src@FreeBSD.ORG>
Delivered-To: cvs-src@freebsd.org
Received: from mx1.freebsd.org (mx1.freebsd.org [IPv6:2001:4f8:fff6::34])
	by hub.freebsd.org (Postfix) with ESMTP id 52C3C1065675
	for <cvs-src@freebsd.org>; Thu, 19 Jun 2008 15:45:58 +0000 (UTC)
	(envelope-from asmrookie@gmail.com)
Received: from fg-out-1718.google.com (fg-out-1718.google.com [72.14.220.156])
	by mx1.freebsd.org (Postfix) with ESMTP id BA8FA8FC19
	for <cvs-src@freebsd.org>; Thu, 19 Jun 2008 15:45:57 +0000 (UTC)
	(envelope-from asmrookie@gmail.com)
Received: by fg-out-1718.google.com with SMTP id l26so470315fgb.35
	for <cvs-src@freebsd.org>; Thu, 19 Jun 2008 08:45:56 -0700 (PDT)
DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=gmail.com; s=gamma;
	h=domainkey-signature:received:received:message-id:date:from:sender
	:to:subject:cc:in-reply-to:mime-version:content-type
	:content-transfer-encoding:content-disposition:references
	:x-google-sender-auth;
	bh=BhvKKA83djpnriiHqmO2tXYdY+JPxVmLIf1tc+Xt6bM=;
	b=vF2JdKKpcU8B0TGL7hqBNDuY6S+FLktk8FqOpaB29Srv+KqvCFtKWOPrW52V2P9jN2
	ilSQZ9m2aBvwZ+BmHj9QzWEPm7Il/4ZwOzzIjnlQfa9AJFgEExI50NMR6JjEn8B6FYp5
	fz+tmIVO7YrClP01JxaAbwB4FNwTLfQyP/jx4=
DomainKey-Signature: a=rsa-sha1; c=nofws; d=gmail.com; s=gamma;
	h=message-id:date:from:sender:to:subject:cc:in-reply-to:mime-version
	:content-type:content-transfer-encoding:content-disposition
	:references:x-google-sender-auth;
	b=IVk+b2ShZKXx5oKQsRjKE07TM71b0DN74kXoTTdTNn4qaSg0/xtLu8v1/LRHl8jejw
	3aZmE17PTzZmOoPJ33ysmBmtPqoiThsxHs3pTUyclFRQ/xppZkCEa7PL4bL/lGsE+CPW
	h62L717KB82W7UVWVFcbDl989LtG0aMUlv51k=
Received: by 10.86.23.17 with SMTP id 17mr2337479fgw.32.1213890356681;
	Thu, 19 Jun 2008 08:45:56 -0700 (PDT)
Received: by 10.86.2.18 with HTTP; Thu, 19 Jun 2008 08:45:56 -0700 (PDT)
Message-ID: <3bbf2fe10806190845p15e0758cre88cd83ec0bd975d@mail.gmail.com>
Date: Thu, 19 Jun 2008 17:45:56 +0200
From: "Attilio Rao" <attilio@freebsd.org>
Sender: asmrookie@gmail.com
To: "Peter Wemm" <peter@freebsd.org>
In-Reply-To: <3bbf2fe10806190842s381611del5c5dc27d2dd22a7e@mail.gmail.com>
MIME-Version: 1.0
Content-Type: text/plain; charset=UTF-8
Content-Transfer-Encoding: 7bit
Content-Disposition: inline
References: <200803232309.m2NN96Qa080896@repoman.freebsd.org>
	<3bbf2fe10806190842s381611del5c5dc27d2dd22a7e@mail.gmail.com>
X-Google-Sender-Auth: 18f4386e35b811ab
Cc: cvs-src@freebsd.org, src-committers@freebsd.org, cvs-all@freebsd.org
Subject: Re: cvs commit: src/sys/amd64/amd64 cpu_switch.S
X-BeenThere: cvs-src@freebsd.org
X-Mailman-Version: 2.1.5
Precedence: list
List-Id: CVS commit messages for the src tree <cvs-src.freebsd.org>
List-Unsubscribe: <http://lists.freebsd.org/mailman/listinfo/cvs-src>,
	<mailto:cvs-src-request@freebsd.org?subject=unsubscribe>
List-Archive: <http://lists.freebsd.org/pipermail/cvs-src>
List-Post: <mailto:cvs-src@freebsd.org>
List-Help: <mailto:cvs-src-request@freebsd.org?subject=help>
List-Subscribe: <http://lists.freebsd.org/mailman/listinfo/cvs-src>,
	<mailto:cvs-src-request@freebsd.org?subject=subscribe>
X-List-Received-Date: Thu, 19 Jun 2008 15:45:58 -0000

2008/6/19, Attilio Rao <attilio@freebsd.org>:
> 2008/3/24, Peter Wemm <peter@freebsd.org>:
>
> > peter       2008-03-23 23:09:06 UTC
>  >
>  >   FreeBSD src repository
>  >
>  >   Modified files:
>  >     sys/amd64/amd64      cpu_switch.S
>  >   Log:
>  >   First pass at (possibly futile) microoptimizing of cpu_switch.  Results
>  >   are mixed.  Some pure context switch microbenchmarks show up to 29%
>  >   improvement.  Pipe based context switch microbenchmarks show up to 7%
>  >   improvement.  Real world tests are far less impressive as they are
>  >   dominated more by actual work than switch overheads, but depending on
>  >   the machine in question, workload, kernel options, phase of moon, etc, a
>  >   few percent gain might be seen.
>  >
>  >   Summary of changes:
>  >   - don't reload MSR_[FG]SBASE registers when context switching between
>  >     non-threaded userland apps.  These typically cost 120 clock cycles each
>  >     on an AMD cpu (less on Barcelona/Phenom).  Intel cores are probably no
>  >     faster on this.
>  >   - The above change only helps unthreaded userland apps that tend to use
>  >     the same value for gsbase.  Threaded apps will get no benefit from this.
>  >   - reorder things like accessing the pcb to be in memory order, to give
>  >     prefetching a better chance of working.  Operations are now in increasing
>  >     memory address order, rather than reverse or random.
>  >   - Push some lesser used code out of the main code paths.  Hopefully
>  >     allowing better code density in cache lines.  This is probably futile.
>  >   - (part 2 of previous item) Reorder code so that branches have a more
>  >     realistic static branch prediction hint.  Both Intel and AMD cpus
>  >     default to predicting branches to lower memory addresses as being
>  >     taken, and to higher memory addresses as not being taken.  This is
>  >     overridden by the limited dynamic branch prediction subsystem.  A trip
>  >     through userland might overflow this.
>  >   - Futile attempt at spreading the use of the results of previous operations
>  >     in new operations.  Hopefully this will allow the cpus to execute in
>  >     parallel better.
>  >   - stop wasting 16 bytes at the top of kernel stack, below the PCB.
>  >   - Never load the userland fs/gsbase registers for kthreads, but preserve
>  >     curpcb->pcb_[fg]sbase as caches for the cpu. (Thanks Jeff!)
>  >
>  >   Microbenchmarking this code seems to be really sensitive to things like
>  >   scheduling luck, timing, cache behavior, tlb behavior, kernel options,
>  >   other random code changes, etc.
>  >
>  >   While it doesn't help heavy userland workloads much, it does help high
>  >   context switch loads a little, and should help those that involve
>  >   switching via kthreads a bit more.
>  >
>  >   A special thanks to Kris for the testing and reality checks, and Jeff for
>  >   tormenting me into doing this. :)
>  >
>  >   This is still work-in-progress.
>
>
> It looks like this patch introduces a regression.
>  In particular, this chunk:
>
>  @@ -181,82 +166,138 @@ sw1:
>         cmpq    %rcx, %rdx
>         pause
>         je      1b
>  -       lfence
>   #endif
>
>  is not totally right as we want to enforce an acq

...an acquire memory barrier in order to correctly handle a possible
thread migration.
We could use this approach, which is what I implemented on ia32 to
solve the same problem:

#define	BLOCK_SPIN(reg)							\
		movl		$blocked_lock,%eax ;			\
	100: ;								\
		lock ;							\
		cmpxchgl	%eax,TD_LOCK(reg) ;			\
		jne		101f ;					\
		pause ;							\
		jmp		100b ;					\
	101:
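
For anyone who finds the macro hard to read, the same idea can be
sketched in C11 atomics. This is only an illustration and not kernel
code: struct thread_sketch, blocked_lock_storage, BLOCKED_LOCK and
block_spin() are names made up for this example. The compare-and-swap
writes blocked_lock over blocked_lock, so it never changes the stored
value; it is there for its ordering effect. In this sketch it is given
acquire semantics on both outcomes (on x86 the locked cmpxchg is
actually a full barrier).

```c
#include <stdatomic.h>
#include <stdint.h>

/* Hypothetical stand-in for the kernel's blocked_lock sentinel. */
static int blocked_lock_storage;
#define BLOCKED_LOCK ((uintptr_t)&blocked_lock_storage)

/* Hypothetical stand-in for the td_lock field of struct thread. */
struct thread_sketch {
	_Atomic uintptr_t td_lock;
};

/*
 * Spin while td->td_lock still equals the blocked_lock sentinel.
 * Returning only after an acquire operation has observed a different
 * value keeps later loads from being reordered before that observation.
 */
static void
block_spin(struct thread_sketch *td)
{
	for (;;) {
		uintptr_t expected = BLOCKED_LOCK;

		/*
		 * Succeeds only if td_lock == BLOCKED_LOCK, in which case it
		 * stores the same value back. Failure means the lock word
		 * has been changed, so we can stop spinning.
		 */
		if (!atomic_compare_exchange_strong_explicit(&td->td_lock,
		    &expected, BLOCKED_LOCK,
		    memory_order_acquire, memory_order_acquire))
			return;
		/* The assembly version issues "pause" here as a spin hint. */
	}
}
```

A plain acquire load in a loop would probably be enough for the
ordering in C terms; the sketch uses a compare-and-swap only to stay
close to the macro above.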

Thanks,
Attilio

[Sorry, I pushed "send" too early by mistake]

-- 
Peace can only be achieved by understanding - A. Einstein