From owner-freebsd-current@FreeBSD.ORG  Thu May  3 10:28:52 2012
Date: Thu, 3 May 2012 12:28:44 +0200
From: Steven Atreju <snatreju@googlemail.com>
To: "K. Macy" <kmacy@freebsd.org>
Message-ID: <20120503102844.GU633@sherwood.local>
References: <20120502182557.GA93838@onelab2.iet.unipi.it>
	<20120502215249.GT633@sherwood.local>
	<CAHM0Q_NNoMrtwcz-xoQ34oVmgJSyjeb_7O6qBHCe16eFeTot_w@mail.gmail.com>
MIME-Version: 1.0
Content-Type: text/plain; charset=utf-8
Content-Disposition: inline
In-Reply-To: <CAHM0Q_NNoMrtwcz-xoQ34oVmgJSyjeb_7O6qBHCe16eFeTot_w@mail.gmail.com>
User-Agent: Mutt/1.5.21 (2010-09-15)
X-Mailman-Approved-At: Thu, 03 May 2012 10:42:33 +0000
Cc: Luigi Rizzo <rizzo@iet.unipi.it>, current@freebsd.org, net@freebsd.org
Subject: Re: fast bcopy...

K. Macy wrote [2012-05-03 02:58+0200]:
> It's highly chipset and processor dependent what works best.

Yes, of course.
Though i was kind of shocked when i first saw this:

  http://marc.info/?l=dragonfly-commits&m=132241713812022&w=2

So we don't use our assembler version for new gccs and HAMMER or
SSE3+ (the decision for these was rather arbitrary, except that
they already existed for an instant implementation).

> Intel now has non-temporal loads and stores which work much
> better in some cases but provide little benefit in others.

Yes, our 2002 tests showed that these were *extremely*
dependent upon alignment.  (Note: 2002. o-)
Hmm, it doesn't really matter, but i guess this is a good time to
thank the FreeBSD hackers for that FPU stack FILD/FISTP idea!
I'll append the copy-related notes from our doc/memperf.txt.
Thanks,

> -Kip

Steven.


I. x86 (AMD Athlon 1600+, 256MB DDR, 133/133 FSB)
-------------------------------------------------

COPY
....

The basic idea is always the same:
- Branch off to REPZ MOVSB if fewer than 16 bytes are to be copied.
- Align at least one pointer on a nice boundary (&3 or &7).
  (Done by a byte loop; one 4/8 store is more expensive here.)
  We always align the _from pointer because tests showed this to be
  the better choice.
- DEPENDENT (see below).
- Copy the remaining bytes (at most 3) with unrolled MOVSB.
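
The steps above can be sketched in portable C.  This is only an
illustration of the control flow, not the original assembler: the name
sketch_copy is made up for this example, the byte loops stand in for
(REPZ) MOVSB, and the DEPENDENT part is reduced to its simplest variant,
a 4-byte word loop standing in for REPZ MOVSL.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical name; a sketch of the four steps, not the real routine. */
static void *sketch_copy(void *to, const void *from, size_t len)
{
    unsigned char *d = to;
    const unsigned char *s = from;

    assert(to != NULL && from != NULL);

    /* Step 1: fewer than 16 bytes skip straight to the byte loop below. */
    if (len >= 16) {
        /* Step 2: byte loop until the _from pointer is 4-byte aligned.
         * At most 3 bytes are consumed here, so len cannot underflow. */
        while (((uintptr_t)s & 3) != 0) {
            *d++ = *s++;
            len--;
        }
        /* Step 3 (DEPENDENT, simplest case): copy 4-byte words.
         * memcpy keeps this well-defined even if d is unaligned. */
        while (len >= 4) {
            uint32_t w;
            memcpy(&w, s, sizeof w);
            memcpy(d, &w, sizeof w);
            s += 4;
            d += 4;
            len -= 4;
        }
    }
    /* Step 4: trailing bytes (at most 3 after the word loop,
     * up to 15 when the short path was taken). */
    while (len > 0) {
        *d++ = *s++;
        len--;
    }
    return to;
}
```

Calling sketch_copy(dst + 1, src, n) exercises the case where only the
_from pointer ends up aligned, which is what the UNF columns measure.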

DEPENDENT:
- !SF_FPU && !defined(SF_X86_MMX): just a matter of REPZ MOVSL.
- Otherwise we use three different loops over 64, 16 and 8 bytes,
  respectively.  If more than 4 bytes remain after that we use one
  additional MOVSL.
  Note that the 8 byte loop is not a loop but executes once only.

  The big loop uses pairs of MOVNTQ/MOVQ, MOVQ/MOVQ and FILD/FISTP, if
  _SSE, _MMX or _FPU, respectively.  The _SSE loop exists in addition and
  is used only if the other (the _to) pointer is also aligned.
  The two smaller ones never use SSE's non-temporal moves; this way we
  can simply go ahead no matter whether the _to pointer is aligned or not.
  Tests demonstrated that non-temporal moves are no win for them anyway.

  At the end we add additional SFENCE (if _SSE) and EMMS (_MMX) or FEMMS
  (if _3DNOW) to serialize the non-temporal moves and clear the MMX state,
  respectively.  The SFENCE should not be needed, however.
  Prefetching is not used (very bad on Athlon (or i don't understand it)).
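
The cascade of loops described above might look roughly like this in
portable C.  It is a sketch under assumptions: move8 and sketch_bulk are
hypothetical names, a plain 8-byte move stands in for each MOVQ/MOVQ or
FILD/FISTP pair, and the non-temporal MOVNTQ variant is not modelled
(so no SFENCE or EMMS is needed here).  The condition for the final
4-byte move is read as "at least 4 bytes remain", so that at most 3
bytes are left for the trailing MOVSB step.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Move one 8-byte quantity; stands in for one load/store pair. */
static inline void move8(unsigned char *d, const unsigned char *s)
{
    uint64_t q;
    memcpy(&q, s, sizeof q);
    memcpy(d, &q, sizeof q);
}

/* Hypothetical helper: copies everything except the last (at most 3)
 * bytes and returns the number of bytes actually copied. */
static size_t sketch_bulk(unsigned char *d, const unsigned char *s, size_t len)
{
    size_t done = 0;

    assert(d != NULL && s != NULL);

    /* Big loop: 64 bytes per iteration as eight 8-byte moves. */
    while (len - done >= 64) {
        for (size_t i = 0; i < 64; i += 8)
            move8(d + done + i, s + done + i);
        done += 64;
    }
    /* Medium loop: 16 bytes per iteration. */
    while (len - done >= 16) {
        move8(d + done, s + done);
        move8(d + done + 8, s + done + 8);
        done += 16;
    }
    /* The "8 byte loop" runs at most once, so it is a plain if. */
    if (len - done >= 8) {
        move8(d + done, s + done);
        done += 8;
    }
    /* One additional 4-byte move (MOVSL) if enough bytes remain. */
    if (len - done >= 4) {
        uint32_t w;
        memcpy(&w, s + done, sizeof w);
        memcpy(d + done, &w, sizeof w);
        done += 4;
    }
    return done;
}
```

For a length of 91, for instance, the loops take 64, 16 and 8 bytes and
leave 3 bytes for the trailing byte copy.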

1. !_MMX && !_FPU
2. _MMX
3. _FPU (thanks to the FreeBSD crew for this idea!)
4. _MMX+_3DNOW+_SSE implementation (all we have).
   ([*] times in brackets show which time has been measured if the from
   pointer alignment loop has a leading '.ALIGN 2' statement; note
   especially the value for 4096...  note this value in general.)

UNT: unaligned pointers, to pointer alignment goal
UNF: unaligned pointers, from pointer alignment goal
1000 loops; times in (averaged) microseconds

P.S.: 03-04-01: SSE stuff disabled because speed for smaller ranges is
considered more important than speed for large and very large ranges.
(And the difference is small for non-perfect ranges and non-aligned pointers.)

---------------------------------------------------------------------------
|bytes|   1./ UNT/ UNF |   2./ UNT/ UNF |   3./ UNT/ UNF |   4.[*]  / UNF |
|--------------------------------------------------------------------------
|16   |   34/    /     |   19/    /  37 |   21/    /  37 |   24[ 26]/  37 |
|15   |   40/    /     |   39/    /  35 |   37/    /  35 |   38[ 39]/  35 |
|32   |   36/    /     |   23/    /  30 |   23/    /  30 |   27[ 30]/  33 |
|31   |   43/    /     |   37/    /  28 |   36/    /  28 |   38[ 42]/  31 |
|64   |   45/    /     |   17/    /  38 |   17/    /  36 |   21[ 23]/  39 |
|63   |   50/    /     |   46/    /  35 |   44/    /  34 |   47[ 50]/  37 |
|128  |   59/  70/  74 |   31/    /  45 |   34/    /  47 |   34[ 36]/  50 |
|127  |   67/  82/  62 |   53/    /  45 |   51/    /  44 |   62[ 63]/  50 |
|256  |   89/ 111/ 108 |   52/    /  74 |   53/    /  77 |   50[ 50]/  76 |
|255  |   99/ 123/  96 |   67/    /  73 |   73/    /  75 |   68[ 70]/  74 |
|512  |  151/ 197/ 177 |   95/    / 131 |   98/    / 137 |   84[103]/ 137 |
|511  |  158/ 208/ 166 |  100/    / 132 |  117/    / 134 |   99[112]/ 135 |
|1024 |  274/ 395/ 314 |  179/    / 255 |  211/    / 270 |  166[207]/ 257 |
|1023 |  280/ 408/ 303 |  196/    / 253 |  225/    / 267 |  184[185]/ 253 |
|2048 |  579/ 765/ 966 |  350/    / 485 |  394/    / 511 |  389[388]/ 486 |
|2047 |  585/ 777/ 942 |  368/    / 484 |  410/    / 520 |  323[398]/ 484 |
|4096 | 1009/1385/1140 |  704/    /1036 |  761/    /1040 |  671[583]/1038 |
|4095 | 1027/1386/1130 |  721/    /1034 |  776/    /1037 |  602[604]/1035 |
|--------------------------------------------------------------------------
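
A measurement of this kind could be reproduced with a small harness
along the following lines.  This is a guess at the method, not the
original test program: time_copy_us and copy_fn are made-up names,
POSIX clock_gettime() is assumed to be available, and the returned
value is taken to be the total time for all loops (averaging over
several runs would be done by the caller).

```c
#define _POSIX_C_SOURCE 199309L
#include <assert.h>
#include <stddef.h>
#include <string.h>
#include <time.h>

/* Any routine with a memcpy-like signature can be measured. */
typedef void *(*copy_fn)(void *, const void *, size_t);

/* Hypothetical harness: wall-clock microseconds spent on `loops`
 * back-to-back copies of `len` bytes. */
static double time_copy_us(copy_fn fn, void *to, const void *from,
                           size_t len, int loops)
{
    struct timespec t0, t1;

    assert(loops > 0);

    clock_gettime(CLOCK_MONOTONIC, &t0);
    for (int i = 0; i < loops; i++)
        fn(to, from, len);
    clock_gettime(CLOCK_MONOTONIC, &t1);

    /* Seconds and nanoseconds are converted to microseconds. */
    return (double)(t1.tv_sec - t0.tv_sec) * 1e6
         + (double)(t1.tv_nsec - t0.tv_nsec) / 1e3;
}
```

Something like time_copy_us(memcpy, dst + 1, src + 1, 4096, 1000) would
then correspond to one unaligned-pointer cell of the 4096 row, with the
usual caveat that an optimizing compiler may treat the libc memcpy
specially.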

P.S.: ooops - i had really forgotten that the SSE stuff was
completely disabled at a later time!  I guess we'll have to redo
some testing eventually!