From: Steven Atreju
To: "K. Macy"
Cc: Luigi Rizzo, current@freebsd.org, net@freebsd.org
Date: Thu, 3 May 2012 12:28:44 +0200
Subject: Re: fast bcopy...
Message-ID: <20120503102844.GU633@sherwood.local>
References: <20120502182557.GA93838@onelab2.iet.unipi.it> <20120502215249.GT633@sherwood.local>

K. Macy wrote [2012-05-03 02:58+0200]:
> It's highly chipset and processor dependent what works best.

Yes, of course.  Though i was kind of shocked when i first saw
this:

  http://marc.info/?l=dragonfly-commits&m=132241713812022&w=2

So we don't use our assembler version for new gccs and HAMMER or
SSE3+ (the choice of these conditions was rather arbitrary, except
that they already existed, which allowed an immediate
implementation).

> Intel now has non-temporal loads and stores which work much
> better in some cases but provide little benefit in others.

Yes, our 2002 tests have shown that these were *extremely*
dependent upon alignment.  (Note: 2002. o-)

Hmm, it doesn't really matter, but i guess this is a good time to
thank the FreeBSD hackers for that FPU stack FILD/FISTP idea!
I'll append the copy-related notes from our doc/memperf.txt.
Thanks,

> -Kip

Steven.

I. x86 (AMD Athlon 1600+, 256MB DDR, 133/133 FSB)
-------------------------------------------------

COPY
....
The basic idea is always the same:
- Branch off to REPZ MOVSB if less than 16 bytes to go.
- Align at least one pointer on a nice boundary (&3 or &7).
  (Done by a byte loop; one 4/8 store is more expensive here.)
  We always align the _from pointer due to test experience.
- DEPENDENT.
- Do the remaining maximally 3 bytes in an unrolled MOVSB way.

DEPENDENT:
- !SF_FPU && !defined(SF_X86_MMX): just a matter of REPZ MOVSL.
- Otherwise we use three different loops over 64, 16 and 8 bytes,
  respectively.  If more than 4 bytes remain after that we use one
  additional MOVSL.  Note that the 8 byte loop is not a loop but
  executes once only.
  The big loop uses pairs of MOVNTQ/MOVQ, MOVQ/MOVQ and FILD/FISTP,
  if _SSE, _MMX or _FPU, respectively.
  The _SSE loop exists in addition and is never used if the
  non-aligned (the _to) pointer is not also aligned.
  The two smaller ones never use SSE's non-temporal moves; that way
  they can be used regardless of whether the _to pointer is aligned
  or not.  Tests demonstrated that non-temporal is no win for them
  anyway.
  At the end we add an additional SFENCE (if _SSE) and EMMS (if
  _MMX) or FEMMS (if _3DNOW) to serialize the non-temporal moves and
  clear the MMX state, respectively.
  The SFENCE should not be needed, however.
  Prefetching is not used (very bad on Athlon (or i don't understand
  it)).

1. !_MMX && !_FPU
2. _MMX
3. _FPU (thanks to the FreeBSD crew for this idea!)
4. _MMX+_3DNOW+_SSE implementation (all we have).
   ([*] times in brackets show which time has been measured if the
   from pointer alignment loop has a leading '.ALIGN 2' statement;
   note especially the value for 4096... note this value in general.)

UNT: unaligned pointers, to pointer alignment goal
UNF: unaligned pointers, from pointer alignment goal
1000 loops; times in (averaged) microseconds

P.S.: 03-04-01: SSE stuff disabled because speed for smaller ranges
is considered more important than for large and very large ranges.
(And the difference is small for non-perfect ranges and non-aligned
pointers.)

--------------------------------------------------------------------------
|bytes|   1./ UNT/ UNF |   2./ UNT/ UNF |   3./ UNT/ UNF | 4.[*]   / UNF |
|------------------------------------------------------------------------|
|16   |   34/    /     |   19/    /  37 |   21/    /  37 |  24[ 26]/  37 |
|15   |   40/    /     |   39/    /  35 |   37/    /  35 |  38[ 39]/  35 |
|32   |   36/    /     |   23/    /  30 |   23/    /  30 |  27[ 30]/  33 |
|31   |   43/    /     |   37/    /  28 |   36/    /  28 |  38[ 42]/  31 |
|64   |   45/    /     |   17/    /  38 |   17/    /  36 |  21[ 23]/  39 |
|63   |   50/    /     |   46/    /  35 |   44/    /  34 |  47[ 50]/  37 |
|128  |   59/  70/  74 |   31/    /  45 |   34/    /  47 |  34[ 36]/  50 |
|127  |   67/  82/  62 |   53/    /  45 |   51/    /  44 |  62[ 63]/  50 |
|256  |   89/ 111/ 108 |   52/    /  74 |   53/    /  77 |  50[ 50]/  76 |
|255  |   99/ 123/  96 |   67/    /  73 |   73/    /  75 |  68[ 70]/  74 |
|512  |  151/ 197/ 177 |   95/    / 131 |   98/    / 137 |  84[103]/ 137 |
|511  |  158/ 208/ 166 |  100/    / 132 |  117/    / 134 |  99[112]/ 135 |
|1024 |  274/ 395/ 314 |  179/    / 255 |  211/    / 270 | 166[207]/ 257 |
|1023 |  280/ 408/ 303 |  196/    / 253 |  225/    / 267 | 184[185]/ 253 |
|2048 |  579/ 765/ 966 |  350/    / 485 |  394/    / 511 | 389[388]/ 486 |
|2047 |  585/ 777/ 942 |  368/    / 484 |  410/    / 520 | 323[398]/ 484 |
|4096 | 1009/1385/1140 |  704/    /1036 |  761/    /1040 | 671[583]/1038 |
|4095 | 1027/1386/1130 |  721/    /1034 |  776/    /1037 | 602[604]/1035 |
|------------------------------------------------------------------------|

P.S.: ooops - i've really forgotten that the SSE stuff has been
completely disabled at a later time!  I guess we'll have to redo
some testing eventually!
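
For readers who want the control flow of the COPY notes in one place, here is a rough portable C sketch of it.  It is not the original assembler and says nothing about its performance.  Assumptions: the name sketch_copy is hypothetical; plain 8 byte word moves (done via memcpy, so unaligned destinations stay legal C) stand in for the MOVQ/MOVQ or FILD/FISTP pairs; the non-temporal MOVNTQ path and the closing SFENCE/EMMS/FEMMS have no counterpart here; and the 4 byte step is taken when 4 or more bytes remain, so that the tail really is at most 3 bytes.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Move one 8 byte word; memcpy keeps this valid for an unaligned 'd'. */
static void move8(unsigned char *d, const unsigned char *s)
{
    uint64_t w;
    memcpy(&w, s, sizeof w);
    memcpy(d, &w, sizeof w);
}

/* Hypothetical portable stand-in for the assembler copy routine. */
void *sketch_copy(void *to, const void *from, size_t len)
{
    unsigned char *d = to;
    const unsigned char *s = from;

    /* Fewer than 16 bytes: skip everything and fall to the byte tail. */
    if (len >= 16) {
        /* Align the _from pointer on an 8 byte boundary, byte by byte.
         * At most 7 bytes are consumed, so len cannot underflow. */
        while (((uintptr_t)s & 7) != 0) {
            *d++ = *s++;
            len--;
        }
        /* Big loop: 64 bytes per iteration as eight 8 byte moves. */
        while (len >= 64) {
            for (int i = 0; i < 8; i++)
                move8(d + 8 * i, s + 8 * i);
            d += 64; s += 64; len -= 64;
        }
        /* Medium loop: 16 bytes per iteration. */
        while (len >= 16) {
            move8(d, s);
            move8(d + 8, s + 8);
            d += 16; s += 16; len -= 16;
        }
        /* The "8 byte loop" executes at most once. */
        if (len >= 8) {
            move8(d, s);
            d += 8; s += 8; len -= 8;
        }
        /* One additional 4 byte move (assumed: taken for len >= 4). */
        if (len >= 4) {
            uint32_t w;
            memcpy(&w, s, sizeof w);
            memcpy(d, &w, sizeof w);
            d += 4; s += 4; len -= 4;
        }
    }
    /* Tail: at most 3 bytes after the word path, or the whole range
     * for short copies. */
    while (len > 0) {
        *d++ = *s++;
        len--;
    }
    return to;
}
```

Calling sketch_copy(dst, src, n) behaves like memcpy for non-overlapping ranges; only the source pointer is aligned, matching the "we always align the _from pointer" choice in the notes.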