Date: Thu, 3 May 2012 12:28:44 +0200
From: Steven Atreju <snatreju@googlemail.com>
To: "K. Macy" <kmacy@freebsd.org>
Cc: Luigi Rizzo <rizzo@iet.unipi.it>, current@freebsd.org, net@freebsd.org
Subject: Re: fast bcopy...
Message-ID: <20120503102844.GU633@sherwood.local>
In-Reply-To: <CAHM0Q_NNoMrtwcz-xoQ34oVmgJSyjeb_7O6qBHCe16eFeTot_w@mail.gmail.com>
References: <20120502182557.GA93838@onelab2.iet.unipi.it> <20120502215249.GT633@sherwood.local> <CAHM0Q_NNoMrtwcz-xoQ34oVmgJSyjeb_7O6qBHCe16eFeTot_w@mail.gmail.com>
K. Macy wrote [2012-05-03 02:58+0200]:

> It's highly chipset and processor dependent what works best.

Yes, of course.  Though i was kinda shocked, even, when i first saw
this:

  http://marc.info/?l=dragonfly-commits&m=132241713812022&w=2

So we don't use our assembler version for new gccs and HAMMER or
SSE3+ (the decision for these was rather arbitrary, except that they
already existed and thus allowed an instant implementation).

> Intel now has non-temporal loads and stores which work much
> better in some cases but provide little benefit in others.

Yes, our 2002 tests have shown that these were *extremely* dependent
upon alignment.  (Note: 2002. o-)

Hmm, it doesn't really matter, but i guess this is a good time to
thank the FreeBSD hackers for that FPU stack FILD/FISTP idea!

I'll append the copy-related notes of our doc/memperf.txt.

Thanks,

> -Kip

Steven.


I. x86 (AMD Athlon 1600+, 256MB DDR, 133/133 FSB)
-------------------------------------------------

COPY
....

The basic idea is always the same (a rough C sketch of this overall
structure follows after these notes):

- Branch off to REPZ MOVSB if less than 16 bytes are to go.
- Align at least one pointer on a nice boundary (&3 or &7).
  (Done by a byte loop; one 4/8 byte store is more expensive here.)
  We always align the _from pointer due to test experience.
- DEPENDENT.
- Do the remaining at most 3 bytes in an unrolled MOVSB way.

DEPENDENT:

- !SF_FPU && !defined(SF_X86_MMX): just a matter of REPZ MOVSL.
- Otherwise we use three different loops over 64, 16 and 8 bytes,
  respectively.  If at least 4 bytes remain after that we use one
  additional MOVSL.
  Note that the 8 byte loop is not a loop but executes once only.

  The big loop uses pairs of MOVNTQ/MOVQ, MOVQ/MOVQ and FILD/FISTP
  if _SSE, _MMX or _FPU, respectively.  The _SSE loop exists in
  addition to the others and is only used if the (normally
  unaligned) _to pointer is also aligned.  The two smaller loops
  never use SSE's non-temporal moves; this way we can simply go on
  no matter whether the to pointer is aligned or not.  Tests
  demonstrated that non-temporal is no win for them anyway.

  At the end we add an additional SFENCE (if _SSE) and EMMS (_MMX)
  or FEMMS (if _3DNOW) to serialize the non-temporal moves and to
  clear the MMX state, respectively.  The SFENCE should not be
  needed, however.

  Prefetching is not used (very bad on Athlon (or i don't
  understand it)).

The four implementations measured below are:

1. !_MMX && !_FPU
2. _MMX
3. _FPU (thanks to the FreeBSD crew for this idea!)
4. _MMX+_3DNOW+_SSE implementation (all we have).

([*] times in brackets show the time measured when the from pointer
alignment loop has a leading '.ALIGN 2' statement; note especially
the value for 4096... note this value in general.)

UNT: unaligned pointers, to pointer alignment goal
UNF: unaligned pointers, from pointer alignment goal
1000 loops; times in (averaged) microseconds

P.S.: 03-04-01: SSE stuff disabled because speed for smaller ranges
is considered more important than for large (and even more so, the
largest) ranges.  (And there is only a small difference for
non-perfect ranges and non-aligned pointers.)
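To make the structure described above a bit more concrete, here is a
rough C sketch of the control flow only.  The function name is made
up, and plain memcpy() calls stand in for REPZ MOVSB/MOVSL and for the
MOVNTQ/MOVQ/FILD+FISTP loop bodies, so this is not the measured code,
just its shape:

  #include <stddef.h>
  #include <stdint.h>
  #include <string.h>

  /* Structural sketch only (hypothetical name). */
  static void *
  sketch_copy(void *to, const void *from, size_t len)
  {
          unsigned char *t = to;
          const unsigned char *f = from;

          /* Branch off to a byte copy (REPZ MOVSB) if less than 16 bytes. */
          if (len < 16) {
                  memcpy(t, f, len);
                  return to;
          }

          /* Align the _from pointer on an 8 byte boundary via a byte loop. */
          while ((uintptr_t)f & 7) {
                  *t++ = *f++;
                  --len;
          }

          /* DEPENDENT part: descending loops over 64, 16 and 8 bytes
           * (the 8 byte "loop" executes at most once). */
          for (; len >= 64; t += 64, f += 64, len -= 64)
                  memcpy(t, f, 64);   /* MOVNTQ/MOVQ or FILD/FISTP pairs */
          for (; len >= 16; t += 16, f += 16, len -= 16)
                  memcpy(t, f, 16);
          if (len >= 8) {
                  memcpy(t, f, 8);
                  t += 8; f += 8; len -= 8;
          }

          /* One additional MOVSL if at least 4 bytes remain ... */
          if (len >= 4) {
                  memcpy(t, f, 4);
                  t += 4; f += 4; len -= 4;
          }
          /* ... and the remaining at most 3 bytes, unrolled MOVSB style. */
          while (len--)
                  *t++ = *f++;

          /* The real code ends with SFENCE/EMMS/FEMMS as described above. */
          return to;
  }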
---------------------------------------------------------------------------
|bytes|   1./ UNT/ UNF |   2./ UNT/ UNF |   3./ UNT/ UNF |  4.[*]  /  UNF |
|--------------------------------------------------------------------------
|16   |   34/    /     |   19/    /  37 |   21/    /  37 |   24[ 26]/  37 |
|15   |   40/    /     |   39/    /  35 |   37/    /  35 |   38[ 39]/  35 |
|32   |   36/    /     |   23/    /  30 |   23/    /  30 |   27[ 30]/  33 |
|31   |   43/    /     |   37/    /  28 |   36/    /  28 |   38[ 42]/  31 |
|64   |   45/    /     |   17/    /  38 |   17/    /  36 |   21[ 23]/  39 |
|63   |   50/    /     |   46/    /  35 |   44/    /  34 |   47[ 50]/  37 |
|128  |   59/  70/  74 |   31/    /  45 |   34/    /  47 |   34[ 36]/  50 |
|127  |   67/  82/  62 |   53/    /  45 |   51/    /  44 |   62[ 63]/  50 |
|256  |   89/ 111/ 108 |   52/    /  74 |   53/    /  77 |   50[ 50]/  76 |
|255  |   99/ 123/  96 |   67/    /  73 |   73/    /  75 |   68[ 70]/  74 |
|512  |  151/ 197/ 177 |   95/    / 131 |   98/    / 137 |   84[103]/ 137 |
|511  |  158/ 208/ 166 |  100/    / 132 |  117/    / 134 |   99[112]/ 135 |
|1024 |  274/ 395/ 314 |  179/    / 255 |  211/    / 270 |  166[207]/ 257 |
|1023 |  280/ 408/ 303 |  196/    / 253 |  225/    / 267 |  184[185]/ 253 |
|2048 |  579/ 765/ 966 |  350/    / 485 |  394/    / 511 |  389[388]/ 486 |
|2047 |  585/ 777/ 942 |  368/    / 484 |  410/    / 520 |  323[398]/ 484 |
|4096 | 1009/1385/1140 |  704/    /1036 |  761/    /1040 |  671[583]/1038 |
|4095 | 1027/1386/1130 |  721/    /1034 |  776/    /1037 |  602[604]/1035 |
|--------------------------------------------------------------------------

P.S.: ooops - i've really forgotten that the SSE stuff has been
completely disabled at a later time!  I guess we'll have to redo some
testing eventually!
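A note on the FPU variant (3.), since that FILD/FISTP idea is what is
thanked for above: the trick is to push a 64 bit integer onto the x87
stack and immediately pop it to the destination, which yields a 64 bit
move on 32-bit x86 without needing MMX or SSE.  All 64 bit patterns
survive the round trip exactly, because the x87 format has a 64 bit
significand.  As a purely illustrative sketch in GCC inline assembly
(made-up name, not the FreeBSD code):

  /* Copy one 64 bit quantity through the x87 FPU stack (sketch only). */
  static inline void
  fild_fistp_copy8(unsigned long long *dst, const unsigned long long *src)
  {
          __asm__ volatile(
                  "fildq  %1\n\t"     /* push *src onto the x87 stack   */
                  "fistpq %0"         /* pop it and store it into *dst  */
                  : "=m" (*dst)
                  : "m" (*src));
  }

In a real loop this would of course be unrolled, e.g. a run of FILDs
followed by the matching FISTPs in reverse order, since the x87 stack
is a LIFO.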
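For completeness, the MOVNTQ/SFENCE/EMMS combination of the
(meanwhile disabled) _SSE big loop can also be expressed with compiler
intrinsics rather than hand-written assembler.  The following is a
hypothetical sketch of the 64 byte loop only (made-up name, not the
code measured above); it assumes both pointers are 8 byte aligned and
the length is a multiple of 64:

  #include <stddef.h>
  #include <mmintrin.h>    /* __m64, _mm_empty (EMMS)             */
  #include <xmmintrin.h>   /* _mm_stream_pi (MOVNTQ), _mm_sfence  */

  /* Hypothetical 64 byte big loop: MOVQ loads paired with MOVNTQ
   * non-temporal stores, then SFENCE and EMMS, as described above. */
  static void
  nt_big_loop(__m64 *to, const __m64 *from, size_t len)
  {
          size_t i;

          for (; len >= 64; to += 8, from += 8, len -= 64)
                  for (i = 0; i < 8; i++)
                          _mm_stream_pi(to + i, from[i]);
          _mm_sfence();   /* serialize the non-temporal stores */
          _mm_empty();    /* EMMS: leave the MMX state         */
  }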