From owner-freebsd-current Fri Apr 5 01:35:45 1996
Return-Path: owner-current
Received: (from root@localhost)
          by freefall.freebsd.org (8.7.3/8.7.3) id BAA07169
          for current-outgoing; Fri, 5 Apr 1996 01:35:45 -0800 (PST)
Received: from silvia.HIP.Berkeley.EDU (silvia.HIP.Berkeley.EDU [136.152.64.181])
          by freefall.freebsd.org (8.7.3/8.7.3) with ESMTP id BAA07119
          for ; Fri, 5 Apr 1996 01:35:29 -0800 (PST)
Received: (from asami@localhost)
          by silvia.HIP.Berkeley.EDU (8.7.5/8.6.9) id BAA24263;
          Fri, 5 Apr 1996 01:35:16 -0800 (PST)
Date: Fri, 5 Apr 1996 01:35:16 -0800 (PST)
Message-Id: <199604050935.BAA24263@silvia.HIP.Berkeley.EDU>
To: current@freebsd.org
CC: nisha@cs.berkeley.edu, tege@matematik.su.se, hasty@rah.star-gate.com
Subject: fast memory copy for large data sizes
From: asami@cs.berkeley.edu (Satoshi Asami)
Sender: owner-current@freebsd.org
X-Loop: FreeBSD.org
Precedence: bulk

We've put together a fast memory copy that uses floating point
registers to speed up large transfers. The original idea comes from an
old post by Amancio Hasty, who suggested using floating point registers
to move 8 bytes at a time. (We tried using integer registers too, but
the best we could manage with our wits was 10MB/s slower than the FP
version.)

By the way, we plugged this in as a replacement for copyin/copyout on
our ccd testing machine (ccd is a striping disk driver; see
http://stampede.cs.berkeley.edu/ccd/ for details), and maximum read
performance improved from 21MB/s to 24MB/s using 9 disks. But that's
probably of interest only to us, so here's a comparison with the libc
bcopy() (which is essentially the same code as the stock
copyin/copyout).

Here is the kind of numbers we are seeing, and hope you will see, if
you run the program attached at the end of this mail:

90MHz Pentium (silvia), SiS chipset, 256KB cache:

    size         libc              ours
      32    15.258789 MB/s     6.103516 MB/s
      64    20.345052 MB/s    15.258789 MB/s
     128    17.438616 MB/s    15.258789 MB/s
     256    17.438616 MB/s    20.345052 MB/s
     512    17.438616 MB/s    22.194602 MB/s
    1024    17.438616 MB/s    23.251488 MB/s
    2048    17.755682 MB/s    23.818598 MB/s
    4096    17.836758 MB/s    23.390719 MB/s
    8192    17.715420 MB/s    24.112654 MB/s
   16384    17.341842 MB/s    24.338006 MB/s
   32768    17.361111 MB/s    25.080257 MB/s
   65536    17.715420 MB/s    24.423603 MB/s
  131072    17.373176 MB/s    25.237230 MB/s
  262144    17.553714 MB/s    24.723101 MB/s
  524288    17.401594 MB/s    24.951345 MB/s
 1048576    14.804506 MB/s    24.252419 MB/s
 2097152    17.732383 MB/s    24.392326 MB/s
 4194304    17.219484 MB/s    23.491825 MB/s

133MHz Pentium (sunrise), Triton chipset, 512KB (pipeline burst) cache:

    size         libc              ours
      32          N/A         30.517578 MB/s
      64    61.035156 MB/s    30.517578 MB/s
     128    40.690104 MB/s    40.690104 MB/s
     256    40.690104 MB/s    40.690104 MB/s
     512    40.690104 MB/s    48.828125 MB/s
    1024    40.690104 MB/s    51.398026 MB/s
    2048    39.859694 MB/s    51.398026 MB/s
    4096    39.859694 MB/s    52.083333 MB/s
    8192    39.457071 MB/s    52.787162 MB/s
   16384    39.556962 MB/s    52.966102 MB/s
   32768    39.506953 MB/s    53.146259 MB/s
   65536    39.457071 MB/s    53.282182 MB/s
  131072    39.457071 MB/s    53.327645 MB/s
  262144    39.345294 MB/s    53.350405 MB/s
  524288    39.044198 MB/s    53.430220 MB/s
 1048576    38.086533 MB/s    53.447354 MB/s
 2097152    37.706680 MB/s    53.387433 MB/s
 4194304    37.628643 MB/s    53.280763 MB/s

As you can see, from a certain size onwards (256 bytes on the 90MHz
machine, 512 bytes on the 133MHz one), it is much faster than the libc
version. ("size" is in bytes.)

The program allocates two 4MB buffers and calls libc's bcopy (which is
essentially a string move using rep/movsl; see below for more on this)
or our enhanced version (called "unrolled") repeatedly with "size"
increments until all data is copied.
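
To make the testing method concrete, here is a rough C sketch of the
loop described above. It is not the lmbench-derived program from the
tarball: the names time_copy() and run_test() are made up for
illustration, memcpy() stands in for whichever copy routine is being
measured, and the timing uses plain gettimeofday(), which is an
assumption about how one might measure it.

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>
#include <string.h>
#include <sys/time.h>

#define BUFSIZE (4 * 1024 * 1024)	/* two 4MB buffers, as in the text */

/*
 * Copy the buffer from src to dst in chunks of "size" bytes and return
 * the elapsed wall-clock time in seconds.  (Hypothetical helper.)
 */
double
time_copy(char *dst, const char *src, size_t size)
{
	struct timeval start, end;
	size_t off;

	assert(size > 0 && size <= BUFSIZE);
	gettimeofday(&start, NULL);
	for (off = 0; off + size <= BUFSIZE; off += size)
		memcpy(dst + off, src + off, size);	/* routine under test */
	gettimeofday(&end, NULL);

	return (end.tv_sec - start.tv_sec) +
	    (end.tv_usec - start.tv_usec) / 1e6;
}

/*
 * Fill the source with pseudo-random garbage, run one timed pass and
 * check that every copied byte arrived.  Stores the rate in *mbps
 * (0.0 if the clock was too coarse to measure the pass).
 * Returns 0 on success, -1 on mismatch or allocation failure.
 */
int
run_test(size_t size, double *mbps)
{
	char *src, *dst;
	size_t i, copied;
	double secs;
	int ok;

	assert(size > 0 && size <= BUFSIZE);
	copied = (BUFSIZE / size) * size;	/* bytes the loop actually moves */

	src = malloc(BUFSIZE);
	dst = malloc(BUFSIZE);
	if (src == NULL || dst == NULL) {
		free(src);
		free(dst);
		return -1;
	}
	for (i = 0; i < BUFSIZE; i++) {
		src[i] = (char)rand();
		dst[i] = 0;
	}

	secs = time_copy(dst, src, size);
	ok = memcmp(dst, src, copied) == 0;
	*mbps = secs > 0.0 ? (copied / 1048576.0) / secs : 0.0;

	free(src);
	free(dst);
	return ok ? 0 : -1;
}
```

Calling run_test() once for each power of two from 32 to 4194304 would
produce one row per size, in the same spirit as the tables above.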

The test program itself is taken from lmbench, and I added some random
garbage initialization and post-testing to make sure all the data is
correctly copied.

I'll include the C program too, but it won't do you much good other
than as a reference to the testing method, because we rewrote the
function "unrolled" in assembly language (why such a simple program
gets so mangled by the compiler is a mystery).

The "meat" of each routine is included here in plain text. First, the
libc bcopy:

===
	movl	%ecx,%edx
	cld			/* clear direction flag: copy forward */
	shrl	$2,%ecx		/* count /= 4 for word copy */
	rep
	movsl
	movl	%edx,%ecx
	andl	$3,%ecx		/* count %= 4 for the remaining bytes */
	rep
	movsb
===

This is not exactly what libc does, but I've taken the liberty of
simplifying it for discussion purposes. The "movs*" instruction, with a
"rep" prefix, copies %ecx ("count") items from the address in %esi
("source index") to the address in %edi ("destination index"). The
"movsl" is for 32-bit moves (hence the shift right by 2 in the count)
and "movsb" is for 8-bit moves (to copy the up to 3 bytes left over).

Of course, the whole thing could be done as:

===
	rep
	movsb
===

but that's going to be a little slow 'cause it's byte-by-byte moves.
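
For readers who don't speak x86, the simplified libc routine above
corresponds roughly to the following C. This is only a sketch of the
same two-phase idea (32-bit words first, then the 0 to 3 leftover
bytes); the name word_copy() is invented here, and like the listing it
assumes a forward copy of non-overlapping regions.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Forward copy of "count" bytes: whole 32-bit words, then the remainder. */
void
word_copy(void *dst, const void *src, size_t count)
{
	unsigned char *d = dst;
	const unsigned char *s = src;
	size_t words = count >> 2;	/* shrl $2,%ecx: count /= 4 */
	size_t tail = count & 3;	/* andl $3,%ecx: count %= 4 */
	uint32_t w;

	assert(count == 0 || (dst != NULL && src != NULL));

	while (words-- > 0) {		/* rep movsl */
		memcpy(&w, s, sizeof w);	/* 4-byte load, safe if unaligned */
		memcpy(d, &w, sizeof w);	/* 4-byte store */
		s += sizeof w;
		d += sizeof w;
	}
	while (tail-- > 0)		/* rep movsb */
		*d++ = *s++;
}
```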

Here's ours:

===
	cmpl	$63,%ecx	/* if less than 64 bytes, go to end */
	jbe	L54
#	movl	%cr0,%edx
#	movl	$8, %eax	/* CR0_TS */
#	not	%eax
#	andl	%eax,%edx	/* clear CR0_TS */
#	movl	%edx,%cr0
	subl	$108,%esp	/* save all floating point registers */
	fsave	(%esp)
	.align 2,0x90
L55:
	fildq	0(%esi)		/* load quadword (64-bit) int into FP registers */
	fildq	8(%esi)
	fildq	16(%esi)
	fildq	24(%esi)
	fildq	32(%esi)
	fildq	40(%esi)
	fildq	48(%esi)
	fildq	56(%esi)
	fxch	%st(7)		/* exchange top of stack with 8th position */
	fistpq	0(%edi)		/* store quadword */
	fxch	%st(5)
	fistpq	8(%edi)
	fxch	%st(3)
	fistpq	16(%edi)
	fxch	%st(1)
	fistpq	24(%edi)
	fistpq	32(%edi)
	fistpq	40(%edi)
	fistpq	48(%edi)
	fistpq	56(%edi)
	addl	$-64,%ecx
	addl	$64,%esi
	addl	$64,%edi
	cmpl	$63,%ecx
	ja	L55
	frstor	(%esp)		/* restore FP registers */
	addl	$108,%esp
#	andl	$8,%edx
#	movl	%cr0,%eax
#	orl	%edx, %eax	/* reset CR0_TS to the original value */
#	movl	%eax,%cr0
L54:
	cld			/* do the rest; at most 63 bytes so we */
	rep			/* don't really care about speed here */
	movsb
===

(Don't worry about the commented out lines for now; those were
necessary to temporarily enable FP operations in the kernel.)

This routine works by loading eight bytes at a time into a floating
point register using the fildq (integer load quadword) operation, and
storing them with the fistpq (integer store and pop quadword)
operation. (You can't use fld and fst because they will trap on illegal
(as a floating point number) bit patterns -- by the way, the Pentium FP
regs are 80 bits with a 64-bit mantissa, so there's no loss of data by
using the integer load/store.) The Pentium FP unit is a stack of 8
registers, hence the "pop" and fxch thingies.

Also, we save and restore the FP state with fsave and frstor whenever
we use the FP regs. Since that means writing and reading 108 bytes, the
routine should be limited to large transfers.

I'd like people to try the following tarball on their machines, so we
can see if it really works for everybody and not just in California.
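
(For reference, the control structure of the routine above can be
written in C roughly as follows. This is only a sketch: it moves the
data through ordinary 64-bit integers rather than the FP registers, so
it shows the shape of the loop, eight 8-byte loads followed by eight
8-byte stores per 64-byte block and then a byte-wise tail, but it says
nothing about the speed of the assembly version. The name block_copy()
is invented for this illustration, and like the assembly it does not
handle overlapping copies.)

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Forward copy: 64-byte blocks via eight 8-byte moves, then a byte tail. */
void
block_copy(void *dst, const void *src, size_t count)
{
	unsigned char *d = dst;
	const unsigned char *s = src;
	uint64_t q[8];			/* plays the role of the 8 FP registers */
	int i;

	assert(count == 0 || (dst != NULL && src != NULL));

	while (count > 63) {		/* cmpl $63,%ecx ; ja L55 */
		for (i = 0; i < 8; i++)	/* eight loads, like the fildq run */
			memcpy(&q[i], s + 8 * i, 8);
		for (i = 0; i < 8; i++)	/* eight stores, like the fistpq run */
			memcpy(d + 8 * i, &q[i], 8);
		s += 64;
		d += 64;
		count -= 64;
	}
	while (count-- > 0)		/* rep movsb: at most 63 bytes left */
		*d++ = *s++;
}
```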
Please type "make" and it will compile & run the tests. The output is
already formatted (like the table you see above) so you can easily
forward it to the list.

Of course, before this can go into the library/kernel, we need to make
sure it is run only on machines with an FP unit (prob. only on
Pentiums), make it work with overlapping copies, etc., but I wanted to
see what people think about it first.

Satoshi
-------
begin 644 bcopy.tar.gz
[uuencoded tarball data]
M@O@DKFC_]W"UR!#7JS:Q?M>#-`GU_UK[?^Z]"?$Q_X'ARO,??=>E]S\< M]][^OXC]$R%[>NIV(F][Y+H.R[D1X><^B)?GX7P"+_0JB(XI-GZOS.;<+GLY M?Q%V"FU8UY%LH>OB3<)\7?N!^9S'C_E847(I:RE1U^!P(%F;AEF8^&%#URY] MGW1/+++UF'3/@4Y.VFB-`O:B$*`,QGO=<330DB+RZ8@IV?I!H!X*W,,/(!]R M[,@]@-Z(%NLE3A4CA.@$T5'9-'"CC8BHK?]M^Y=W>;Z&_5N6V1?QWW0&)CW_ M9=_;_Q>\_[7U+!@1I@0]OW-%S%Z?F+N[[HXQW+%<8M@CVX#_2#PGAZL%V8(^ MV.T)Z`ZLEV8%:?HM[."0(R_+KLES_U_I=0]DRQL>0&A.E\5B69#+-,QQL<7. MEODO!%3.3=?#L-[)^^:*^@*/_1*"?D',MH!6'GUK.<(&-G4B4 MD'GD9RGP*DV"G%&QI)<(Z>T.2D7]CATP<8I'T6^@CJTK;R0.5KMX)JKY+<73 M(64/.3/UA@-M)63;%N)$*8L*?O7I/3^5S[!SY(4)_QOE58/:5-J\61O;M0M# MO6!`J[J/^BPK79:5/G@AIM;O>Z+0NY&8;K=3'PQO M)S#.4$;_3O[/W>YG&N,#\=^V^H82_^GZWW#M^_C_A>,_5P(:_AT>_OL[ MEDDLG@FX]W%D]/QRX,G%YJ]V[T/]_S7X=FY)GN9,AMBFING_INP(#D` M(=7RK\AD.86E&TD7F#Z\\W`W/-=5?._*09:^!?SIY='&KF M>O.??CG3K#7P\;H%?B/+_ZI*>E7JU)Y?G%X"I5]7L6UEYS_>N[*=NP" M\Z1#XC!I:;`8S,,"BP:#E.W8]>`.":""-O2AC">BL2R;0;X:A"MLY[!7C1EE\_S9TX.CL^<=@L<>6WO$B[ULW@17 MWB9]FB_1?XV!7HWG_Q)#>>&7#<,R-5[G!7^`*Y;%;(F'S65Q`EG@VR@H9A+R M9B)?Y\HK)H5**8XC64*-*#&R=*LE$]NG81)F7@&3GF;IG,SSR]X*+VOZ,SQ/ MSKU/0&:@GB-ZF3R]S+PY06L8GYZ=/,5KC.B8KL(L!YUC%:B[]'XC7LJ#/V>G M3ZC]-+$,V3LQ64JEO6?O]*^].X!L"&DZC4,O#TF0TCOH81#Q:[&X78,I+]YC M?>OED,D*VI