Date:      Wed, 11 Jun 2003 11:12:22 -0500
From:      "Aron J. Silverton" <ajs@labs.mot.com>
To:        "'Peter Grehan'" <grehan@freebsd.org>, <Sean_Welch@alum.wofford.org>
Cc:        freebsd-ppc@freebsd.org
Subject:   RE: Altivec use on FreeBSD?
Message-ID:  <4D87884B6A6D4E438A8592BCC9C85DCA079C31D9@il02exm06.corp.mot.com>
In-Reply-To: <3EE06E73.6B9F9F33@freebsd.org>

owner-freebsd-ppc@freebsd.org wrote:
>> Any chance the library mentioned on the site below could be
>> used/adapted?
>
>  Possibly, but it would depend on the license (I haven't looked).
>
>  NetBSD can use the Altivec unit for page-zeroing.
>
> later,
>
> Peter.
> _______________________________________________
> freebsd-ppc@freebsd.org mailing list
> http://lists.freebsd.org/mailman/listinfo/freebsd-ppc To unsubscribe,
> send any mail to "freebsd-ppc-unsubscribe@freebsd.org"

I'm dumping the README below this message.  There are a lot of links
mentioned that provide additional information.  I clicked through all of
the license stuff, but here is one bit:

LICENSE GRANT. Motorola grants to you, free of charge, the non-exclusive,
non-transferable right (1) to use the Software, (2) to reproduce the
Software, (3) to prepare derivative works of the Software, (4) to
distribute the Software and derivative works thereof in source
(human-readable) form and object (machine-readable) form, and (5) to
sublicense to others the right to use the distributed Software. If you
violate any of the terms or restrictions of this Agreement, Motorola may
immediately terminate this Agreement, and require that you stop using and
delete all copies of the Software in your possession or control.

COPYRIGHT. The Software is licensed to you, not sold. Motorola owns the
Software, and United States copyright laws and international treaty
provisions protect the Software. Therefore, you must treat the Software
like any other copyrighted material (e.g., a book or musical recording).
You may not use or copy the Software for any other purpose than what is
described in this Agreement. Except as expressly provided herein, Motorola
does not grant to you any express or implied rights under any Motorola or
third party patents, copyrights, trademarks, or trade secrets.
Additionally, you must reproduce and apply any copyright or other
proprietary rights notices included on or embedded in the Software to any
copies or derivative works made thereof, in whole or in part, if any.

Feel free to look at the whole thing at http://www.motorola.com/altivec.
The license is not in the archive; you have to click through it to
download the library.  I'm looking into getting a Sandpoint board or two
that would have a G4 with AltiVec on them.  The boards that I have today,
which I mentioned previously in regard to FreeBSD-PowerPC, don't.

Aron

README:

//------------------------------------------------------------------
// file:  readme.txt
//    Readme to accompany libmotovec.a
//------------------------------------------------------------------

Rev 0.30 release - 5/28/2003 by Chuck Corley

This release includes two new files, string_vec.S and checksum_vec.S,
which you could paste into the Linux kernel files:
/arch/ppc/lib/string.S  and
/arch/ppc/lib/checksum.S
if you wanted to employ AltiVec in the Linux kernel.  We used the
memcpy_vec and csum_partial_copy_generic_vec functions from these
files only in the modified versions of /net/core/skbuf.c and
/net/core/iovec.c to give us the networking performance boost in
Linux described in the SNDF presentation "Accelerating Networking Data
Movement Using the AltiVec(R) Technology" at www.motorola.com/sndf under
Dallas-2003/Host Processors (H1110).  Also see the white paper
"Enhanced TCP/IP Performance with AltiVec Technology" at
e-www.motorola.com/brdata/PDFDB/docs/ALTIVECTCPIPWP.pdf

These files contain the following functions
string.S contains:                   string_vec.S contains:
memcpy                               memcpy_vec
bcopy                                bcopy_vec
memmove                              memmove_vec
backwards_memcpy                     backwards_memcpy_vec
memset                               memset_vec
memcmp                               memcmp_vec
memchr                               (coming soon)
cacheable_memcpy                     cacheable_memcpy_vec
cacheable_memzero                    cacheable_memzero_vec
strcpy                               strcpy_vec
strncpy                              (coming soon)
strcat                               (coming soon)
strcmp                               strcmp_vec
strlen                               strlen_vec
__copy_tofrom_user*                  __copy_tofrom_user_vec*
__clear_user*                        __clear_user_vec*
__strncpy_from_user*                 (coming soon)
__strnlen_user*                      (coming soon)

checksum.S contains:                 checksum_vec.S contains:
csum_partial                         csum_partial_vec
csum_partial_copy_generic*           csum_partial_copy_generic_vec
ip_fast_csum                         (unlikely to benefit)
csum_tcpudp_magic                    (unlikely to benefit)

*these functions have ex_table entries for handling memory access
exceptions in the kernel.  The AltiVec versions were functionally
tested by hand.

csum_partial_copy_generic_vec and csum_partial_vec, previously
assembled into libmotovec.a, have been removed since they are in the files
above.  We are finding that selective use of the *_vec functions in
the OS kernel is much "safer" than wholesale replacement of the libc
library.  libmotovec.a returns to being exclusively a performance-enhancing
library of libc functions that can be safely linked with user application
code to test the performance of AltiVec.

My presentation for SNDF-Europe includes performance comparisons
of the scalar versus vector versions of the above functions.  It should
be available on the SNDF website soon. It also includes an updated
explanation of memcpy without the potential incoherency problem discussed
below.

So this release contains in libmotovec.a:
memcpy.o           from vec_memcpy.S Rev 0.30 dated  4/02/2003
bcopy.o            from vec_memcpy.S Rev 0.30 dated  4/02/2003
memmove.o          from vec_memcpy.S Rev 0.30 dated  4/02/2003
memset.o           from vec_memset.S Rev 0.10 dated  5/01/2003
bzero.o            from vec_memset.S Rev 0.10 dated  5/01/2003
strcmp.o           from vec_strcmp.S Rev 0.00 dated  3/03/2002
strlen.o           from vec_strlen.S Rev 0.00 dated 12/26/2002

And in string_vec.S:
memcpy_vec                  derived from vec_memcpy.S Rev 0.30 dated 4/02/2003
bcopy_vec                   derived from vec_memcpy.S Rev 0.30
memmove_vec                 derived from vec_memcpy.S Rev 0.30
backwards_memcpy_vec        derived from vec_memcpy.S Rev 0.30
memset_vec                  derived from vec_memset.S Rev 0.10 dated 5/01/2003
memcmp_vec                  derived from vec_memcmp.S Rev 0.00
memchr                      (coming soon)
cacheable_memcpy_vec        derived from vec_memcpy.S Rev 0.30
cacheable_memzero_vec       derived from vec_memset.S Rev 0.10
strcpy_vec                  derived from vec_strcpy.S Rev 0.10
strncpy_vec                 (coming soon)
strcat_vec                  (coming soon)
strcmp_vec                  derived from vec_strcmp.S Rev 0.00 (not released)
strlen_vec                  derived from vec_strlen.S Rev 0.00 (not released)
__copy_tofrom_user_vec*     derived from vec_memcpy.S Rev 0.30
__clear_user_vec*           derived from vec_memcpy.S Rev 0.30
__strncpy_from_user_vec*    (coming soon)
__strnlen_user_vec*         (coming soon)
*with ex_table and exception code

And in checksum_vec.S:
csum_partial_vec               derived from vec_csum.S Rev 0.0 dated 4/19/03
csum_partial_copy_generic_vec  derived from vec_csum.S Rev 0.0

string_vec.S and checksum_vec.S are only known to assemble with gcc 2.95
and gcc 3.3+.  Should work with other gcc compilers but may need
editing to be compatible with non-gcc compilers.

Rev 0.20 release - 5/12/2003 by Chuck Corley

Thanks to all of you who attended SNDF.  My presentation "Implementing
and Using the Motorola AltiVec Libraries" is available for downloading
at www.motorola.com/sndf under Dallas-2003/Host Processors (H1109).

During the presentation DS from Lucent pointed out that the way I was
bringing the beginning and ending destination Quad Words (vectors) into
the registers for merging with the permuted source made the
"uninvolved" destination bytes vulnerable to potential incoherency if
some interrupting process changed those bytes while I was holding them
in a register.  While the possibility seemed small, I have rewritten the
code to avoid this potential problem.  The result actually is slightly
faster than the original for small buffers.
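
The README does not say how the rewritten code avoids the problem, so the
following is only a sketch of one possible approach, not the library's
actual method.  It never reads, merges, and rewrites a destination quad
word that is only partly covered by the copy: partial quad words at either
end get byte stores, and only quad words lying wholly inside the
destination get a 16-byte store.  The function name is hypothetical, and a
16-byte memcpy stands in for the vector load/permute/store sequence.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#define QW 16u  /* bytes in one AltiVec quad word */

/* Hypothetical helper: copy n bytes without ever writing a byte outside
 * [dst, dst + n).  Bytes next to the buffer are never held in a register
 * and stored back, so a concurrent update to them cannot be lost. */
static void *copy_without_edge_rmw(void *dst, const void *src, size_t n)
{
    unsigned char *d = dst;
    const unsigned char *s = src;

    assert(dst != NULL && src != NULL);

    /* Head: byte stores until the destination reaches a 16-byte boundary. */
    while (n > 0 && ((uintptr_t)d % QW) != 0) {
        *d++ = *s++;
        n--;
    }

    /* Body: quad words entirely inside the destination.  The 16-byte
     * memcpy is a stand-in for a full-width vector store. */
    while (n >= QW) {
        memcpy(d, s, QW);
        d += QW;
        s += QW;
        n -= QW;
    }

    /* Tail: remaining bytes of the last, partly covered quad word. */
    while (n > 0) {
        *d++ = *s++;
        n--;
    }
    return dst;
}
```

Calling copy_without_edge_rmw(buf + 3, src, 40) on a buffer pre-filled with
a marker value leaves buf[2] and buf[43] holding that marker, which is the
property the incoherency discussion is about.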

So this release contains:
memcpy.o       from vec_memcpy.S Rev 0.30 dated 4/02/2003
bcopy.o        from vec_memcpy.S Rev 0.30 dated 4/02/2003
memmove.o      from vec_memcpy.S Rev 0.30 dated 4/02/2003
memset.o       from vec_memset.S Rev 0.10 dated 5/01/2003
bzero.o        from vec_memset.S Rev 0.10 dated 5/01/2003
csum_partial_copy_generic_vec from vec_csum.S Rev 0.0 dated 4/19/03
csum_partial_vec from vec_csum.S Rev 0.0 dated 4/19/03

The latter two additions were assembled into libmotovec.a despite the
fact they are not standard libc functions.  Rather they are the
AltiVec-enabled equivalents of functions by the same name from the Linux
source tree (Linux 2.4.17).  While we are pursuing how to get these
functions incorporated into Linux, here they are assembled and in
source form if you are building your own version of Linux.  The use
of an earlier version of csum_partial_copy_generic_vec and memcpy_vec is
documented to speed up TCP/IP and UDP transfers in Jacob Pan's SNDF
presentation "Accelerating Networking Data Movement Using AltiVec
Technology" (H1110) available at the website above.  csum_partial
does not appear to be called with large enough buffer sizes in Linux
to warrant using the vectorized version.

I am also releasing the source for memset and bzero in this release.
strcpy, strlen, strncpy, strcmp, memcmp, strcat, and memchr are still
on my list to do - soon.

Rev 0.10 release - 3/13/2003 by Chuck Corley

The presence of dcbz in the 32 byte loop of memcpy (or memmove)
causes an alignment exception to non-cacheable memory (MPC7410 User's
Manual p. 4-20 and MPC7450 User's Manual p. 4-25) so it was
removed in this release.  dcbz instructions were not present in
memset in any of these releases.  That fixed the alignment problem
but hurt the performance some; then it was "rediscovered" that
dcba would have been a better choice anyway as it does not cause
an exception; it would just be noop'ed.  So this release substitutes
dcba for dcbz.

This release contains improvements in memcpy that should be
documented in an application note which is still not finished but
are being pretty nicely documented for SNDF presentation H1109.

The memcpy was further loop unrolled to provide a 128B loop for
large buffers (>256 bytes) and the data stream touch instruction
was added.  It may still be possible to improve the tuning of
the dst instruction, particularly in memmove, but this release
is worthy of revving the number to the next significant revision.

I've developed a new metric which will be explained at SNDF in
Dallas, TX, March 23-26, 2003.  As the number of bytes in a
buffer gets larger, the memcpy routine settles into repetitions
of the inner loop.  32 bytes were moved in the inner loop of
Rev 0.0x and 128 bytes are moved in the inner loop of Rev 0.10.
And the number of processor clocks per inner loop can be shown
to approach the minimum possible.  Therefore the new metric
measures the incremental transfer rate for the inner loop after
a reasonable number (>512) of bytes have been moved.  This will
not be the bytes transferred per second because there were some
less efficient transfers at start-up but this is the transfer
rate that the routine is asymptotically approaching as the buffer
gets big (regularly testing to 1460 bytes).

Here is that metric for several cases:

Case 1: For gcc's libc memcpy when buffers are not word aligned
Case 2: For gcc's libc memcpy when buffers are word aligned
Case 3: For Rev 0.01 of memcpy with AltiVec irrespective of alignment
Case 4: For Rev 0.10 of memcpy with AltiVec irrespective of alignment

Numbers are provided for the cold DCache and warm DCache.  Code is
assumed to always be resident in the ICache as would be expected here
where the inner loop has run multiple times.

                                  COLD DCACHE            WARM DCACHE
 FOR THE MPC7410@400/100      Insts  Clks  MB/Sec    Insts  Clks  MB/Sec
Case 1: gcc_NWA (1 byte/loop)     6     6      71        6     3     133
Case 2: gcc_WA (16 B/loop)       12    62     103       12     8     800
Case 3: vec_memcpy Rev 0.01      12    60     213       12     7    1961
Case 4: vec_memcpy Rev 0.10      46   125     410       46    41    1250


                                  COLD DCACHE            WARM DCACHE
 FOR THE MPC7445@1GHz/133     Insts  Clks  MB/Sec    Insts  Clks  MB/Sec
Case 1: gcc_NWA                   6     8     122        6     3     350
Case 2: gcc_WA                   12   104     153       12    12    1333
Case 3: vec_memcpy Rev 0.01      12   110     292       12     7    4413
Case 4: vec_memcpy Rev 0.10      46   247     518       46    35    3666

Perhaps you notice that we are trading off Warm DCache performance to
improve the Cold DCache case.  There are other interesting tradeoffs
in going from 32 byte inner loop to 128 bytes.  And in using the dcba
instruction - or not.  In other words, the numbers for vec_memcpy above
are not the highest possible in the Warm DCache case but they look like
a good compromise which most benefits the Cold DCache case.  More at SNDF
(or eventually in the app note) ...
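
The README does not spell out how the MB/Sec columns are derived.  A
plausible reading, offered here as an assumption rather than the author's
stated formula, is bytes moved per inner-loop pass times the core clock in
MHz, divided by clocks per pass.  The helper below is hypothetical.  It
reproduces several table entries (16 B/loop at 400 MHz in 8 clocks gives
800; 128 B/loop at 400 MHz in 125 clocks gives about 410) but not all of
them exactly (32 B/loop at 400 MHz in 7 clocks gives about 1829, against
1961 in the table), so the printed clock counts may be rounded or the
original calculation may differ.

```c
#include <assert.h>

/* Hypothetical helper: incremental transfer rate of an inner loop in
 * MB/s, taking 1 MB as 10^6 bytes.
 *   bytes_per_loop  - bytes moved by one pass of the inner loop
 *   core_mhz        - processor core clock in MHz
 *   clocks_per_loop - processor clocks consumed by one pass */
static double incremental_rate_mb_s(double bytes_per_loop,
                                    double core_mhz,
                                    double clocks_per_loop)
{
    assert(clocks_per_loop > 0.0);
    return bytes_per_loop * core_mhz / clocks_per_loop;
}
```

For example, incremental_rate_mb_s(128, 1000, 247) evaluates the Rev 0.10
cold-cache row for the MPC7445 under this assumption.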

I am releasing the source code to vec_memcpy.S with this release so
if you don't like the tradeoff above you can make your own selection.  It
successfully assembles for me with CodeWarrior, Diab, Green Hills, gcc,
and Metaware.  It is nicely commented but could use more documentation.
I will specifically be explaining it in SNDF presentation H1109.

**********************************************************************

Rev 0.01 release - 2/17/2003 by Chuck Corley

Fixed a problem at Last_ld_fwd: that caused a load beyond a page
boundary and a resulting segmentation fault in Linux.  The last source
load of SRC+BK in vec_memcpy could be > SRC+BC-1.  Also found and fixed
an error where the Quick and Dirty (QND) code that was in there for
dst wasn't completely commented out.  Plan to enable dst soon.
Probably loop unroll to 128 bytes first though.
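
The page-boundary fix can be illustrated with address arithmetic alone.
The sketch below is not taken from vec_memcpy.S; the helper names and the
4096-byte page size are assumptions made for illustration.  It relies on
the fact that an lvx load ignores the low four bits of its effective
address, so a load addressed at the last valid source byte (SRC+BC-1)
stays in that byte's page, while a load addressed one byte further can
land in the next page when the buffer ends exactly on a page boundary.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define QW_BYTES   16u    /* width of one vector load */
#define PAGE_BYTES 4096u  /* assumed page size for this illustration */

/* Address of the 16-byte block that holds the last source byte,
 * SRC + BC - 1.  Masking the low four bits models what lvx fetches. */
static uintptr_t last_block_addr(uintptr_t src, size_t bc)
{
    assert(bc > 0);
    return (src + bc - 1) & ~(uintptr_t)(QW_BYTES - 1);
}

/* Nonzero when two addresses fall in the same page. */
static int same_page(uintptr_t a, uintptr_t b)
{
    return (a / PAGE_BYTES) == (b / PAGE_BYTES);
}
```

With src = 0x1FF0 and bc = 16, the last byte sits at 0x1FFF and
last_block_addr() returns 0x1FF0, in the same page; the address src + bc
is 0x2000, the first byte of the following page, which may not be mapped.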

**********************************************************************

Initial Release - 2/10/2003 by Chuck Corley

Contains the libc functions:
memcpy.o       from vec_memcpy.S Rev 0.0 dated 2/09/2003
bcopy.o        from vec_memcpy.S Rev 0.0 dated 2/09/2003
memmove.o      from vec_memcpy.S Rev 0.0 dated 2/09/2003
memset.o       from vec_memset.S Rev 0.0 dated 2/09/2003
bzero.o        from vec_memset.S Rev 0.0 dated 2/09/2003

These functions are implemented in AltiVec but are still not as fast
as we know how to make them.  Watch this site for frequent revisions
over the next several months.

We are in the process of creating application notes to explain the
source code and the performance associated with these library functions;
watch this site for those application notes to be added.  A logical
deadline for completion of this work is the Smart Network Developers
Forum in Dallas, TX, March 23-26, 2003, where we will be discussing this
library, its performance, and application.

We will also be adding the following libc functions in the very near
future:
strcpy
strcmp
strlen
memcmp
memchr
strncpy

We also have preliminary work completed on the following functions
found in Linux and have to figure out how to distribute them:
csum_partial
csum_partial_generic
__copy_tofrom_user
page_copy

We believe that these libraries will improve performance on Motorola G4
processors for applications that make heavy use of the included functions.
On non-G4 microprocessors they will cause illegal operation exceptions
because those processors do not support AltiVec.

To use this library, you must:
1. Include it on the linker command line prior to the compiler's libc
library.

Examples:
For gcc:
powerpc-eabisim-ld -T../../spprt/gcc_dink.script -Qy -dn -Bstatic
../../spprt/gcc_obj/gcc_crt0.o  ../../spprt/gcc_obj/dtime.o
../../spprt/gcc_obj/cache.o  ../../spprt/gcc_obj/Support.o
../../spprt/gcc_obj/dinkusr.o  ../../spprt/gcc_obj/perfmon.o
gcc_obj/test_memmove.o c:\BMS\vec_lib\libmotovec\libmotovec.a
c:/cygwin/Altivec/powerpc-eabisim\lib\libm.a --start-group -lsim -lc
--end-group -o gccBM.elf

For Diab:
dld ../../spprt/diab_dink.dld ../../spprt/diab_obj/diab_crt0.o
../../spprt/diab_obj/dtime.o  ../../spprt/diab_obj/cache.o
../../spprt/diab_obj/Support.o  ../../spprt/diab_obj/dinkusr.o
../../spprt/diab_obj/perfmon.o diab_obj/test_memmove.o
c:\BMS\vec_lib\libmotovec\libmotovec.a  -Y
P,c:/diab/5.0.3/PPCEH:c:/diab/5.0.3/PPCE/simple:c:/diab/5.0.3/PPCE:c:/diab/5.0.3/PPCEN
-lc -lm -o diabBM.elf

For Green Hills:
elxr -T../../spprt/ghs_dink.lnk ../../spprt/ghs_obj/ghs_crt0.o
../../spprt/ghs_obj/dtime.o  ../../spprt/ghs_obj/cache.o
../../spprt/ghs_obj/Support.o  ../../spprt/ghs_obj/dinkusr.o
../../spprt/ghs_obj/perfmon.o ghs_obj/test_memmove.o
c:\BMS\vec_lib\libmotovec\libmotovec.a  -Lc:\GHS\ppc36\ppc  -lansi -lsys
-larch -lind -o ghsBM.elf

For CodeWarrior:
mwldeppc -lcf ../../spprt/cw_dink.lcf -nostdlib -fp fmadd -proc 7450
../../spprt/cw_obj/cw_crt0.o  ../../spprt/cw_obj/dtime.o
../../spprt/cw_obj/cache.o  ../../spprt/cw_obj/Support.o
../../spprt/cw_obj/dinkusr.o  ../../spprt/cw_obj/perfmon.o
cw_obj/test_memmove.o c:\BMS\vec_lib\libmotovec\libmotovec.a
-Lc:/"Program Files"/Metrowerks/CodeWarrior/PowerPC_EABI_Support/Runtime/Lib/
-lRuntime.PPCEABI.H.a
-Lc:/"Program Files"/Metrowerks/CodeWarrior/PowerPC_EABI_Support/Msl/MSL_C/Ppc_eabi/Lib/
-lMSL_C.PPCEABI.bare.H.a -o cwBM.elf

For Metaware:
ldppc ../../spprt/mw_link.txt -Bnoheader -Bhardalign -dn -q -Qn
../../spprt/mw_obj/mw_crt0.o  ../../spprt/mw_obj/dtime.o
../../spprt/mw_obj/cache.o  ../../spprt/mw_obj/Support.o
../../spprt/mw_obj/dinkusr.o  ../../spprt/mw_obj/perfmon.o
mw_obj/test_memmove.o c:\BMS\vec_lib\libmotovec\libmotovec.a  -Y
P,c:/hcppc/lib/be/fp -lct -lmwt -o mwBM.elf


2. Enable AltiVec in the Machine State Register (MSR) of the
target machine.

Example:
AltiVec_enable:
	mfmsr	r4		// Get current MSR contents
	oris	r4,r4,0x0200	// Set the AltiVec enable bit MSR[6]
	mtmsr	r4		// Write to MSR
	isync			// Context synchronizing instr after mtmsr


3. If the AltiVec vector register set is used in more than one context,
the AltiVec registers must be saved and restored on context switches.  The
AltiVec EABI extensions define a register (SPR 256 - the VRSAVE register)
which can be used to reduce the number of vector registers which have to
be saved to only those in use.  This library is currently compiled
without that VRSAVE feature enabled, so all 32 vector registers will have
to be saved and restored.  We are currently thinking that this is a more
efficient practice anyway and note that Linux and several RTOSes are
taking that approach in saving and restoring the vector registers.  We
have observed very little performance difference in Linux for saving all
of the AltiVec registers on a context switch versus saving only 8.  And
saving all of the registers is a less than 1% total impact on performance.

4. There is one worrisome problem with this library when run on the
MPC745X microprocessors in the 60x bus mode.  The MPC7450 Family User's
Manual (Section 7.3) states that "The 60x bus protocol does not support a
16-byte bus transaction.  Therefore, cache-inhibited AltiVec loads, stores,
and write-through stores take an alignment exception.  This requires a
re-write of the alignment exception routines in software that supports
AltiVec quad word access in 60x bus mode on the MPC745X."

This says that if the user is attempting to use these routines in a
cache-inhibited area of memory on a MPC745X in 60x bus mode, it will
require special alignment exception handling software.  We are currently
implementing that software for the Linux OS.  Alternatively, the user can
restrict this library's use to areas of memory known to be cacheable.
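
Steps 2 and 3 both depend on PowerPC's bit numbering, in which bit 0 is
the most significant bit of a 32-bit register.  The sketch below uses
hypothetical helper names and is not part of the library.  It shows why
MSR[6] corresponds to the 0x0200 immediate used with oris in step 2, and
how a VRSAVE-style mask (bit n standing for vector register vn) could be
counted to see how many registers a selective save would touch.

```c
#include <assert.h>
#include <stdint.h>

/* Mask for bit `bit` of a 32-bit register in PowerPC numbering,
 * where bit 0 is the most significant bit and bit 31 the least. */
static uint32_t ppc_bit(unsigned bit)
{
    assert(bit < 32);
    return UINT32_C(1) << (31u - bit);
}

/* Number of vector registers marked in use by a VRSAVE-style mask.
 * Each loop pass clears the lowest set bit, so the pass count is the
 * number of set bits. */
static unsigned vrsave_live_count(uint32_t vrsave)
{
    unsigned count = 0;
    while (vrsave != 0) {
        vrsave &= vrsave - 1;
        count++;
    }
    return count;
}
```

ppc_bit(6) is 0x02000000, whose upper halfword is the 0x0200 that oris ORs
into the top half of r4.  A mask of 0xFF000000 marks v0 through v7, the
"saving only 8" case mentioned in step 3, while 0xFFFFFFFF marks all 32.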

This library was built using gcc, but as shown in the examples of step 1
above, links and executes with Diab 5.0, Green Hills 3.6, CodeWarrior
EPPC 6.1, and Metaware 4.5.  The gcc archiver was used to create it in the
following command lines:

powerpc-eabisim-gcc -c -s -fvec -mcpu=750 -mregnames   -I. -I./source
-I../../spprt -Ic:/cygwin/Altivec\powerpc-eabisim\include
-Ic:/cygwin/Altivec\lib\gcc-lib\powerpc-eabisim\gcc-2.95.2\include -o
gcc_obj/vec_memcpy.o -D__GNUC__  -DLIBMOTOVEC
../vec_memcpy/Source/vec_memcpy.S -o gcc_obj/vec_memcpy.o

powerpc-eabisim-gcc -c -s -fvec -mcpu=750 -mregnames   -I. -I./source
-I../../spprt -Ic:/cygwin/Altivec\powerpc-eabisim\include
-Ic:/cygwin/Altivec\lib\gcc-lib\powerpc-eabisim\gcc-2.95.2\include -o
gcc_obj/vec_memset.o -D__GNUC__  -DLIBMOTOVEC
../vec_memset/source/vec_memset.S -o gcc_obj/vec_memset.o

powerpc-eabisim-ar -ru libmotovec.a gcc_obj/vec_memcpy.o
gcc_obj/vec_memset.o
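
Once libmotovec.a is linked ahead of libc as in step 1, a small
correctness harness can confirm that the replacement memcpy and memset
behave like byte-by-byte loops at every alignment.  The following is a
minimal sketch, not part of the library; the function and buffer names are
hypothetical.  Building it without optimization is assumed, so that the
compiler does not replace the library calls or the reference loops with
its own inline code.

```c
#include <stddef.h>
#include <string.h>

#define MAX_LEN 300  /* longest buffer exercised */
#define MAX_OFF 16   /* try every offset within a quad word */
#define GUARD   32   /* bytes on each side that must stay untouched */
#define BUF_LEN (MAX_LEN + MAX_OFF + 2 * GUARD)

static unsigned char src_buf[BUF_LEN], dst_buf[BUF_LEN], ref_buf[BUF_LEN];

/* Returns 1 if dst_buf and ref_buf differ anywhere, guards included. */
static int buffers_differ(void)
{
    size_t i;
    for (i = 0; i < BUF_LEN; i++)
        if (dst_buf[i] != ref_buf[i])
            return 1;
    return 0;
}

/* Compare the linked memcpy and memset against byte loops for every
 * source offset, destination offset, and length.  Returns the number
 * of mismatching cases; 0 means every case agreed. */
int check_mem_functions(void)
{
    int failures = 0;
    size_t i, s_off, d_off, len;

    for (i = 0; i < BUF_LEN; i++)
        src_buf[i] = (unsigned char)(i * 7u + 1u);

    for (s_off = 0; s_off < MAX_OFF; s_off++)
        for (d_off = 0; d_off < MAX_OFF; d_off++)
            for (len = 0; len <= MAX_LEN; len++) {
                /* Same background pattern in both buffers. */
                for (i = 0; i < BUF_LEN; i++)
                    dst_buf[i] = ref_buf[i] = 0x5A;

                /* memcpy under test versus a reference byte loop. */
                for (i = 0; i < len; i++)
                    ref_buf[GUARD + d_off + i] = src_buf[GUARD + s_off + i];
                memcpy(dst_buf + GUARD + d_off, src_buf + GUARD + s_off, len);
                failures += buffers_differ();

                /* memset under test versus a reference byte loop. */
                for (i = 0; i < len; i++)
                    ref_buf[GUARD + d_off + i] = 0xC3;
                memset(dst_buf + GUARD + d_off, 0xC3, len);
                failures += buffers_differ();
            }
    return failures;
}
```

Calling check_mem_functions() from the test program's main and treating a
nonzero result as a failure gives a quick regression check across all
alignments and lengths up to 300 bytes.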


