Date:      Sun, 12 Dec 2010 11:52:42 -0500 (EST)
From:      Venkatesh Srinivas <vsrinivas@dragonflybsd.org>
To:        freebsd-hackers@freebsd.org
Subject:   amd64 pmap pagecopy() optimization()?
Message-ID:  <alpine.LFD.2.00.1012121141320.25740@centaur.acm.jhu.edu>

Hi,

In svn r127653, a micro-optimized pagecopy() implementation was added to
amd64's support.S. This pagecopy() first prefetches the entire page and
then copies it with a partly-unrolled loop of loads and non-temporal
stores. The commit notes 'it is roughly four times faster than bcopy() for
uncached pages'.
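
For reference, the structure described above might look roughly like the
following in C. This is only a sketch and not the code from support.S: it
uses SSE2 intrinsics with 16-byte movntdq stores rather than whatever the
assembly routine uses, the function name is made up, and the 4096-byte page
and 64-byte cache line sizes are assumptions. Without SSE2 it falls back to
memcpy().

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>
#if defined(__SSE2__)
#include <emmintrin.h>	/* SSE2; also provides _mm_prefetch and _mm_sfence */
#endif

#define SKETCH_PAGE_SIZE 4096	/* assumed page size */
#define SKETCH_LINE_SIZE 64	/* assumed cache line size */

/* Hypothetical name. Copies one page; 'to' must be 16-byte aligned. */
static void
pagecopy_prefetch_first(const void *from, void *to)
{
#if defined(__SSE2__)
	const char *s = from;
	char *d = to;
	size_t off;

	assert(((uintptr_t)to & 15) == 0);	/* movntdq needs alignment */

	/* Pass 1: issue a prefetch for every cache line of the source. */
	for (off = 0; off < SKETCH_PAGE_SIZE; off += SKETCH_LINE_SIZE)
		_mm_prefetch(s + off, _MM_HINT_NTA);

	/* Pass 2: partly-unrolled copy, one cache line (4 x 16 bytes) per trip. */
	for (off = 0; off < SKETCH_PAGE_SIZE; off += SKETCH_LINE_SIZE) {
		__m128i a = _mm_loadu_si128((const __m128i *)(s + off));
		__m128i b = _mm_loadu_si128((const __m128i *)(s + off + 16));
		__m128i c = _mm_loadu_si128((const __m128i *)(s + off + 32));
		__m128i e = _mm_loadu_si128((const __m128i *)(s + off + 48));

		/* Non-temporal stores bypass the cache on the way out. */
		_mm_stream_si128((__m128i *)(d + off), a);
		_mm_stream_si128((__m128i *)(d + off + 16), b);
		_mm_stream_si128((__m128i *)(d + off + 32), c);
		_mm_stream_si128((__m128i *)(d + off + 48), e);
	}

	/* Make the weakly-ordered NT stores visible before returning. */
	_mm_sfence();
#else
	/* Assumption: no SSE2 available, so just do a plain copy. */
	memcpy(to, from, SKETCH_PAGE_SIZE);
#endif
}
```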

Just wondering, how was this measured? I ported the routine to i386 and
tried it out in userland, but found it between four and six times slower
than the BSD and GNU libc bcopy() implementations. I admit I did not try
very hard to restrict the measurement to uncached pages, though...
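
One way a userland measurement could be restricted to uncached pages is to
flush both buffers from the cache before timing. The harness below is a
rough sketch under several assumptions: all names are hypothetical,
_mm_clflush is used for the flush (SSE2 only, otherwise nothing is flushed),
timespec_get() from C11 serves as the timer, and the page and line sizes
are assumed to be 4096 and 64 bytes. It does not account for hardware
prefetching across page boundaries or for timer resolution.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>
#include <time.h>
#if defined(__SSE2__)
#include <emmintrin.h>	/* _mm_clflush, _mm_mfence */
#endif

#define BENCH_PAGE_SIZE 4096	/* assumed page size */
#define BENCH_LINE_SIZE 64	/* assumed cache line size */

typedef void (*pagecopy_fn)(const void *from, void *to);

/* Baseline to compare against: a plain libc copy of one page. */
static void
memcpy_page(const void *from, void *to)
{
	memcpy(to, from, BENCH_PAGE_SIZE);
}

/* Evict every cache line in [p, p + len) so the next access misses. */
static void
flush_range(const void *p, size_t len)
{
#if defined(__SSE2__)
	const char *c = p;
	size_t off;

	for (off = 0; off < len; off += BENCH_LINE_SIZE)
		_mm_clflush(c + off);
	_mm_mfence();
#else
	(void)p;
	(void)len;
#endif
}

/* Returns the average time per page in nanoseconds over npages copies. */
static double
time_uncached_copy(pagecopy_fn copy, const char *src, char *dst,
    size_t npages)
{
	struct timespec t0, t1;
	size_t i;

	assert(npages > 0);
	flush_range(src, npages * BENCH_PAGE_SIZE);
	flush_range(dst, npages * BENCH_PAGE_SIZE);

	timespec_get(&t0, TIME_UTC);
	for (i = 0; i < npages; i++)
		copy(src + i * BENCH_PAGE_SIZE, dst + i * BENCH_PAGE_SIZE);
	timespec_get(&t1, TIME_UTC);

	return ((double)(t1.tv_sec - t0.tv_sec) * 1e9 +
	    (double)(t1.tv_nsec - t0.tv_nsec)) / (double)npages;
}
```

Timing many pages per call keeps the timer overhead from dominating a
single 4 KB copy; passing memcpy_page and the routine under test in turn
would give the two numbers to compare.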

Also, why prefetch the entire page before the load / NT store loop? If I
read the Intel optimization guide correctly, a loop of
prefetch(n+1) / load / store would be the better choice. (I tried this on
i386 as well; it was a bit faster than the current style, but still nowhere
near bcopy()...)
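
The prefetch(n+1) / load / store shape mentioned above could be sketched in
C as follows. As before, this is an illustration and not the routine that
was actually tested: the function name is hypothetical, SSE2 intrinsics
stand in for hand-written assembly, a prefetch distance of one cache line
is assumed, and the page and line sizes are assumed to be 4096 and 64
bytes.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>
#if defined(__SSE2__)
#include <emmintrin.h>	/* SSE2; also provides _mm_prefetch and _mm_sfence */
#endif

#define SKETCH_PAGE_SIZE 4096	/* assumed page size */
#define SKETCH_LINE_SIZE 64	/* assumed cache line size */

/* Hypothetical name. Copies one page; 'to' must be 16-byte aligned. */
static void
pagecopy_interleaved(const void *from, void *to)
{
#if defined(__SSE2__)
	const char *s = from;
	char *d = to;
	size_t off;

	assert(((uintptr_t)to & 15) == 0);	/* movntdq needs alignment */

	for (off = 0; off < SKETCH_PAGE_SIZE; off += SKETCH_LINE_SIZE) {
		__m128i a, b, c, e;

		/* Prefetch line n+1 while line n is copied; stay in the page. */
		if (off + SKETCH_LINE_SIZE < SKETCH_PAGE_SIZE)
			_mm_prefetch(s + off + SKETCH_LINE_SIZE, _MM_HINT_NTA);

		/* Load the current cache line (4 x 16 bytes). */
		a = _mm_loadu_si128((const __m128i *)(s + off));
		b = _mm_loadu_si128((const __m128i *)(s + off + 16));
		c = _mm_loadu_si128((const __m128i *)(s + off + 32));
		e = _mm_loadu_si128((const __m128i *)(s + off + 48));

		/* Write it out with non-temporal stores. */
		_mm_stream_si128((__m128i *)(d + off), a);
		_mm_stream_si128((__m128i *)(d + off + 16), b);
		_mm_stream_si128((__m128i *)(d + off + 32), c);
		_mm_stream_si128((__m128i *)(d + off + 48), e);
	}

	/* Make the weakly-ordered NT stores visible before returning. */
	_mm_sfence();
#else
	/* Assumption: no SSE2 available, so just do a plain copy. */
	memcpy(to, from, SKETCH_PAGE_SIZE);
#endif
}
```

A larger prefetch distance (several lines ahead instead of one) may be
needed to hide memory latency; the right value depends on the CPU.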

Thanks!
-- vs


