Date: Sun, 12 Dec 2010 11:52:42 -0500 (EST)
From: Venkatesh Srinivas <vsrinivas@dragonflybsd.org>
To: freebsd-hackers@freebsd.org
Subject: amd64 pmap pagecopy() optimization()?
Message-ID: <alpine.LFD.2.00.1012121141320.25740@centaur.acm.jhu.edu>

Hi,

In svn r127653, a micro-optimized pagecopy() implementation was added to amd64's support.S. It prefetches the entire page first and then uses a partly-unrolled loop of loads and non-temporal stores. The commit notes 'it is roughly four times faster than bcopy() for uncached pages'.

Just wondering, how was this measured? I ported the routine to i386 and tried it out in userland, but found it between four and six times slower than the BSD and GNU libc bcopy() implementations; I admit I did not try very hard to measure on only uncached pages, though...

Also, why prefetch the entire page before the load / NT store loop? If I read the Intel optimization guide correctly, a loop of prefetch(n+1) / load / store would be a better call? (I tried this on i386 also; it was a bit faster than the current style, but still nowhere near bcopy()...)

Thanks!
-- vs
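
For concreteness, the two orderings discussed above can be sketched in userland C. This is not the code from support.S: it is a minimal illustration using SSE2 intrinsics, and the function names, the 64-byte cache-line size, the 4096-byte page size, and the NTA prefetch hint are all assumptions made for the example. Both functions expect 16-byte-aligned buffers.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <emmintrin.h>  /* SSE2 intrinsics; also pulls in _mm_prefetch and _mm_sfence */

#define PAGE_BYTES 4096  /* assumed page size */
#define LINE_BYTES 64    /* assumed cache-line size */

/* Copy one cache line with aligned loads and non-temporal stores. */
static void copy_line_nt(const char *s, char *d)
{
    for (size_t i = 0; i < LINE_BYTES; i += 16) {
        __m128i v = _mm_load_si128((const __m128i *)(s + i));
        _mm_stream_si128((__m128i *)(d + i), v);
    }
}

/* Style 1 (hypothetical name): prefetch the whole page, then copy it. */
void pagecopy_prefetch_first(const void *src, void *dst)
{
    const char *s = (const char *)src;
    char *d = (char *)dst;

    assert((uintptr_t)s % 16 == 0 && (uintptr_t)d % 16 == 0);

    for (size_t off = 0; off < PAGE_BYTES; off += LINE_BYTES)
        _mm_prefetch(s + off, _MM_HINT_NTA);
    for (size_t off = 0; off < PAGE_BYTES; off += LINE_BYTES)
        copy_line_nt(s + off, d + off);
    _mm_sfence();  /* order the non-temporal stores before returning */
}

/* Style 2 (hypothetical name): prefetch line n+1 while copying line n. */
void pagecopy_interleaved(const void *src, void *dst)
{
    const char *s = (const char *)src;
    char *d = (char *)dst;

    assert((uintptr_t)s % 16 == 0 && (uintptr_t)d % 16 == 0);

    for (size_t off = 0; off < PAGE_BYTES; off += LINE_BYTES) {
        if (off + LINE_BYTES < PAGE_BYTES)
            _mm_prefetch(s + off + LINE_BYTES, _MM_HINT_NTA);
        copy_line_nt(s + off, d + off);
    }
    _mm_sfence();
}
```

On i386 this needs SSE2 enabled at compile time (for example `-msse2` with GCC or Clang). Any timing comparison against bcopy() depends heavily on whether the source page is already cached, which is the measurement question raised above.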