Date: Fri, 1 Jun 2018 03:15:58 +1000 (EST)
From: Bruce Evans <brde@optusnet.com.au>
To: Mateusz Guzik <mjguzik@gmail.com>
Cc: Bruce Evans <brde@optusnet.com.au>, Mateusz Guzik <mjg@freebsd.org>, src-committers <src-committers@freebsd.org>, svn-src-all@freebsd.org, svn-src-head@freebsd.org
Subject: Re: svn commit: r334419 - head/sys/amd64/amd64
Message-ID: <20180601014718.D3606@besplex.bde.org>
In-Reply-To: <CAGudoHEOCxC6PSFSyQu6aPRUBEJ5Hp5TBD68TUpaxEy_14PhAQ@mail.gmail.com>
References: <201805310956.w4V9u2rL084194@repo.freebsd.org> <20180531201225.L2478@besplex.bde.org> <CAGudoHEOCxC6PSFSyQu6aPRUBEJ5Hp5TBD68TUpaxEy_14PhAQ@mail.gmail.com>
On Thu, 31 May 2018, Mateusz Guzik wrote:

> On Thu, May 31, 2018 at 09:19:58PM +1000, Bruce Evans wrote:
>> On Thu, 31 May 2018, Mateusz Guzik wrote:
>>
>>> Log:
>>>   amd64: switch pagecopy from non-temporal stores to rep movsq
>>
>> As for pagezero, this pessimizes for machines with slow movsq and/or
>> caches (mostly older machines).
>
> Can you give examples of such machines? I tested with old yellers like
> Nehalem and Westmere, no loss.

Original Athlon64, and Turion2 on a 2006 laptop. I already mentioned
Turion64, and my commit to fix the loss of nontemporal pagezero on amd64
gives timing info for both in a mixed-up way (only the Athlon has PC3200).

sse2_pagezero was actually connected at the time, but only to idlezero,
and that was removed soon after. Nontemporal stores are clearly best for
idlezero, but doing anything in idle is not so good since it might waste
power or steal resources from a shared core or increase latency... It was
good on the Turion2 in 2007. Turion2 doesn't have a shared core or many
Cx states, so it uses almost as much power zeroing pages as idling.

>>> The copied data is accessed in part soon after and it results with
>>> additional cache misses during a -j 1 buildkernel WITHOUT_CTF=yes
>>> KERNFAST=1, as measured with pmc stat.
>>
>> Of course it causes more cache misses later, but for large data going
>> through slow caches is much slower so the cache misses later cost less.
>
> The note was predominantly for people who would want to defend nt stores
> claiming it prevents evicting cached data by data being copied and then
> mostly not accessed.

I read it more carefully and can interpret it to say the opposite of what
you want. Since a new system gets no benefit in real time, the only
significant differences are probably tiny power savings on new systems and
slower runtimes on older systems. However, I saw tiny improvements in real
time for makeworld with pagecopy = bcopy on Haswell.
Well below 1%, while improvements for pagezero = bzero were closer to 1%.
I now have better statistics generation and analysis and recently spent a
lot of time trying to verify scheduler improvements of about 1%.

>> It is negatively useful to write this in asm. This is now just memcpy()
>> and the asm version of that is fast enough, though movsq takes too long
>> to start up. This memcpy() might be inlined and then it would be
>> insignificantly faster than the function call. __builtin_memcpy() won't
>> actually inline it, since its size is large and compilers know that they
>> don't understand memory.
>
> It is true that currently it can be the current memcpy with almost no loss.
>
> However, even on a kernel with #define memcpy __builtin_memcpy, there
> are plenty of calls with very small sizes. See the list here (taken
> during buildkernel):
>
> https://people.freebsd.org/~mjg/bufsizes.txt
>
> In particular you can find a lot of < 64 entries.

But pagecopy is 4K. That is still too small to amortize string instruction
overhead for Haswell in the cached case -- see my old mail -- but not much
is to be gained by using a specialized version since the cached case is
very fast.

> Spinning up rep stosb for such sizes even with ERMS turns out to be
> pessimal even on Skylake. In other words, the primitive will need to get
> special casing for small-sized callers. Known big-size callers should be
> moved to something else. As such, pointing pagecopy at the primitive is
> imo a bad idea.

That is with most current implementations of ERMS. I expect the startup
overhead will be small after a couple more generations of CPUs. Then
optimizations to not use string instructions will be as silly as 30-year
old optimizations to use them. Or my 20 year old optimizations to use the
FPU for bcopy, bzero, copyin and copyout, but not pagezero or pagecopy.
This optimization was good for just 1 generation of CPUs (Pentium1).
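
For readers following the size argument above, here is a minimal sketch in portable C of what "special casing for small-sized callers" could look like. It is not the FreeBSD implementation: the name copy_dispatch and the 64-byte cutoff are hypothetical, chosen only because the quoted size list mentions many calls below 64 bytes. The large-size path simply defers to memcpy() as a stand-in for whatever bulk primitive (string instructions or otherwise) a kernel would pick.

```c
#include <stddef.h>
#include <string.h>

/* Hypothetical cutoff; the thread only says many calls are < 64 bytes. */
#define SMALL_COPY_MAX 64

/*
 * Hypothetical helper: copy small buffers with a plain byte loop so that
 * no string-instruction startup cost is paid, and hand everything else
 * to memcpy() as a stand-in for the bulk copy primitive.
 */
void *
copy_dispatch(void *dst, const void *src, size_t len)
{
	unsigned char *d = dst;
	const unsigned char *s = src;

	if (len < SMALL_COPY_MAX) {
		while (len-- > 0)
			*d++ = *s++;
		return (dst);
	}
	return (memcpy(dst, src, len));
}
```

A known 4K caller such as pagecopy would always take the second branch, which is the argument for giving such callers their own routine instead of paying for the size test.
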
i386 still has a silly 20 year old i686_pagezero which is still used on
all i386's that don't have SSE2 (not many of these now). This would have
been good for just 1 or 2 generations of CPUs (PentiumPro and maybe
Celeron) if it were written correctly. It is intended to avoid writing
zeros to cache lines that are already zero, as was good on PentiumPro.
But it actually zeros almost everything after finding a nonzero byte.
Thus it is a pessimization even on PentiumPro unless many pages passed to
it are already all zero.

Bruce
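
As an illustration of the intended behavior described above (store zeros only to cache lines that are not already zero), here is a rough sketch in portable C. It is not the i386 assembly routine and does not reproduce its bug. The function name pagezero_skip_clean and the 32-byte line size are assumptions made for this example; the 4K page size is the one mentioned in the thread.

```c
#include <stddef.h>
#include <string.h>

#define PAGE_BYTES 4096	/* 4K page, as in the thread */
#define LINE_BYTES 32	/* assumed cache line size, illustrative only */

/*
 * Hypothetical helper: zero a page, but only store to lines that contain
 * at least one nonzero byte. Returns the number of lines written so the
 * effect can be observed.
 */
size_t
pagezero_skip_clean(void *page)
{
	unsigned char *p = page;
	size_t written = 0;

	for (size_t off = 0; off < PAGE_BYTES; off += LINE_BYTES) {
		int dirty = 0;

		/* Read-only scan of this line. */
		for (size_t i = 0; i < LINE_BYTES; i++) {
			if (p[off + i] != 0) {
				dirty = 1;
				break;
			}
		}
		/* Store only when the line is not already zero. */
		if (dirty) {
			memset(p + off, 0, LINE_BYTES);
			written++;
		}
	}
	return (written);
}
```

The point of the per-line decision is that one nonzero byte costs one line of stores. Zeroing the whole remainder of the page after the first nonzero byte gives up that benefit.
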