Date: Tue, 31 Mar 2015 15:49:28 +1100 (EST) From: Bruce Evans <brde@optusnet.com.au> To: Konstantin Belousov <kostikbel@gmail.com> Cc: svn-src-head@freebsd.org, svn-src-all@freebsd.org, src-committers@freebsd.org, Bruce Evans <brde@optusnet.com.au>, John Baldwin <jhb@freebsd.org> Subject: Re: svn commit: r280279 - head/sys/sys Message-ID: <20150331144411.R3908@besplex.bde.org> In-Reply-To: <20150330172434.GG2379@kib.kiev.ua> References: <201503201027.t2KAR6Ze053047@svn.freebsd.org> <20150322080015.O955@besplex.bde.org> <20150322093251.GY2379@kib.kiev.ua> <2526359.g5B2nXdKeQ@ralph.baldwin.cx> <20150330172434.GG2379@kib.kiev.ua>
next in thread | previous in thread | raw e-mail | index | archive | help
On Mon, 30 Mar 2015, Konstantin Belousov wrote: > On Mon, Mar 30, 2015 at 11:57:08AM -0400, John Baldwin wrote: >> On Sunday, March 22, 2015 11:32:51 AM Konstantin Belousov wrote: >>> ... >>> So anybody has to compile his own kernel to get popcnt optimization ? >>> We do care about trivial things that improve time. I'm not sure we do. The microoptimizations give speed improvements in the 1-10% range, while general bloat gives speed unimprovements in the 10-500% range. > I updated the pmap patch, see the end of the message. Seems OK, except for verbose names. >>> BTW, I have the following WIP change, which popcnt xorl is a piece of. >>> It emulates the ifuncs with some preprocessing mess. It is much better >>> than runtime patching, and is a prerequisite to properly support more >>> things, like SMAP. I did not published it earlier, since I wanted to >>> convert TLB flush code to this. >> >> This looks fine to me. It seems to be manually converting certain symbols >> to use a dynamic lookup that must be explicitly resolved before first >> use? It looks a bit overengineered to me. A bit like my function pointers for the bcopy() family on i386. bcopy() is a bulk operation, so in theory you can do it much faster by selecting the best available version at runtime. In practice, the gains were not large and are too machine-dependent to maintain. It is even harder to get large gains and maintain them by selecting individual instructions at runtime. > diff --git a/sys/amd64/amd64/pmap.c b/sys/amd64/amd64/pmap.c > index 6a4077c..fcfba56 100644 > --- a/sys/amd64/amd64/pmap.c > +++ b/sys/amd64/amd64/pmap.c > @@ -412,7 +416,7 @@ static caddr_t crashdumpmap; > static void free_pv_chunk(struct pv_chunk *pc); > static void free_pv_entry(pmap_t pmap, pv_entry_t pv); > static pv_entry_t get_pv_entry(pmap_t pmap, struct rwlock **lockp); > -static int popcnt_pc_map_elem(uint64_t elem); > +static int popcnt_pc_map_elem_pq(uint64_t elem); popcnt_pc_map_elem_pq() is the most verbose name used here. It is not necessary to encrypt a function's man page in its name. > @@ -2980,20 +3002,27 @@ retry: > > /* > * Returns the number of one bits within the given PV chunk map element. > + * > + * The erratas for Intel processors state that "POPCNT Instruction May > + * Take Longer to Execute Than Expected". It is believed that the > + * issue is the spurious dependency on the destination register. > + * Provide a hint to the register rename logic that the destination > + * value is overwritten, by clearing it, as suggested in the > + * optimization manual. It should be cheap for unaffected processors > + * as well. > + * > + * Reference numbers for erratas are > + * 4th Gen Core: HSD146 > + * 5th Gen Core: BDM85 > */ > static int > -popcnt_pc_map_elem(uint64_t elem) > +popcnt_pc_map_elem_pq(uint64_t elem) Here the function has a specialized use, but internally it is just the popcntq() cpufunc, except that doesn't have the xorl workaround or the above documentation for the workaround. > { > - int count; > + u_long result; > > - /* > - * This simple method of counting the one bits performs well because > - * the given element typically contains more zero bits than one bits. > - */ > - count = 0; > - for (; elem != 0; elem &= elem - 1) > - count++; > - return (count); > + __asm __volatile("xorl %k0,%k0;popcntq %1,%0" > + : "=&r" (result) : "rm" (elem)); > + return (result); > } Style: - there should be a space after ";" in the asm. - loading 0 should be in the operands. Then you don't need the "k" modification. It should be able to load 0 using xorl iff that is best for the current arch. Perhaps a register is already 0, and it is best to do a null load so as to use that register. The patch is missing removal of the popcntq() cpufunc. This function was only used here, and is now unused. cpufunc.h is supposed to contain only wrappers for single instructions, to be combined if necessary to create larger functions. popcntq() follows this rule (some other functions are broken). popcntq() with the load of 0 wouldn't satisify it strictly, but there is no other way to fix it since register allocation is unavailable across inline asms. Actually, if you move the load to the operands, it would follow the rule strictly enough -- many of the asms need more than a single instruction to load things. Bruce
Want to link to this message? Use this URL: <https://mail-archive.FreeBSD.org/cgi/mid.cgi?20150331144411.R3908>