Date:      Fri, 11 May 2018 03:31:46 +1000 (EST)
From:      Bruce Evans <brde@optusnet.com.au>
To:        Konstantin Belousov <kib@freebsd.org>
Cc:        src-committers@freebsd.org, svn-src-all@freebsd.org,  svn-src-head@freebsd.org
Subject:   Re: svn commit: r333461 - head/sys/amd64/amd64
Message-ID:  <20180511012309.V3949@besplex.bde.org>
In-Reply-To: <201805101501.w4AF1iI0039082@repo.freebsd.org>
References:  <201805101501.w4AF1iI0039082@repo.freebsd.org>

On Thu, 10 May 2018, Konstantin Belousov wrote:

> Log:
>  Make fpusave() and fpurestore() on amd64 ifuncs.
>
>  From now on, linking amd64 kernel requires either lld or newer ld.bfd.

This breaks building with gcc:

XX cc1: warnings being treated as errors
XX ../../../amd64/amd64/fpu.c:195: warning: 'ifunc' attribute directive ignored [-Wattributes]
XX ../../../amd64/amd64/fpu.c:195: warning: redundant redeclaration of 'fpusave' [-Wredundant-decls]
XX ./machine/fpu.h:64: warning: previous declaration of 'fpusave' was here
XX ../../../amd64/amd64/fpu.c:202: warning: 'ifunc' attribute directive ignored [-Wattributes]
XX ../../../amd64/amd64/fpu.c:202: warning: redundant redeclaration of 'fpurestore' [-Wredundant-decls]
XX ./machine/fpu.h:62: warning: previous declaration of 'fpurestore' was here
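
For reference, the construct that old gcc does not understand is roughly
the following -- a minimal sketch of the gcc/clang ifunc attribute, not
the actual fpu.c code (which goes through a wrapper macro), with
"use_xsave" standing in for the real CPU feature test:

XX /*
XX  * Minimal sketch of an ifunc.  The resolver runs once, when the ifunc
XX  * relocation is processed, and its result is patched into the jump
XX  * slot, so later calls to fpusave() go straight to the chosen
XX  * implementation.
XX  */
XX static int use_xsave = 1;	/* stand-in for the real feature test */
XX
XX static void
XX fpusave_fxsave(void *addr)
XX {
XX 	/* fxsave-based implementation (placeholder) */
XX }
XX
XX static void
XX fpusave_xsave(void *addr)
XX {
XX 	/* xsave-based implementation (placeholder) */
XX }
XX
XX static void (*fpusave_resolver(void))(void *)
XX {
XX 	return (use_xsave ? fpusave_xsave : fpusave_fxsave);
XX }
XX
XX void fpusave(void *addr) __attribute__((ifunc("fpusave_resolver")));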

After building fpu.o with clang, linking doesn't require a newer ld, but
booting does -- after linking with an older ld, the boot panics with a
page fault near fork().

I used the current as and ld with old gcc until early this year.  That
combination is a bit fragile, and it regressed with lld.  The regression
was minor -- lld expanded the bss into the data section, giving a few MB
of bloat.  Now the bloat relative to old ld is only 29K, in the text
section.

Debugging and presumably profiling is also broken:

XX db> x/ia fpususpend,20
XX fpususpend:     pushq   %rbp
XX fpususpend+0x1: movq    %rsp,%rbp
XX fpususpend+0x4: pushq   %rbx
XX fpususpend+0x5: pushq   %rax
XX fpususpend+0x6: movl    %cr0,%ebx
XX fpususpend+0x9: clts
XX fpususpend+0xb: call    0xffffffff80299960
XX fpususpend+0x10:        movl    %ebx,%cr0
XX fpususpend+0x13:        addq    $0x8,%rsp
XX fpususpend+0x17:        popq    %rbx
XX fpususpend+0x18:        popq    %rbp
XX fpususpend+0x19:        ret
XX db> x/ia 0xffffffff80299960
XX 0xffffffff80299960:     jmpl    *0x56769a(%rip)

ddb still doesn't understand pc-relative offsets.  Decoding this by hand
with "p/x 0xffffffff80299960+6+0x56769a" gives brwsection+0x1000.  At this
address is a pointer to fpusave_xsave, since the hardware supports xsave.
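
Spelled out, the operand address is the address of the next instruction
plus the displacement.  A throwaway sketch of the arithmetic (the 6 is
the length of the opcode + modrm + disp32 encoding):

XX #include <stdio.h>
XX #include <stdint.h>
XX
XX int
XX main(void)
XX {
XX 	uint64_t insn = 0xffffffff80299960ULL;	/* address of the jmpl */
XX 	uint64_t len  = 6;			/* opcode + modrm + disp32 */
XX 	uint64_t disp = 0x56769a;		/* from the disassembly */
XX
XX 	/* %rip in the operand means the address of the next instruction. */
XX 	printf("pointer slot = %#jx\n", (uintmax_t)(insn + len + disp));
XX 	return (0);
XX }

This prints 0xffffffff80801000, i.e., the brwsection+0x1000 above.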

This looks like a small or null pessimization.  The branch in the old
version is probably faster.  The following simple test on Haswell shows
that all reasonable versions have almost the same speed when cached:

XX #include <sys/cdefs.h>
XX 
XX #ifndef FUNC
XX #define	FUNC	cfoo
XX #endif
XX 
XX int FUNC(void);
XX 
XX int asmifunctargfoo(void);
XX int (*ifuncaddr)(void) = asmifunctargfoo;
XX int xsave = 1;
XX 
XX int
XX cfoo(void)
XX {
XX 	return (xsave != 0 ? 2 : 1);
XX }
XX 
XX __asm(".text; asmbranchfoo: cmpl $0,xsave(%rip); je 1f; movl $2,%eax; ret; 1: movl $1,%eax; ret");
XX __asm(".text; asmnobranchfoo: cmpl $0,xsave(%rip); setne %al; movzbl %al,%eax; ret");
XX __asm(".text; asmifuncfoo: jmp *ifuncaddr(%rip)");
XX __asm(".text; asmifunctargfoo: movl $2,%eax; ret");	/* for xsave != 0 */
XX 
XX int
XX main(void)
XX {
XX 	volatile int i, res;
XX 
XX 	res = 0;
XX 	for (i = 0; i < 1000000000; i++)
XX 		res += FUNC();
XX 	return (0);
XX }

Compile this with gcc -o foo foo.c -O -DFUNC={cfoo,asmbranchfoo,asmnobranchfoo,
asmifuncfoo}.

cfoo, asmbranchfoo and asmifuncfoo take 1.71-1.72 seconds (~7 cycles at
4.08 GHz) on Haswell.  Only asmnobranchfoo is slower -- it takes about
1.78 seconds.  That is a whole quarter of a cycle longer (7.25 cycles).
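
(Spelling out the cycle arithmetic: 1e9 calls in 1.715 seconds at
4.08 GHz is 1.715 * 4.08 ~= 7.0 cycles per call; 1.78 seconds works out
to the ~7.25 above.)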

The uncached case is hard to test.  This loop runs in parallel with
itself as far as possible, so the main limit is the number of
dependencies, and branch-prediction latency might not matter provided it
is less than the loop time of ~7 cycles.  In normal use, probably nothing
stays cached because fpusave() is rarely called.  This also means that
any optimization here is in the noise.

I think the direct branch can be statically predicted better than the
indirect branch, and with (excessive) __predict_* annotations it is easy
to arrange for it to be predicted correctly on some hardware.
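
E.g., something like this (a sketch only; the flag and function names
are stand-ins for whatever the real ones are):

XX #include <sys/cdefs.h>
XX
XX /* Stand-ins for the real feature flag and save routines. */
XX extern int use_xsave;
XX void fpusave_xsave(void *addr);
XX void fpusave_fxsave(void *addr);
XX
XX /*
XX  * Old-style direct dispatch with an explicit hint.  __predict_true()
XX  * makes the compiler lay out the likely path as the fall-through,
XX  * which is also what forward-not-taken static predictors assume, so
XX  * there is no dependence on a resolved pointer or an indirect-branch
XX  * predictor.
XX  */
XX static __inline void
XX fpusave_dispatch(void *addr)
XX {
XX 	if (__predict_true(use_xsave))
XX 		fpusave_xsave(addr);
XX 	else
XX 		fpusave_fxsave(addr);
XX }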

Bruce
