From: Bruce Evans
Date: Tue, 22 Jul 2014 17:16:08 +1000 (EST)
To: Mark Linimon
Cc: arch@freebsd.org, freebsd-arm, Ian Lepore, Bruce Evans
Subject: Re: [CFR] mge driver / elf reloc
Message-ID: <20140722161501.K865@besplex.bde.org>
In-Reply-To: <20140721233106.GA17346@lonesome.com>

On Mon, 21 Jul 2014, Mark Linimon wrote:

> On Tue, Jul 22, 2014 at 03:53:10AM +1000, Bruce Evans wrote:
>> This is with gcc.  clang doesn't work on ia64 and/or pluto.
>
> Since Marcel has dropped support for ia64, and in fact removed ia64-
> specific code from -HEAD, I'm not sure how much good this analysis
> will accomplish :-)

ia64/pluto is just an example of an arch with strict alignment
requirements.  clang is broken for it so I could only test with gcc.

This analysis applies to all non-x86 arches in FreeBSD cluster machines,
since there aren't many others and the only other one (sparc64/flame)
also has strict alignment requirements.  clang is broken on it too, so
I could only test with gcc:

% #include
%
% struct foo {
%     int x;
% } __packed;
%
% struct foo x;
%
% static __inline uint32_t
% xle32dec(const void *_p)
% {
%     uint32_t _t;
%
%     __builtin_memcpy(&_t, _p, sizeof(_t));
%     return (_t);
% }
%
% static __inline void
% xle32enc(void *_p, uint32_t _u)
% {
%
%     __builtin_memcpy(_p, &_u, sizeof(_u));
% }
%
% uint32_t
% q(void)
% {
%     return xle32dec(&x);
% }
%
% void
% r(void)
% {
%     return xle32enc(&x, 1);
% }

This tests the memcpy versions.
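
The effect of __packed on the test struct can be checked directly. What follows is only a sketch, not part of the original test: the names foo_packed, foo_plain, packed_align and plain_align are hypothetical, and it assumes gcc or clang, where __packed expands to __attribute__((__packed__)). Packing drops the alignment the compiler may assume for the struct to 1, so on a strict-alignment arch it can no longer use a single word access.

```c
#include <stddef.h>

/* Hypothetical stand-ins for the two variants of struct foo. */
struct foo_packed {
	int x;
} __attribute__((__packed__));	/* what __packed expands to (gcc/clang) */

struct foo_plain {
	int x;
};

/* Alignment the compiler may assume for the packed variant. */
size_t
packed_align(void)
{
	return (_Alignof(struct foo_packed));
}

/* Alignment the compiler may assume for the unpacked variant. */
size_t
plain_align(void)
{
	return (_Alignof(struct foo_plain));
}
```

With these compilers packed_align() should return 1 and plain_align() the natural alignment of int. The size is the same in both cases, and these values are what the size and alignment fields of a `.common x,4,1` or `.common x,4,4` directive record.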
__packed gives the expected mess:

%
%     .file "z.c"
%     .section ".text"
%     .align 4
%     .align 32
%     .global r
%     .type r, #function
%     .proc 020
% r:
%     .register %g2, #scratch
%     .register %g3, #scratch
%     add %sp, -208, %sp
%     mov 1, %g1
%     st %g1, [%sp+2235]
%     sethi %hi(x), %g2
%     or %g2, %lo(x), %g3
%     ldub [%sp+2235], %g1
%     stb %g1, [%g2+%lo(x)]
%     ldub [%sp+2236], %g1
%     stb %g1, [%g3+1]
%     ldub [%sp+2237], %g1
%     stb %g1, [%g3+2]
%     ldub [%sp+2238], %g1
%     stb %g1, [%g3+3]
%     jmp %o7+8
%     sub %sp, -208, %sp
%     .size r, .-r
%     .align 4
%     .align 32
%     .global q
%     .type q, #function
%     .proc 016
% q:
%     add %sp, -208, %sp
%     sethi %hi(x), %g1
%     or %g1, %lo(x), %g2
%     ldub [%g1+%lo(x)], %g1
%     stb %g1, [%sp+2235]
%     ldub [%g2+1], %g1
%     stb %g1, [%sp+2236]
%     ldub [%g2+2], %g1
%     stb %g1, [%sp+2237]
%     ldub [%g2+3], %g1
%     stb %g1, [%sp+2238]
%     lduw [%sp+2235], %o0
%     jmp %o7+8
%     sub %sp, -208, %sp
%     .size q, .-q
%     .common x,4,1
%     .ident "GCC: (GNU) 4.2.1 20070831 patched [FreeBSD]"

I think both functions copy the memory bytewise (4+4 memory references)
and do 1 load of the final copy or 1 store to the temporary copy.  So
the memcpy is not virtual, and the memcpy versions might be worse than
the -current versions which should use 4+1 memory references plus lots
of shifts and masks on registers.  Register operations are faster but
there are many more of them.

Removing __packed gives the expected direct accesses:

%     .file "z.c"
%     .section ".text"
%     .align 4
%     .align 32
%     .global r
%     .type r, #function
%     .proc 020
% r:
%     .register %g2, #scratch
%     add %sp, -208, %sp
%     mov 1, %g2
%     sethi %hi(x), %g1
%     st %g2, [%g1+%lo(x)]
%     jmp %o7+8
%     sub %sp, -208, %sp
%     .size r, .-r
%     .align 4
%     .align 32
%     .global q
%     .type q, #function
%     .proc 016
% q:
%     add %sp, -208, %sp
%     sethi %hi(x), %g1
%     lduw [%g1+%lo(x)], %o0
%     jmp %o7+8
%     sub %sp, -208, %sp
%     .size q, .-q
%     .common x,4,4
%     .ident "GCC: (GNU) 4.2.1 20070831 patched [FreeBSD]"

The memcpy's seem to be virtual now.
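
For comparison with the memcpy versions tested above, the shift-and-mask style of decoder can be sketched roughly as follows. This is an illustration only and not the exact -current source: the names le32dec_shift, load32_memcpy and store32_memcpy are hypothetical. Neither style assumes alignment, but the shift version always decodes little-endian while the memcpy version returns the bytes in host order, so the two agree only on a little-endian host.

```c
#include <stdint.h>
#include <string.h>

/*
 * Byte-wise decode with shifts: 4 byte loads plus register operations,
 * no alignment assumption, result is always the little-endian value.
 */
uint32_t
le32dec_shift(const void *pp)
{
	const uint8_t *p = (const uint8_t *)pp;

	return (((uint32_t)p[3] << 24) | ((uint32_t)p[2] << 16) |
	    ((uint32_t)p[1] << 8) | (uint32_t)p[0]);
}

/*
 * memcpy-based load: no alignment assumption, host byte order.  The
 * compiler may reduce this to one word load when it knows the pointer
 * is aligned, or expand it to byte copies when it does not.
 */
uint32_t
load32_memcpy(const void *p)
{
	uint32_t t;

	memcpy(&t, p, sizeof(t));
	return (t);
}

/* memcpy-based store, the counterpart of load32_memcpy(). */
void
store32_memcpy(void *p, uint32_t u)
{

	memcpy(p, &u, sizeof(u));
}
```

Both styles are safe on a misaligned pointer such as buf + 1; which one is cheaper depends on what the compiler can prove about alignment.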
Maybe the compiler is avoiding the shifts and masks for the packed case
intentionally.

Timing tests on flame and pluto showed some problems.  The memcpy
versions are mostly faster in the non-__packed case and slower in the
__packed case.  This is as expected.  The above case where the compiler
doesn't virtualize the memcpy is especially slow, as expected, but there
are some other slow cases, and lots of differences between flame and
pluto.  The __packed case is 4-20 times slower.

Bruce