Date: Sat, 28 Jun 2008 12:45:43 +1000 (EST)
From: Bruce Evans <brde@optusnet.com.au>
To: Christoph Mallon <christoph.mallon@gmx.de>
Cc: cvs-src@freebsd.org, src-committers@freebsd.org, cvs-all@freebsd.org, Marius Strobl <marius@alchemy.franken.de>
Subject: Re: cvs commit: src/sys/sparc64/include in_cksum.h
Message-ID: <20080628121025.F89039@delplex.bde.org>
In-Reply-To: <48657058.6020102@gmx.de>
References: <200806252105.m5PL5AUp064418@repoman.freebsd.org> <48654667.1040401@gmx.de> <20080627222404.GJ1215@alchemy.franken.de> <48657058.6020102@gmx.de>
On Sat, 28 Jun 2008, Christoph Mallon wrote:

> I still think using __volatile only works by accident. volatile for an
> assembler block mostly means "this asm statement has an effect, even
> though the register specification looks otherwise, so do not optimise
> this away" (i.e. no CSE, do not remove if the result is unused, etc.).

Right.  Though I've never seen unnecessary __volatiles significantly
affecting i386 code.  This is because the code in the asms can't be
removed completely, and can't be moved much either.  With out-of-order
execution, the type of moves that are permitted (not across dependencies)
are precisely the type of moves that the CPU's scheduler can do or undo
no matter how the compiler orders the code.

> On a related note: Is inline assembler really necessary here? For
> example, couldn't in_addword() be written as
>
> static __inline u_short
> in_addword(u_short const sum, u_short const b)
> {
> 	u_int const t = sum + b;
> 	return t + (t >> 16);
> }
>
> ?  This should at least produce equally good code and, because the
> compiler has more knowledge about it than about an assembler block, it
> potentially leads to better code. I have no SPARC compiler at hand,
> though.

Last time I tried on i386, I couldn't get gcc to generate operations
involving carries for things like this, or the bswap instruction from C
code to reorder a word.  gcc 4.2 -O3 on i386 now generates for the above:

	movzwl	b, %eax		# starting from b and sum in memory
	movzwl	sum, %edx
	addl	%eax, %edx	# 32-bit add
	movl	%edx, %eax
	shrl	$16, %eax	# it does the shift laboriously
	addl	%edx, %eax
	movzwl	%ax, %eax	# don't really need 32-bit result
				# but need something to discard the high bits

In non-inline asm, I would write this as:

	movw	sum,%ax
	addw	b,%ax
	adcw	$0,%ax
	movzwl	%ax,%eax

Pipelining can make bloated code run better than it looks, but probably
not for the generated code above, since shifts are slow on some i386's
and there is an extra dependency for the extra shift operation.

> In fact, the in/out specification for this asm block looks rather bad:
>
> 	"=&r" (__ret), "=&r" (__tmp) : "r" (sum), "r" (b) : "cc");
>
> The "&" modifiers (do not use the same register as for any input
> operand) force the compiler to use 4 (!) registers in total for this
> asm block.  It could be done with 2 registers if a proper in/out
> specification was used.  At the very least the in/out specification can
> be improved, but I suspect using plain C is the better choice.

Hmm, the i386 version is much simpler.  It just forces use of 2 registers
when 1 is enough (change its constraint for (b) from "r" to "rm" to
permit adcw from either a register or memory, so that it can generate the
above code if b happens to be in memory; on most i386's, this optimizes
for space but makes no difference for time).

Bruce
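[Editor's sketch, for readers following the thread: the two alternatives
discussed above, written out as self-contained C.  The plain-C version is
Christoph's proposal verbatim; the inline-asm variant is only an
illustration of the "+r"/"rm" constraint style Bruce describes, not the
committed in_cksum.h code, and the name in_addword_asm is made up here.]

	#include <sys/types.h>		/* u_short, u_int */

	/*
	 * Plain-C one's-complement add: t is at most 0x1fffe, so t >> 16
	 * is 0 or 1, and folding that carry back in cannot carry again.
	 */
	static __inline u_short
	in_addword(u_short const sum, u_short const b)
	{
		u_int const t = sum + b;

		return (t + (t >> 16));
	}

	#if defined(__GNUC__) && defined(__i386__)
	/*
	 * Hypothetical i386 inline-asm variant: "+r" keeps sum in a
	 * single register used for both input and output, and "rm" lets
	 * the addw take b from a register or directly from memory.
	 */
	static __inline u_short
	in_addword_asm(u_short sum, u_short b)
	{
		__asm("addw %1, %0\n\tadcw $0, %0"
		    : "+r" (sum)
		    : "rm" (b)
		    : "cc");
		return (sum);
	}
	#endif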