From owner-cvs-all@FreeBSD.ORG Sat Jun 28 08:38:56 2008 Return-Path: Delivered-To: cvs-all@freebsd.org Received: from mx1.freebsd.org (mx1.freebsd.org [IPv6:2001:4f8:fff6::34]) by hub.freebsd.org (Postfix) with ESMTP id 9E9AA10656F2 for ; Sat, 28 Jun 2008 08:38:56 +0000 (UTC) (envelope-from christoph.mallon@gmx.de) Received: from mail.gmx.net (mail.gmx.net [213.165.64.20]) by mx1.freebsd.org (Postfix) with SMTP id 02DD58FC14 for ; Sat, 28 Jun 2008 08:38:55 +0000 (UTC) (envelope-from christoph.mallon@gmx.de) Received: (qmail invoked by alias); 28 Jun 2008 08:38:54 -0000 Received: from p54A3DE64.dip.t-dialin.net (EHLO tron.homeunix.org) [84.163.222.100] by mail.gmx.net (mp021) with SMTP; 28 Jun 2008 10:38:54 +0200 X-Authenticated: #1673122 X-Provags-ID: V01U2FsdGVkX1/W6vRVlffWHrdycWDYzg8jjcwC3eGyrn1j/gPntn 1xsAjEkaf3paAY Message-ID: <4865F89D.4090207@gmx.de> Date: Sat, 28 Jun 2008 10:38:53 +0200 From: Christoph Mallon User-Agent: Thunderbird 2.0.0.9 (X11/20071230) MIME-Version: 1.0 To: Bruce Evans References: <200806252105.m5PL5AUp064418@repoman.freebsd.org> <48654667.1040401@gmx.de> <20080627222404.GJ1215@alchemy.franken.de> <48657058.6020102@gmx.de> <20080628121025.F89039@delplex.bde.org> In-Reply-To: <20080628121025.F89039@delplex.bde.org> Content-Type: text/plain; charset=ISO-8859-1; format=flowed Content-Transfer-Encoding: 8bit X-Y-GMX-Trusted: 0 Cc: cvs-src@freebsd.org, src-committers@freebsd.org, cvs-all@freebsd.org, Marius Strobl Subject: Re: cvs commit: src/sys/sparc64/include in_cksum.h X-BeenThere: cvs-all@freebsd.org X-Mailman-Version: 2.1.5 Precedence: list List-Id: CVS commit messages for the entire tree List-Unsubscribe: , List-Archive: List-Post: List-Help: List-Subscribe: , X-List-Received-Date: Sat, 28 Jun 2008 08:38:56 -0000 Bruce Evans wrote: > On Sat, 28 Jun 2008, Christoph Mallon wrote: > >> I still think, using __volatile only works by accident. volatile for >> an assembler block mostly means "this asm statement has an effect, >> even though the register specification looks otherwise, so do not >> optimise this away (i.e. no CSE, do not remove if result is unused etc.). > > Right. Though I've never seen unnecessary's __volatiles significantly > affecting i386 code. This is because the code in the asms can't be > removed completely, and can't be moved much either. With out of order > execution, the type of moves that are permitted (not across dependencies) > are precisely the type of moves that the CPU's scheduler can do or undo > no matter how the compiler orders the code. I disagree. For example look at the use of in_addword() in dev/sk/if_sk.cv in line 2819: csum1 = htons(csum & 0xffff); csum2 = htons((csum >> 16) & 0xffff); ipcsum = in_addword(csum1, ~csum2 & 0xffff); /* checksum fixup for IP options */ len = hlen - sizeof(struct ip); if (len > 0) { return; } The calculation will be executed even if the following if (len > 0) leaves the function and the value of ipcsum is unused. If in_addword() is not marked volatile it can be moved after the if and not be executed in all cases. csum1 and csum2 can be moved after the if, too. >> On a related note: Is inline assembler really necessary here? For >> example couldn't in_addword() be written as >> static __inline u_short >> in_addword(u_short const sum, u_short const b) >> { >> u_int const t = sum + b; >> return t + (t >> 16); >> } ? >> This should at least produce equally good code and because the >> compiler has more knowledge about it than an assembler block, it >> potentially leads to better code. I have no SPARC compiler at hand, >> though. > > Last time I tried on i386, I couldn't get gcc to generate operations > involving carries for things like this, or the bswap instruction from > C code to reorder a word. gcc4.2 -O3 on i386 now generates for the above: > > movzwl b, %eax # starting from b and sum in memory > movzwl sum, %edx > addl %eax, %edx # 32-bit add > movl %edx, %eax > shrl $16, %eax # it does the shift laboriously > addl %edx, %eax > movzwl %ax, %eax # don't really need 32-bit result > # but need something to discard the high bits If the upper 16 bits are not "looked at" then the final movzwl can be optimised away. Many instructions, like add, shl and mul, can live with "garbage" in the upper 16 bits. Only if a "bad" instruction, like shr or div, is encountered, the upper 16 bits have to be cleared. The current x86 implementation of in_addword() using inline assembler causes the compiler to add a movzwl, too, before the return. > In non-inline asm, I would write this as: > > movw sum,%ax > addw b,%ax > adcw $0,%ax > movzwl %ax,%eax You do not want to use 16bit instructions on modern x86 processors. These instructions are slow. Intel states that decoding a 16bit operation takes 6 cycles instead of the usual 1. (Intel® 64 and IA-32 Architectures Optimization Reference Manual, section 2.1.2.2 Instruction PreDecode) > Pipelining can make bloated code run better than it looks, but probably > not for the generated code above, since shifts are slow on some i386's > and there is an extra dependency for the extra shift operation. Shifts were slow on early generations of the Pentium 4. Intel corrected this "glitch" in later generations. >> In fact the in/out specification for this asm block looks rather bad: >> "=&r" (__ret), "=&r" (__tmp) : "r" (sum), "r" (b) : "cc"); >> The "&"-modifiers (do not use the same registers as for any input >> operand value) force the compiler to use 4 (!) register in total for >> this asm block. It could be done with 2 registers if a proper in/out >> specification was used. At the very least the in/out specification can >> be improved, but I suspect using plain C is the better choice. > > Hmm, the i386 version is much simpler. It just forces use of 2 registers > when 1 is enough (change its constraint for (b) from "r" to "rm" to permit > adcw from either register or memory, so that it can generate the above code > if b happens to be in memory; on most i386's, this optimizes for space but > makes no difference for time). Sometimes a few less bytes in an inner loop can have dramatic effects. Saving a register is always (ok, most of the time) a good idea on x86. On the other hand, the function is inlined so proably its operands do not originate from memory locations. But I still think the better solution is to use simple C instead of inline assembler. Regards Christoph