Date: Fri, 4 Jun 2010 17:30:47 +1000 (EST)
From: Bruce Evans <brde@optusnet.com.au>
To: George Neville-Neil <gnn@freebsd.org>
Cc: net@freebsd.org
Subject: Re: A slight change to tcpip_fillheaders...
Message-ID: <20100604165857.D28688@delplex.bde.org>
In-Reply-To: <54198502-A432-4FA7-9176-0AB85D809597@freebsd.org>
References: <0BC7AD09-B627-4F6A-AD93-B7E794A78CA2@freebsd.org> <20100603181439.Q27699@delplex.bde.org> <54198502-A432-4FA7-9176-0AB85D809597@freebsd.org>
On Thu, 3 Jun 2010, George Neville-Neil wrote:

> For what it's worth I checked the assembly for both versions as well. The bzero
> version does not inline, as you said, and the original does do a move of
> 0 for each and every field, again on Nehalem with our default version of
> gcc.
>
> I think that for now I will leave this alone, the code is clear either way,
> and what I cared about was finding out if the code could be sped up.

I couldn't find any options to make gcc-4.2.1 coalesce the assignments in
the following simple example:

%%%
struct foo {
	char x;
	char y;
};

void
xx(struct foo *fp)
{
	fp->x = 0;
	fp->y = 0;
}
%%%

The non-coalesced version may be a bottleneck in the instruction stream in
some relatively rare cases. The worst case seems to be non-coalescing of
8 8-bit variables on a 64-bit arch. (gcc does do the coalescing for
bit-fields, else the worst case would be 64 assignments of 1-bit bit-fields
generating 3*64 micro-instructions (3 for each assignment to preserve
nearby bits).) But since there are no dependencies between these
assignments they are easy to schedule, and 8 instructions isn't many (they
probably take 4 cycles).

struct ip has 11 separate fields (after combining the bit-fields). 11
instructions for these is only a few; the extern bzero() takes almost that
many just to call. Then on i386 it takes 12 instructions internally for
administrivia and 5 instructions internally to do the work; on amd64 it
takes 7 instructions internally for administrivia and 6 instructions
internally to do the work (amd64 bzero actually does more assignments
internally -- ones of size 8,8,1,1,1,1 instead of ones of size 4,4,4,4,4;
it could do fewer, but only at a cost of more for administrivia). The
function call instructions and other administrivia instructions are almost
all heavyweight ones with strong dependencies, so you would be lucky if
they ran in 25 cycles where the 11 assignments may run in 5.5 cycles. But
25 cycles isn't many, so the difference is usually insignificant.
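To make the comparison above easier to reproduce, here is a minimal sketch of the "8 8-bit variables" case. The names struct foo8, zero_fields(), zero_block() and zeroing_matches() are hypothetical and not from the original message, and memset() stands in for the kernel's bzero(). Whether the compiler coalesces the byte stores or inlines the fixed-size memset() depends on the compiler version and flags, so treat this only as a starting point for inspecting the generated assembly.

```c
#include <assert.h>
#include <string.h>

/* Hypothetical struct: eight 1-byte fields, 8 bytes total, no padding. */
struct foo8 {
	unsigned char a, b, c, d, e, f, g, h;
};

/* Field-by-field zeroing; a compiler may emit one byte store per field. */
void
zero_fields(struct foo8 *fp)
{
	assert(fp != NULL);
	fp->a = 0;
	fp->b = 0;
	fp->c = 0;
	fp->d = 0;
	fp->e = 0;
	fp->f = 0;
	fp->g = 0;
	fp->h = 0;
}

/* Whole-object zeroing; this may become a call or a single wide store. */
void
zero_block(struct foo8 *fp)
{
	assert(fp != NULL);
	memset(fp, 0, sizeof(*fp));
}

/* Returns 1 if both approaches leave identical bytes in the object. */
int
zeroing_matches(void)
{
	struct foo8 x, y;

	/* Start from a non-zero pattern so the zeroing is observable. */
	memset(&x, 0xff, sizeof(x));
	memset(&y, 0xff, sizeof(y));
	zero_fields(&x);
	zero_block(&y);
	return (memcmp(&x, &y, sizeof(x)) == 0);
}
```

Compiling this with something like "cc -O2 -S" and comparing the bodies of zero_fields() and zero_block() shows whether a given compiler coalesces the stores. The two functions are functionally equivalent here, so any difference is purely in instruction count and scheduling.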
Since this is initialization code, it may involve a cache miss or two,
taking several hundred cycles each...

Bruce