Date:      Fri, 4 Jun 2010 17:30:47 +1000 (EST)
From:      Bruce Evans <brde@optusnet.com.au>
To:        George Neville-Neil <gnn@freebsd.org>
Cc:        net@freebsd.org
Subject:   Re: A slight change to tcpip_fillheaders...
Message-ID:  <20100604165857.D28688@delplex.bde.org>
In-Reply-To: <54198502-A432-4FA7-9176-0AB85D809597@freebsd.org>
References:  <0BC7AD09-B627-4F6A-AD93-B7E794A78CA2@freebsd.org> <20100603181439.Q27699@delplex.bde.org> <54198502-A432-4FA7-9176-0AB85D809597@freebsd.org>

On Thu, 3 Jun 2010, George Neville-Neil wrote:

> For what it's worth I checked the assembly for both versions as well.  The bzero
> version does not inline, as you said, and the original does do a move of
> 0 for each and every field, again on Nehalem with our default version of
> gcc.
>
> I think that for now I will leave this alone, the code is clear either way,
> and what I cared about was finding out if the code could be sped up.

I couldn't find any options to make gcc-4.2.1 coalesce the assignments in the
following simple example:

%%%
struct foo {
 	char x;
 	char y;
};

void
xx(struct foo *fp)
{
	fp->x = 0;
	fp->y = 0;
}
%%%

The non-coalesced version may be a bottleneck in the instruction stream
in some relatively rare cases.  The worst case seems to be failing to
coalesce assignments to 8 8-bit variables on a 64-bit arch.  (gcc does do
the coalescing for bit-fields, else the worst case would be 64 assignments
of 1-bit bit-fields generating 3*64 micro-instructions (3 for each
assignment to preserve nearby bits).)  But since there are no dependencies
between these assignments they are easy to schedule, and 8 instructions
isn't many (they probably take 4 cycles).
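
To make the 8 x 8-bit case concrete, here is a minimal sketch of the two
forms.  struct bar and both function names are made up for illustration,
and whether a given compiler turns the fixed-size memcpy() into a single
64-bit store depends on the compiler and options, so check the assembly
rather than trusting this:

```c
#include <stdint.h>
#include <string.h>

/* Hypothetical struct: eight 8-bit fields, 8 bytes, no padding. */
struct bar {
	uint8_t a, b, c, d, e, f, g, h;
};

/* Non-coalesced form: one independent byte store per field. */
void
clear_by_field(struct bar *bp)
{
	bp->a = 0;
	bp->b = 0;
	bp->c = 0;
	bp->d = 0;
	bp->e = 0;
	bp->f = 0;
	bp->g = 0;
	bp->h = 0;
}

/*
 * Coalesced by hand: copy one 64-bit zero over the whole struct.
 * A memcpy() of a small constant size avoids aliasing problems and
 * is commonly expanded inline, but that is not guaranteed.
 */
void
clear_coalesced(struct bar *bp)
{
	uint64_t zero = 0;

	memcpy(bp, &zero, sizeof(zero));
}
```

Both functions leave the struct in the same state; only the number of
stores in the instruction stream may differ.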

struct ip has 11 separate fields (after combining the bit-fields).  11
instructions for these is only a few; the extern bzero() takes almost that
many just to call.  Then on i386 it takes 12 instructions internally
for administrivia and 5 instructions internally to do the work; on
amd64 it takes 7 instructions internally for administrivia and 6
instructions internally to do the work (amd64 bzero actually does more
assignments internally -- ones of size 8,8,1,1,1,1 instead of ones of
size 4,4,4,4,4; it could do fewer, but only at a cost of more
administrivia).  The function call instructions and other administrivia
instructions are almost all heavyweight ones with strong dependencies,
so you would be lucky if they ran in 25 cycles, whereas the 11 assignments
may run in 5.5 cycles.  But 25 cycles isn't many, so the difference is
usually insignificant.  Since this is initialization code, it may involve
a cache miss or two, taking several hundred cycles each...
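
For reference, a sketch of the two ways of clearing a 20-byte header.
struct hdr is a hypothetical stand-in laid out like an IPv4 header, not
the real struct ip, and memset() with a constant size stands in for the
extern bzero() discussed above; a compiler may expand it inline, in which
case there is no call overhead to compare at all:

```c
#include <stdint.h>
#include <string.h>

/* Hypothetical stand-in for an IPv4 header: 20 bytes, no padding. */
struct hdr {
	uint8_t  vhl;	/* version and header length, combined */
	uint8_t  tos;
	uint16_t len;
	uint16_t id;
	uint16_t off;
	uint8_t  ttl;
	uint8_t  p;
	uint16_t sum;
	uint32_t src;
	uint32_t dst;
};

/* One assignment per field; the stores are independent of each other. */
void
hdr_clear_fields(struct hdr *hp)
{
	hp->vhl = 0;
	hp->tos = 0;
	hp->len = 0;
	hp->id = 0;
	hp->off = 0;
	hp->ttl = 0;
	hp->p = 0;
	hp->sum = 0;
	hp->src = 0;
	hp->dst = 0;
}

/* Clear the whole struct at once; the size is a compile-time constant. */
void
hdr_clear_all(struct hdr *hp)
{
	memset(hp, 0, sizeof(*hp));
}
```

The two are interchangeable as far as the result goes, so the choice
comes down to clarity and to whatever the generated code looks like on
the target.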

Bruce


