Skip site navigation (1)Skip section navigation (2)
Date:      Thu, 3 Jun 2010 10:31:53 -0400
From:      George Neville-Neil <gnn@freebsd.org>
To:        Bruce Evans <brde@optusnet.com.au>
Cc:        net@freebsd.org
Subject:   Re: A slight change to tcpip_fillheaders...
Message-ID:  <54198502-A432-4FA7-9176-0AB85D809597@freebsd.org>
In-Reply-To: <20100603181439.Q27699@delplex.bde.org>
References:  <0BC7AD09-B627-4F6A-AD93-B7E794A78CA2@freebsd.org> <20100603181439.Q27699@delplex.bde.org>

next in thread | previous in thread | raw e-mail | index | archive | help
On Jun 3, 2010, at 04:45 , Bruce Evans wrote:
>=20
> On Wed, 2 Jun 2010, George Neville-Neil wrote:
>=20
>> A while back another src developer mentioned that he had gotten =
better performance by changing
>> tcpip_fillheaders() in the following way:
>=20
> It's unlikely to make any significant or even measurable difference,
> but...  I got apparent large (up to 20%) timing improvments in =
networking
> by similar changes (mostly going the other way and avoiding and/or
> bzero()), but further investigation showed that the improvements were
> very mysterious and mostly unrelated to my code changes -- equal
> improvements could be obtained by adding padding or reducing code size
> in unrelated code.  This might be explained by thrashing or avoiding
> thrashing of cache(s), but the effect was too large to easily explain,
> and too mysterious to easily obtain or avoid.
>=20
>> Index: tcp_subr.c
>> =3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=
=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=
=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D
>> --- tcp_subr.c	(revision 209083)
>> +++ tcp_subr.c	(working copy)
>> @@ -392,28 +392,19 @@
>> 		struct ip *ip;
>>=20
>> 		ip =3D (struct ip *)ip_ptr;
>> +		bzero(ip, sizeof(*ip));
>> 		ip->ip_v =3D IPVERSION;
>> 		ip->ip_hl =3D 5;
>> 		ip->ip_tos =3D inp->inp_ip_tos;
>> -		ip->ip_len =3D 0;
>> -		ip->ip_id =3D 0;
>> -		ip->ip_off =3D 0;
>> 		ip->ip_ttl =3D inp->inp_ip_ttl;
>> -		ip->ip_sum =3D 0;
>> 		ip->ip_p =3D IPPROTO_TCP;
>> 		ip->ip_src =3D inp->inp_laddr;
>> 		ip->ip_dst =3D inp->inp_faddr;
>> 	}
>> +	bzero(th, sizeof(*th));
>> 	th->th_sport =3D inp->inp_lport;
>> 	th->th_dport =3D inp->inp_fport;
>> -	th->th_seq =3D 0;
>> -	th->th_ack =3D 0;
>> -	th->th_x2 =3D 0;
>> 	th->th_off =3D 5;
>> -	th->th_flags =3D 0;
>> -	th->th_win =3D 0;
>> -	th->th_urp =3D 0;
>> -	th->th_sum =3D 0;		/* in_pseudo() is called later =
for ipv4 */
>> }
>>=20
>> /*
>=20
> The version with bzero() is a small or null pessimization assuming
> that the compiler is infinitiely smart.  If all the fields are =
initialized
> by both versions, then the result is the same, so the compiler may
> produce the same code for each.  The version with bzero() obvious
> initializes all the fields, but might not.  Howver, bzero() is extern
> in FreeBSD, and gcc doesn't attempt optimizations like inlining it
> despite it being extern, so the version with bzero() is fundamentally
> slower in theory.  My changes initially involved implementing  all
> bzero() with small fixed size using __builtin_memset().  gcc will
> inline these, hopefully using writes of a register or immediate =
constant
> value of 0.  Many i386 versions of gcc are stupid about this and use
> the slow i386 string instructions instead, but most use separate =
writes
> of 0 if the length is small, and I try to choose the small size so
> small that the separate writes are used.  For the version with the
> explicit separate writes of 0, many of the fields are smaller than
> the register size, but none are volatile so the compiler is free to
> combine the writes.  gcc does optimizations like this.  I don't know
> if it actually does them all or if it is inhibited by the writes of
> nonzero for a few fields (C requires write ordering to appear to be
> strict, but hopefully there aren't any aliasing problems in the above
> which require it to actually be strict).
>=20
>> I have tried this change with NetPIPE (NPtcp -b 100000) on a pair of =
machines using Intel igb devices and found
>> that it provides no improvement, but I am wondering if other people =
want to try this and
>> see if it improves throughput at all.  I was testing this on a =
Nehalem class machine, not sure if it
>> might help on other architectures.
>=20
> And don't forget, for such a change to be good, it needs to be good on =
most
> supported machines and not too bad on other supported machines.  This =
is
> especially hard to even test for changes related to memory and caches.

For what it's worth I checked the assembly for both versions as well.  =
The bzero
version does not inline, as you said, and the original does do a move of
0 for each and every field, again on Nehalem with our default version of
gcc.

I think that for now I will leave this alone, the code is clear either =
way,
and what I cared about was finding out if the code could be sped up.

Best,
George




Want to link to this message? Use this URL: <https://mail-archive.FreeBSD.org/cgi/mid.cgi?54198502-A432-4FA7-9176-0AB85D809597>