Date: Thu, 3 Jun 2010 10:31:53 -0400 From: George Neville-Neil <gnn@freebsd.org> To: Bruce Evans <brde@optusnet.com.au> Cc: net@freebsd.org Subject: Re: A slight change to tcpip_fillheaders... Message-ID: <54198502-A432-4FA7-9176-0AB85D809597@freebsd.org> In-Reply-To: <20100603181439.Q27699@delplex.bde.org> References: <0BC7AD09-B627-4F6A-AD93-B7E794A78CA2@freebsd.org> <20100603181439.Q27699@delplex.bde.org>
next in thread | previous in thread | raw e-mail | index | archive | help
On Jun 3, 2010, at 04:45 , Bruce Evans wrote: >=20 > On Wed, 2 Jun 2010, George Neville-Neil wrote: >=20 >> A while back another src developer mentioned that he had gotten = better performance by changing >> tcpip_fillheaders() in the following way: >=20 > It's unlikely to make any significant or even measurable difference, > but... I got apparent large (up to 20%) timing improvments in = networking > by similar changes (mostly going the other way and avoiding and/or > bzero()), but further investigation showed that the improvements were > very mysterious and mostly unrelated to my code changes -- equal > improvements could be obtained by adding padding or reducing code size > in unrelated code. This might be explained by thrashing or avoiding > thrashing of cache(s), but the effect was too large to easily explain, > and too mysterious to easily obtain or avoid. >=20 >> Index: tcp_subr.c >> =3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D= =3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D= =3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D >> --- tcp_subr.c (revision 209083) >> +++ tcp_subr.c (working copy) >> @@ -392,28 +392,19 @@ >> struct ip *ip; >>=20 >> ip =3D (struct ip *)ip_ptr; >> + bzero(ip, sizeof(*ip)); >> ip->ip_v =3D IPVERSION; >> ip->ip_hl =3D 5; >> ip->ip_tos =3D inp->inp_ip_tos; >> - ip->ip_len =3D 0; >> - ip->ip_id =3D 0; >> - ip->ip_off =3D 0; >> ip->ip_ttl =3D inp->inp_ip_ttl; >> - ip->ip_sum =3D 0; >> ip->ip_p =3D IPPROTO_TCP; >> ip->ip_src =3D inp->inp_laddr; >> ip->ip_dst =3D inp->inp_faddr; >> } >> + bzero(th, sizeof(*th)); >> th->th_sport =3D inp->inp_lport; >> th->th_dport =3D inp->inp_fport; >> - th->th_seq =3D 0; >> - th->th_ack =3D 0; >> - th->th_x2 =3D 0; >> th->th_off =3D 5; >> - th->th_flags =3D 0; >> - th->th_win =3D 0; >> - th->th_urp =3D 0; >> - th->th_sum =3D 0; /* in_pseudo() is called later = for ipv4 */ >> } >>=20 >> /* >=20 > The version with bzero() is a small or null pessimization assuming > that the compiler is infinitiely smart. If all the fields are = initialized > by both versions, then the result is the same, so the compiler may > produce the same code for each. The version with bzero() obvious > initializes all the fields, but might not. Howver, bzero() is extern > in FreeBSD, and gcc doesn't attempt optimizations like inlining it > despite it being extern, so the version with bzero() is fundamentally > slower in theory. My changes initially involved implementing all > bzero() with small fixed size using __builtin_memset(). gcc will > inline these, hopefully using writes of a register or immediate = constant > value of 0. Many i386 versions of gcc are stupid about this and use > the slow i386 string instructions instead, but most use separate = writes > of 0 if the length is small, and I try to choose the small size so > small that the separate writes are used. For the version with the > explicit separate writes of 0, many of the fields are smaller than > the register size, but none are volatile so the compiler is free to > combine the writes. gcc does optimizations like this. I don't know > if it actually does them all or if it is inhibited by the writes of > nonzero for a few fields (C requires write ordering to appear to be > strict, but hopefully there aren't any aliasing problems in the above > which require it to actually be strict). >=20 >> I have tried this change with NetPIPE (NPtcp -b 100000) on a pair of = machines using Intel igb devices and found >> that it provides no improvement, but I am wondering if other people = want to try this and >> see if it improves throughput at all. I was testing this on a = Nehalem class machine, not sure if it >> might help on other architectures. >=20 > And don't forget, for such a change to be good, it needs to be good on = most > supported machines and not too bad on other supported machines. This = is > especially hard to even test for changes related to memory and caches. For what it's worth I checked the assembly for both versions as well. = The bzero version does not inline, as you said, and the original does do a move of 0 for each and every field, again on Nehalem with our default version of gcc. I think that for now I will leave this alone, the code is clear either = way, and what I cared about was finding out if the code could be sped up. Best, George
Want to link to this message? Use this URL: <https://mail-archive.FreeBSD.org/cgi/mid.cgi?54198502-A432-4FA7-9176-0AB85D809597>