From owner-freebsd-net@FreeBSD.ORG Thu Jun 3 08:45:27 2010 Return-Path: Delivered-To: net@freebsd.org Received: from mx1.freebsd.org (mx1.freebsd.org [IPv6:2001:4f8:fff6::34]) by hub.freebsd.org (Postfix) with ESMTP id B9C44106567C for ; Thu, 3 Jun 2010 08:45:27 +0000 (UTC) (envelope-from brde@optusnet.com.au) Received: from mail07.syd.optusnet.com.au (mail07.syd.optusnet.com.au [211.29.132.188]) by mx1.freebsd.org (Postfix) with ESMTP id 539538FC1E for ; Thu, 3 Jun 2010 08:45:27 +0000 (UTC) Received: from c122-106-160-243.carlnfd1.nsw.optusnet.com.au (c122-106-160-243.carlnfd1.nsw.optusnet.com.au [122.106.160.243]) by mail07.syd.optusnet.com.au (8.13.1/8.13.1) with ESMTP id o538jNrp011537 (version=TLSv1/SSLv3 cipher=DHE-RSA-AES256-SHA bits=256 verify=NO); Thu, 3 Jun 2010 18:45:25 +1000 Date: Thu, 3 Jun 2010 18:45:22 +1000 (EST) From: Bruce Evans X-X-Sender: bde@delplex.bde.org To: George Neville-Neil In-Reply-To: <0BC7AD09-B627-4F6A-AD93-B7E794A78CA2@freebsd.org> Message-ID: <20100603181439.Q27699@delplex.bde.org> References: <0BC7AD09-B627-4F6A-AD93-B7E794A78CA2@freebsd.org> MIME-Version: 1.0 Content-Type: TEXT/PLAIN; charset=US-ASCII; format=flowed Cc: net@freebsd.org Subject: Re: A slight change to tcpip_fillheaders... X-BeenThere: freebsd-net@freebsd.org X-Mailman-Version: 2.1.5 Precedence: list List-Id: Networking and TCP/IP with FreeBSD List-Unsubscribe: , List-Archive: List-Post: List-Help: List-Subscribe: , X-List-Received-Date: Thu, 03 Jun 2010 08:45:27 -0000 On Wed, 2 Jun 2010, George Neville-Neil wrote: > A while back another src developer mentioned that he had gotten better performance by changing > tcpip_fillheaders() in the following way: It's unlikely to make any significant or even measurable difference, but... I got apparent large (up to 20%) timing improvments in networking by similar changes (mostly going the other way and avoiding and/or bzero()), but further investigation showed that the improvements were very mysterious and mostly unrelated to my code changes -- equal improvements could be obtained by adding padding or reducing code size in unrelated code. This might be explained by thrashing or avoiding thrashing of cache(s), but the effect was too large to easily explain, and too mysterious to easily obtain or avoid. > Index: tcp_subr.c > =================================================================== > --- tcp_subr.c (revision 209083) > +++ tcp_subr.c (working copy) > @@ -392,28 +392,19 @@ > struct ip *ip; > > ip = (struct ip *)ip_ptr; > + bzero(ip, sizeof(*ip)); > ip->ip_v = IPVERSION; > ip->ip_hl = 5; > ip->ip_tos = inp->inp_ip_tos; > - ip->ip_len = 0; > - ip->ip_id = 0; > - ip->ip_off = 0; > ip->ip_ttl = inp->inp_ip_ttl; > - ip->ip_sum = 0; > ip->ip_p = IPPROTO_TCP; > ip->ip_src = inp->inp_laddr; > ip->ip_dst = inp->inp_faddr; > } > + bzero(th, sizeof(*th)); > th->th_sport = inp->inp_lport; > th->th_dport = inp->inp_fport; > - th->th_seq = 0; > - th->th_ack = 0; > - th->th_x2 = 0; > th->th_off = 5; > - th->th_flags = 0; > - th->th_win = 0; > - th->th_urp = 0; > - th->th_sum = 0; /* in_pseudo() is called later for ipv4 */ > } > > /* The version with bzero() is a small or null pessimization assuming that the compiler is infinitiely smart. If all the fields are initialized by both versions, then the result is the same, so the compiler may produce the same code for each. The version with bzero() obvious initializes all the fields, but might not. Howver, bzero() is extern in FreeBSD, and gcc doesn't attempt optimizations like inlining it despite it being extern, so the version with bzero() is fundamentally slower in theory. My changes initially involved implementing all bzero() with small fixed size using __builtin_memset(). gcc will inline these, hopefully using writes of a register or immediate constant value of 0. Many i386 versions of gcc are stupid about this and use the slow i386 string instructions instead, but most use separate writes of 0 if the length is small, and I try to choose the small size so small that the separate writes are used. For the version with the explicit separate writes of 0, many of the fields are smaller than the register size, but none are volatile so the compiler is free to combine the writes. gcc does optimizations like this. I don't know if it actually does them all or if it is inhibited by the writes of nonzero for a few fields (C requires write ordering to appear to be strict, but hopefully there aren't any aliasing problems in the above which require it to actually be strict). > I have tried this change with NetPIPE (NPtcp -b 100000) on a pair of machines using Intel igb devices and found > that it provides no improvement, but I am wondering if other people want to try this and > see if it improves throughput at all. I was testing this on a Nehalem class machine, not sure if it > might help on other architectures. And don't forget, for such a change to be good, it needs to be good on most supported machines and not too bad on other supported machines. This is especially hard to even test for changes related to memory and caches. Bruce