Date: Tue, 1 May 2012 13:34:12 -0400
From: George Neville-Neil <gnn@neville-neil.com>
To: Luigi Rizzo <rizzo@iet.unipi.it>
Cc: current@freebsd.org, net@freebsd.org
Subject: Re: more network performance info: ether_output()
Message-ID: <265D6A39-2F35-4D31-91EF-16146E77E92E@neville-neil.com>
In-Reply-To: <20120501154031.GB80942@onelab2.iet.unipi.it>
References: <20120419133018.GA91364@onelab2.iet.unipi.it> <20120420190309.GA5617@onelab2.iet.unipi.it> <B6E8D175-9575-4211-92DB-A525E7F6A6A6@neville-neil.com> <20120501154031.GB80942@onelab2.iet.unipi.it>
On May 1, 2012, at 11:40, Luigi Rizzo wrote:

> On Tue, May 01, 2012 at 10:27:42AM -0400, George Neville-Neil wrote:
>>
>> On Apr 20, 2012, at 15:03, Luigi Rizzo wrote:
>>
>>> Continuing my profiling of network performance, another place
>>> where we waste a lot of time is if_ethersubr.c::ether_output().
>>>
>>> From the beginning of ether_output() to the final call to
>>> ether_output_frame(), the code takes slightly more than 210ns
>>> on my i7-870 CPU running at 2.93 GHz + TurboBoost.
>>> In particular:
>>>
>>> - the route does not have a MAC address (lle) attached, which causes
>>>   arpresolve() to be called every time. This consumes about 100ns
>>>   and happens even with locally sourced TCP.
>>>   Using the flowtable cuts this time down to about 30-40ns.
>>>
>>> - another 100ns is spent copying the MAC header into the mbuf
>>>   and then checking whether a local copy should be looped back.
>>>   Unfortunately the code here is a bit convoluted, so the
>>>   header fields are copied twice, using memcpy on the
>>>   individual pieces.
>>>
>>> Note that all of the above happens not just with my UDP flooding
>>> tests, but also with regular TCP traffic.
>>
>> Hi Luigi,
>>
>> I'm really glad you're working on this. I may have missed this in a
>> thread, but are you tracking these issues somewhere so we can pick
>> them up and fix them?
>>
>> Also, how are you doing the measurements?
>
> The measurements are done with tools/tools/netrate/netsend and
> kernel patches to return from sendto() at various places in the
> stack (from the syscall entry point down to the device driver).
> A patch is attached. You don't really need netmap to run it;
> it was just a convenient place to put the variables.
>
> I am not sure how much we can "fix"; there are multiple expensive
> functions on the tx path, and probably also on the rx path.
>
> My hope, at least for the tx path, is that we can find a way to
> install a "fastpath" handler in the socket.
> When there is no handler installed (e.g. on the first packet, or for
> unsupported protocols/interfaces), everything works as usual. Then,
> when the packet reaches the bottom of the stack, we try to update
> the socket with a copy of the headers generated in the process and
> the name of the fastpath function to be called. Subsequent
> transmissions can then shortcut the stack and go straight to the
> device output routine.
>
> I don't have data on the receive path, or good ideas on how to
> proceed -- the advantage of the tx path is that traffic is implicitly
> classified, whereas that might not be the case for incoming traffic,
> and classification might be the expensive step.
>
> Hopefully we'll have time to discuss this next week in Ottawa.

Yes, I think we should.

Best,
George
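[Editorial note: to make the double-copy issue discussed above concrete, here is a rough sketch of the pattern in C. This is illustrative only, not the actual if_ethersubr.c code, and the lle_hdr field in the second variant is invented.]

	/*
	 * Sketch of the pattern Luigi describes: the Ethernet header
	 * is assembled field by field with separate memcpy calls, and
	 * the assembled pieces end up copied again into the mbuf.
	 */
	struct ether_header eh;

	memcpy(eh.ether_dhost, dst_mac, ETHER_ADDR_LEN);
	memcpy(eh.ether_shost, IF_LLADDR(ifp), ETHER_ADDR_LEN);
	eh.ether_type = htons(ETHERTYPE_IP);

	M_PREPEND(m, ETHER_HDR_LEN, M_DONTWAIT);
	if (m == NULL)
		return (ENOBUFS);
	memcpy(mtod(m, caddr_t), &eh, ETHER_HDR_LEN);	/* second copy */

	/*
	 * If arpresolve()/the lle handed back a fully formed 14-byte
	 * header (the lle_hdr field is hypothetical), one copy would do:
	 */
	memcpy(mtod(m, caddr_t), lle->lle_hdr, ETHER_HDR_LEN);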
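[Editorial note: the measurement technique is easier to picture with a sketch. The sysctl name, the TXA_* constants, and their placement below are invented for illustration; the real patch is the one attached to the mail.]

#include <sys/param.h>
#include <sys/kernel.h>
#include <sys/sysctl.h>
#include <sys/mbuf.h>

/* 0 = normal operation; higher values bail out deeper in the stack. */
static int tx_abort_level = 0;
SYSCTL_INT(_net, OID_AUTO, tx_abort_level, CTLFLAG_RW,
    &tx_abort_level, 0, "Return early from the tx path at this layer");

#define TXA_SYSCALL	1	/* right after the sendto() syscall entry */
#define TXA_SOCKET	2	/* after the socket/protocol layer        */
#define TXA_ETHER	3	/* after ether_output() header setup      */

/*
 * At each layer boundary on the tx path, a hook like this drops the
 * packet but reports success, so netsend keeps transmitting at full
 * rate and the per-layer cost falls out of the rate differences:
 */
	if (tx_abort_level == TXA_ETHER) {
		m_freem(m);
		return (0);
	}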
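[Editorial note: likewise, a minimal sketch of the proposed socket fastpath. Nothing like this exists in the tree; every name below (struct so_fastpath, the so_fastpath pointer in the socket, so_fast_send()) is hypothetical, and the output hook simply mirrors the if_transmit signature.]

#include <sys/param.h>
#include <sys/mbuf.h>
#include <sys/socket.h>
#include <sys/socketvar.h>
#include <net/if.h>

/*
 * Cached per-socket transmit state, filled in by the first packet's
 * trip through the full stack.
 */
struct so_fastpath {
	struct ifnet	*fp_ifp;	/* resolved output interface    */
	u_int		 fp_hdrlen;	/* bytes of cached headers      */
	char		 fp_hdr[64];	/* ether+ip+udp header template */
	int		(*fp_output)(struct ifnet *, struct mbuf *);
};

/*
 * Shortcut transmit: prepend the cached headers and hand the mbuf
 * straight to the driver, skipping ip_output()/ether_output().
 * Returns EOPNOTSUPP when no handler is installed, so the caller
 * falls back to the normal (slow) path.
 */
static int
so_fast_send(struct socket *so, struct mbuf *m)
{
	struct so_fastpath *fp = so->so_fastpath;	/* invented field */

	if (fp == NULL)
		return (EOPNOTSUPP);
	M_PREPEND(m, fp->fp_hdrlen, M_DONTWAIT);
	if (m == NULL)
		return (ENOBUFS);
	bcopy(fp->fp_hdr, mtod(m, caddr_t), fp->fp_hdrlen);
	/* ... patch ip_len, udp length, and checksums here ... */
	return (fp->fp_output(fp->fp_ifp, m));
}

[The hard part such a scheme would have to solve is invalidation: a route change, interface-down event, or MTU change would have to clear the cached state so traffic falls back to the full stack.]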