Date:      Tue, 1 May 2012 13:34:12 -0400
From:      George Neville-Neil <gnn@neville-neil.com>
To:        Luigi Rizzo <rizzo@iet.unipi.it>
Cc:        current@freebsd.org, net@freebsd.org
Subject:   Re: more network performance info: ether_output()
Message-ID:  <265D6A39-2F35-4D31-91EF-16146E77E92E@neville-neil.com>
In-Reply-To: <20120501154031.GB80942@onelab2.iet.unipi.it>
References:  <20120419133018.GA91364@onelab2.iet.unipi.it> <20120420190309.GA5617@onelab2.iet.unipi.it> <B6E8D175-9575-4211-92DB-A525E7F6A6A6@neville-neil.com> <20120501154031.GB80942@onelab2.iet.unipi.it>


On May 1, 2012, at 11:40, Luigi Rizzo wrote:

> On Tue, May 01, 2012 at 10:27:42AM -0400, George Neville-Neil wrote:
>>
>> On Apr 20, 2012, at 15:03, Luigi Rizzo wrote:
>>
>>> Continuing my profiling of network performance, another place
>>> where we waste a lot of time is if_ethersubr.c::ether_output().
>>>
>>> In particular, from the beginning of ether_output() to the
>>> final call to ether_output_frame() the code takes slightly
>>> more than 210ns on my i7-870 CPU running at 2.93 GHz + TurboBoost.
>>> Specifically:
>>>
>>> - the route does not have a MAC address (lle) attached, which causes
>>> arpresolve() to be called every time. This consumes about 100ns.
>>> It also happens with locally sourced TCP.
>>> Using the flowtable cuts this time down to about 30-40ns.
>>>
>>> - another 100ns is spent copying the MAC header into the mbuf
>>> and then checking whether a local copy should be looped back.
>>> Unfortunately the code here is a bit convoluted, so the
>>> header fields are copied twice, using memcpy on the
>>> individual pieces.
>>>
>>> Note that all the above happens not just with my udp flooding
>>> tests, but also with regular TCP traffic.
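
To illustrate the pattern described above -- this is a hypothetical userland
sketch, not the actual if_ethersubr.c or flowtable code, and every name in it
is made up -- once arpresolve()/the flowtable have produced the destination
MAC, the whole 14-byte Ethernet header could be cached and prepended with a
single copy instead of being reassembled field by field for every packet:

#include <stdint.h>
#include <string.h>

#define ETHER_ADDR_LEN  6

/* Hypothetical cached L2 state, filled in once after resolution. */
struct l2_cache {
        uint8_t hdr[2 * ETHER_ADDR_LEN + 2];    /* dst MAC, src MAC, ethertype */
        int     valid;
};

/* Slow pattern: rebuild the header piece by piece on every packet. */
static void
fill_header_piecewise(uint8_t *pkt, const uint8_t *dst, const uint8_t *src,
    uint16_t ethertype_be)
{
        memcpy(pkt, dst, ETHER_ADDR_LEN);
        memcpy(pkt + ETHER_ADDR_LEN, src, ETHER_ADDR_LEN);
        memcpy(pkt + 2 * ETHER_ADDR_LEN, &ethertype_be, sizeof(ethertype_be));
}

/* Fast pattern: one copy of a template built the first time around. */
static void
fill_header_cached(uint8_t *pkt, const struct l2_cache *c)
{
        memcpy(pkt, c->hdr, sizeof(c->hdr));
}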
>>
>> Hi Luigi,
>>
>> I'm really glad you're working on this.  I may have missed this in a thread,
>> but are you tracking these somewhere so we can pick them up and fix them?
>>
>> Also, how are you doing the measurements?
>
> The measurements are done with tools/tools/netrate/netsend and
> kernel patches to return from sendto() at various places in the
> stack (from the syscall entry point down to the device driver).
> A patch is attached. You don't really need netmap to run it,
> it was just a convenient place to put the variables.
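
For readers without the attachment, here is a rough sketch of what such an
instrumentation point can look like; the sysctl name, the level numbering and
the hook placement are my assumptions, not necessarily what Luigi's patch does:

#include <sys/param.h>
#include <sys/systm.h>
#include <sys/kernel.h>
#include <sys/mbuf.h>
#include <sys/sysctl.h>

/* 0 = transmit normally, N > 0 = drop the packet after stack layer N. */
static int tx_cutoff = 0;
SYSCTL_INT(_net, OID_AUTO, tx_cutoff, CTLFLAG_RW, &tx_cutoff, 0,
    "discard outgoing packets after this layer (0 = disabled)");

/*
 * A check like this goes into each int-returning layer of interest (e.g.
 * after the socket/UDP code, after ip_output(), before ether_output_frame(),
 * before if_transmit()).  Freeing the mbuf and reporting success means
 * netsend's packets-per-second figure reflects only the cost of the layers
 * above the cut point.
 */
#define TX_CUTOFF_CHECK(level, m) do {                          \
        if (tx_cutoff == (level)) {                             \
                m_freem(m);                                     \
                return (0);                                     \
        }                                                       \
} while (0)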
>
> I am not sure how much we can "fix"; there are multiple expensive
> functions on the tx path, and probably also on the rx path.
>
> My hope, at least for the tx path, is that we can find a way to install a
> "fastpath" handler in the socket.
> When there is no handler installed (e.g. on the first packet or
> unsupported protocols/interfaces) everything works as usual. Then
> when the packet reaches the bottom of the stack, we try to update
> the socket with a copy of the headers generated in the process, and
> the name of the fastpath function to be called.  Subsequent transmissions
> will then be able to shortcut the stack and go straight to the
> device output routine.
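
To make that idea concrete, one possible shape for such per-socket state is
sketched below; every name is hypothetical, and nothing like this exists in
the tree today:

#include <sys/param.h>
#include <sys/systm.h>
#include <sys/errno.h>
#include <sys/malloc.h>
#include <sys/mbuf.h>
#include <sys/socket.h>
#include <net/if.h>
#include <net/if_var.h>

/* Hypothetical per-socket state installed after the first full traversal. */
struct so_fastpath {
        int             (*fp_output)(struct ifnet *, struct mbuf *);
        struct ifnet    *fp_ifp;        /* egress interface */
        u_int           fp_hdrlen;      /* length of the cached headers */
        char            fp_hdr[64];     /* Ethernet + IP + UDP template */
};

/*
 * Fast transmit: prepend the cached header template with one copy and hand
 * the packet straight to the device output routine, bypassing the normal
 * protocol layers.  Per-packet fields (IP id and length, UDP length,
 * checksums) would still need fixing up before this could be used for real.
 */
static int
so_send_fast(struct so_fastpath *fp, struct mbuf *m)
{
        M_PREPEND(m, fp->fp_hdrlen, M_NOWAIT);
        if (m == NULL)
                return (ENOBUFS);
        bcopy(fp->fp_hdr, mtod(m, char *), fp->fp_hdrlen);
        return (fp->fp_output(fp->fp_ifp, m));
}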
>
> I don't have data on the receive path, or good ideas on how to proceed -- the
> advantage of the tx path is that traffic is implicitly classified,
> whereas that might not be the case for incoming traffic, and classification
> might be the expensive step.
>
> Hopefully we'll have time to discuss this next week in Ottawa.

Yes, I think we should.

Best,
George



