Date:      Thu, 18 Oct 2012 21:04:20 +0200
From:      Luigi Rizzo <rizzo@iet.unipi.it>
To:        Andre Oppermann <oppermann@networx.ch>
Cc:        "Alexander V. Chernikov" <melifaro@freebsd.org>, Jack Vogel <jfvogel@gmail.com>, net@freebsd.org
Subject:   Re: ixgbe & if_igb RX ring locking
Message-ID:  <20121018190420.GB98348@onelab2.iet.unipi.it>
In-Reply-To: <5080020E.1010603@networx.ch>
References:  <5079A9A1.4070403@FreeBSD.org> <20121013182223.GA73341@onelab2.iet.unipi.it> <5080020E.1010603@networx.ch>

On Thu, Oct 18, 2012 at 03:20:14PM +0200, Andre Oppermann wrote:
> On 13.10.2012 20:22, Luigi Rizzo wrote:
> >On Sat, Oct 13, 2012 at 09:49:21PM +0400, Alexander V. Chernikov wrote:
> >>Hello list!
> >>
> >>
> >>The packet receive code in both ixgbe and if_igb looks like the following:
> >>
> >>
> >>ixgbe_msix_que
> >>
> >>-- ixgbe_rxeof()
> >>    {
> >>       IXGBE_RX_LOCK(rxr);
> >>         while
> >>         {
> >>            get_packet;
> >>
> >>            -- ixgbe_rx_input()
> >>               {
> >>                  ++ IXGBE_RX_UNLOCK(rxr);
> >>                  if_input(packet);
> >>                  ++ IXGBE_RX_LOCK(rxr);
> >>               }
> >>
> >>         }
> >>       IXGBE_RX_UNLOCK(rxr);
> >>     }
> >>
> >>Lines marked with ++ appeared in r209068 (igb) and r217593 (ixgbe).
> >>
> >>These lines probably mask LORs (if any) well.
> >>However, such a change introduces quite a significant performance drop:
> >>
> >>On my routing setup (nearly the same as in the previous Intel 10G
> >>thread on -net), adding the lock/unlock causes a drop from 2.8 Mpps
> >>to 2.3 Mpps, which is nearly 20%.
> >
> >one option could be (same as is done in the timer
> >routine in dummynet) to build a list of all the packets
> >that need to be sent to if_input(), and then call
> >if_input() with the entire list outside the lock.
> >
> >It would be even easier if we modified the various *_input()
> >routines to handle a list of mbufs instead of just one.
> 
> Not really. You'd just run into tons of layering complexity.
> Somewhere the decomposition and serialization has to be done.
> 
> Perhaps the right place is to dequeue a batch of packets from
> the HW ring and then have a task/thread send it up the stack
> one by one.

this is exactly what the dummynet code does -- collect a batch
of packets, release the lock, and then loop over the batch to feed
ip_input/ip_output or other things.
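
In driver terms this is a small change. A rough sketch (untested,
and the ring-access details are only illustrative -- get_packet()
here stands for whatever the driver does to pull the next completed
packet off the ring) of how ixgbe_rxeof() could do it:

	static void
	ixgbe_rxeof(struct rx_ring *rxr)
	{
		struct ifnet *ifp = rxr->adapter->ifp;
		struct mbuf *head = NULL, **tail = &head;
		struct mbuf *m, *next;

		IXGBE_RX_LOCK(rxr);
		while ((m = get_packet(rxr)) != NULL) {
			/* chain completed packets through m_nextpkt
			 * instead of calling if_input() under the lock */
			*tail = m;
			tail = &m->m_nextpkt;
		}
		IXGBE_RX_UNLOCK(rxr);

		/* feed the whole batch to the stack, lock not held */
		for (m = head; m != NULL; m = next) {
			next = m->m_nextpkt;
			m->m_nextpkt = NULL;
			(*ifp->if_input)(ifp, m);
		}
	}

so the lock/unlock cost is paid once per batch instead of once
per packet.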

My point was, however, that instead of having to write an explicit
loop in all clients of ether_input(), we could make ether_input()
itself (or an ether_input_batch(), the name does not really matter)
able to handle the whole batch and in turn call the main function.
The frontend could then apply some smarts to try and group
packets within the batch (not too different from TCP Receive Side
Coalescing/Large Receive Offload), and this could be done
without locking/unlocking on each packet.
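
As a starting point the batch variant could be as simple as the
following sketch (ether_input_batch() is a made-up name; packets
are assumed to be chained through m_nextpkt, and the function
would live next to ether_input() in net/if_ethersubr.c):

	#include <sys/param.h>
	#include <sys/mbuf.h>
	#include <net/if.h>
	#include <net/if_var.h>

	void
	ether_input_batch(struct ifnet *ifp, struct mbuf *m)
	{
		struct mbuf *next;

		/*
		 * Walk the m_nextpkt chain, handing packets to the
		 * existing single-packet input routine one at a time.
		 * Grouping smarts can be added here later without
		 * touching the drivers.
		 */
		while (m != NULL) {
			next = m->m_nextpkt;
			m->m_nextpkt = NULL;
			(*ifp->if_input)(ifp, m);
			m = next;
		}
	}

A driver then makes one call per batch instead of one per packet,
and the grouping logic has a single natural place to live.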

Furthermore, chances are that you can pass batches from one layer
to the next one in this way, something that wouldn't work if your
workflow can only handle one packet at a time.

And finally, the good thing is that implementation can be
incremental and on a case-by-case basis.

The VALE bridge uses this strategy
http://info.iet.unipi.it/~luigi/vale/
and moving batches instead of single packets brings the
forwarding rate from 4 to 17 Mpps.
At high rates, it really pays off.

cheers
luigi

> -- 
> Andre
> 


