Date: Thu, 18 Oct 2012 21:04:20 +0200
From: Luigi Rizzo <rizzo@iet.unipi.it>
To: Andre Oppermann <oppermann@networx.ch>
Cc: "Alexander V. Chernikov" <melifaro@freebsd.org>, Jack Vogel <jfvogel@gmail.com>, net@freebsd.org
Subject: Re: ixgbe & if_igb RX ring locking
Message-ID: <20121018190420.GB98348@onelab2.iet.unipi.it>
In-Reply-To: <5080020E.1010603@networx.ch>
References: <5079A9A1.4070403@FreeBSD.org> <20121013182223.GA73341@onelab2.iet.unipi.it> <5080020E.1010603@networx.ch>
On Thu, Oct 18, 2012 at 03:20:14PM +0200, Andre Oppermann wrote:
> On 13.10.2012 20:22, Luigi Rizzo wrote:
> >On Sat, Oct 13, 2012 at 09:49:21PM +0400, Alexander V. Chernikov wrote:
> >>Hello list!
> >>
> >>The packet receiving code for both ixgbe and if_igb looks like the
> >>following:
> >>
> >>ixgbe_msix_que
> >>  -- ixgbe_rxeof()
> >>     {
> >>         IXGBE_RX_LOCK(rxr);
> >>         while
> >>         {
> >>             get_packet;
> >>             -- ixgbe_rx_input()
> >>                {
> >>             ++     IXGBE_RX_UNLOCK(rxr);
> >>                    if_input(packet);
> >>             ++     IXGBE_RX_LOCK(rxr);
> >>                }
> >>         }
> >>         IXGBE_RX_UNLOCK(rxr);
> >>     }
> >>
> >>The lines marked with ++ appeared in r209068 (igb) and r217593 (ixgbe).
> >>
> >>These lines probably mask LORs (if any) well.
> >>However, such a change introduces a quite significant performance drop:
> >>
> >>On my routing setup (nearly the same as in the previous Intel 10G
> >>thread on -net) adding the lock/unlock causes a drop from 2.8 Mpps to
> >>2.3 Mpps, which is nearly 20%.
> >
> >One option could be (the same as is done in the timer routine in
> >dummynet) to build a list of all the packets that need to be sent to
> >if_input(), and then call if_input() with the entire list outside the
> >lock.
> >
> >It would be even easier if we modified the various *_input() routines
> >to handle a list of mbufs instead of just one.
>
> Not really. You'd just run into tons of layering complexity.
> Somewhere the decomposition and serialization has to be done.
>
> Perhaps the right place is to dequeue a batch of packets from
> the HW ring and then have a task/thread send it up the stack
> one by one.

This is exactly what the dummynet code does: collect a batch of packets,
release the lock, and then loop over the batch to feed ip_input(),
ip_output() or other consumers.

My point was, however, that instead of having to write an explicit loop
in all clients of ether_input(), we could make ether_input() itself (or
an ether_input_batch(); the name does not really matter) able to handle
the whole batch and in turn call the main function.
The frontend could then apply some smarts to try and group packets
within the batch (not too different from TCP Receive Side Coalescing or
Large Receive Offload), and this could be done without locking and
unlocking on each packet.

Furthermore, chances are that you can pass batches from one layer to the
next in this way, something that would not work if your workflow can
only handle one packet at a time.

And finally, the good thing is that the implementation can be
incremental and done on a case-by-case basis. The VALE bridge
(http://info.iet.unipi.it/~luigi/vale/) uses this strategy, and moving
batches instead of single packets brings the forwarding rate from 4 to
17 Mpps. At high rates, it really pays off.

cheers
luigi

> --
> Andre
>