Date:      Wed, 7 Aug 2013 00:55:39 +0200
From:      Luigi Rizzo <rizzo@iet.unipi.it>
To:        Andre Oppermann <andre@freebsd.org>
Cc:        Adrian Chadd <adrian@freebsd.org>, current@freebsd.org, Bryan Venteicher <bryanv@daemoninthecloset.org>, Navdeep Parhar <np@freebsd.org>, net@freebsd.org, Giuseppe Lettieri <g.lettieri@iet.unipi.it>
Subject:   Re: [net] protecting interfaces from races between control and data ?
Message-ID:  <CA+hQ2+jacQNKKkN4BbLeMQHuP1R7xJUXqNyBSr2Wn5Y3XXfbsQ@mail.gmail.com>
In-Reply-To: <520127C3.3020101@freebsd.org>
References:  <20130805082307.GA35162@onelab2.iet.unipi.it> <2034715395.855.1375714772487.JavaMail.root@daemoninthecloset.org> <CAJ-VmokT6YKPR7CXsoCavEmWv3W8urZu4eBVgKWaj9iMaVJFZg@mail.gmail.com> <CA+hQ2+huoCCweq7fjoYmH3nyhmhb5DzukEdPSMtaJEWa8Ft0JQ@mail.gmail.com> <51FFDD1E.1000206@FreeBSD.org> <CAJ-Vmo=Q9AqdBJ0+4AiX4+WreYuZx6VGGYw=MZ4XhMB1P2yMww@mail.gmail.com> <CA+hQ2+gZTGmrBKTOAeFnNma4DQXbAy_y8NZrovpWqm_5BJTWhQ@mail.gmail.com> <5200136C.8000201@freebsd.org> <20130805215319.GA43271@onelab2.iet.unipi.it> <520127C3.3020101@freebsd.org>

thanks for the explanations and for experimenting with the various
alternatives.

I started this thread just to understand whether
something was already in place, and to make sure that what I
do with netmap is not worse than the situation we have now.

I guess that until the best solution comes out, a quick and
non-intrusive workaround is to at least follow the approach that
Bryan suggested.

cheers
luigi

On Tue, Aug 6, 2013 at 6:43 PM, Andre Oppermann <andre@freebsd.org> wrote:

> On 05.08.2013 23:53, Luigi Rizzo wrote:
>
>> On Mon, Aug 05, 2013 at 11:04:44PM +0200, Andre Oppermann wrote:
>>
>>> On 05.08.2013 19:36, Luigi Rizzo wrote:
>>>
>> ...
>>
>>>
>>> [picking a post at random to reply in this thread]
>>>
>>>> tell whether or not we should bail out).
>>>>
>>>
>>> Ideally we don't want to have any locks in the RX and TX path at all.
>>>
>>
>> OK, I have read your proposal below but there are a few things
>> I do not completely understand -- essentially I do not understand
>> whether the removal of IFF_DRV_RUNNING (or equivalent) forces you
>> to replace the ifp->new_tx_function() with something else when you
>> want to do an "ifconfig down".
>>
>
> Sorry, it's IFF_DRV_OACTIVE that'll go away, not IFF_DRV_RUNNING.
> It's of no use with multi-queue anyways.  Though we may get more
> differentiated states for interface availability.
>
>
>> Specifically, here are my doubts:
>>
>> - Considering that the rxq lock is rarely contended
>>    (only when the control path is active, which in your approach
>>    below requires killing and restarting the ithread),
>>    and is acquired/released only once per interrupt/batch,
>>    I am under the impression that the per-queue RX lock is
>>    not a performance problem and makes it easier to reason
>>    about the code (and does not require a different approach
>>    for MSI-x and other cases).
>>
>
> There are two important things here: 1) there must only be one
> (i)thread per RX queue to prevent lock contention; 2) even with
> a non-contended lock there are effects felt by the other cores
> or CPUs like bus lock cycles which are considerable.  Stopping
> and restarting the ithread/taskqueue in those cases that require
> more involved (hardware) reconfiguration isn't much of an overhead
> compared to per-packet or per-packet-batch locking in the hot path.
>
>
>> - in the tx datapath, do you want to acquire the txq lock
>>    before or after calling ifp->new_tx_function() ?
>>    (btw, how does it compare to ifp->if_start or ifp->if_transmit?)
>>
>
> Struct ifnet is going to be split into two parts, the stack owned
> part and the driver owned part.  The stack will never fumble the
> driver owned parts and the driver must only change the stack owned
> parts through accessor functions.  (This split has also been proposed
> by Juniper quite some time ago but for different reasons).
>
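For illustration, a minimal sketch of what such a split could look like.
All of the structure, field and function names below are invented for
this example; they are not taken from the tree or from Andre's branch.

    /*
     * Illustrative sketch only: none of these names exist in the tree.
     * The stack-owned half is touched by the driver only via accessors;
     * the driver-owned half is never touched by the stack.
     */
    #include <sys/param.h>
    #include <sys/lock.h>
    #include <sys/rmlock.h>

    struct mbuf;
    struct drv_txq;

    struct ifnet_stack {                    /* owned by the network stack */
            uint32_t        if_flags;
            struct rmlock   if_txlock;      /* keeps this part stable during TX */
            int           (*if_tx)(struct ifnet_stack *, struct mbuf *);
            int           (*if_tx_driver)(struct ifnet_stack *, struct mbuf *);
                                            /* saved driver function, see ALTQ below */
    };

    struct ifnet_driver {                   /* owned by the driver */
            void            *if_softc;
            int              if_ntxq;       /* number of TX queues */
            struct drv_txq  *if_txq;        /* per-queue rings and locks */
    };

    /* The stack-owned part is only ever changed through accessors. */
    static inline void
    if_stack_setflagbits(struct ifnet_stack *ifs, uint32_t set, uint32_t clr)
    {
            ifs->if_flags = (ifs->if_flags | set) & ~clr;
    }
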
> The driver supplies a TX frame transmit function (mostly like
> if_transmit today) which does all locking and multi-queue handling
> internally (driver owned).  This gives driver writers the freedom to
> better adjust to different hardware communication methods as they
> become available, like we witnessed a couple of times in the past.
>
> If the driver is multi-queue capable it takes the rss-hash from the
> packet header to select an appropriate queue which may have its own
> dedicated lock.  If there is only one queue then it will obtain that
> lock which may see some contention when multiple cores try to send at
> the same time.  The driver may do more extensive locking, however that
> likely comes at the expense of performance.
>
> Typically on modern NICs the TX function will be a thin locked wrapper
> around the DMA enqueue function.  For older or other kinds of interfaces
> it may implement full internal soft queuing (as the IFQ did).  An
> efficient generic implementation of those will be provided for drivers
> to make use of.
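
To make that concrete, here is a hedged sketch of such a thin,
per-queue-locked transmit path; drv_encap() and the structure layout are
hypothetical:

    #include <sys/param.h>
    #include <sys/lock.h>
    #include <sys/mutex.h>
    #include <sys/mbuf.h>

    struct drv_txq {
            struct mtx      txq_mtx;        /* per-queue lock */
            /* DMA ring state would live here */
    } __aligned(CACHE_LINE_SIZE);           /* keep each queue on its own cache line */

    struct drv_softc {
            struct drv_txq  *sc_txq;        /* array of TX queues */
            int              sc_ntxq;
    };

    /* Hypothetical helper that puts one mbuf onto the queue's DMA ring. */
    int drv_encap(struct drv_txq *, struct mbuf *);

    static int
    drv_transmit(struct drv_softc *sc, struct mbuf *m)
    {
            struct drv_txq *q;
            int error;

            /* The RSS hash keeps all packets of a session on the same queue. */
            q = &sc->sc_txq[m->m_pkthdr.flowid % sc->sc_ntxq];

            mtx_lock(&q->txq_mtx);          /* only lock taken in the hot path */
            error = drv_encap(q, m);        /* thin wrapper around the ring */
            mtx_unlock(&q->txq_mtx);
            return (error);
    }

A single-queue driver would simply have sc_ntxq == 1, so every sender
contends on that one lock, as described above.
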
>
>
> >    If you do it within the tx handler then you lose the ability
> >    of replacing the handler when you do a reconfiguration,
> >    because grabbing the tx lock in the control path does not tell
> >    you whether anyone is in the handler.
> >    If you do it outside, then the driver should also expose the locks,
> >    or some locking function.
>
> The stack calls the transmit function without any driver-specific locks
> held.  It'll make sure though that the stack-ifnet doesn't go away in
> between, probably by using cheap rmlocks.
>
> The driver's control path gets a function to ensure that the stack has
> drained all writers and it is safe to reconfigure (as in
> callout_drain()).  Not all hardware and control path changes necessarily
> require a reinitialization.
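
One possible shape for that pair, sketched with an rmlock and reusing the
hypothetical if_txlock field from the ifnet_stack sketch above:

    #include <sys/param.h>
    #include <sys/lock.h>
    #include <sys/rmlock.h>

    /* Hot path: cheap shared acquisition around the driver's TX function. */
    static int
    if_stack_transmit(struct ifnet_stack *ifs, struct mbuf *m)
    {
            struct rm_priotracker tracker;
            int error;

            rm_rlock(&ifs->if_txlock, &tracker);    /* read side, uncontended */
            error = ifs->if_tx(ifs, m);
            rm_runlock(&ifs->if_txlock, &tracker);
            return (error);
    }

    /* Control path: returns only once all in-flight transmitters are gone. */
    static void
    if_stack_tx_drain(struct ifnet_stack *ifs)
    {
            rm_wlock(&ifs->if_txlock);
            rm_wunlock(&ifs->if_txlock);
    }

Whether an rmlock or some other scheme is used here is exactly one of the
details Andre says is still being worked out.
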
>
> The stack-ifnet shadows some information, like interface state, and may
> do its own independent SMP optimizations to avoid contention.
>
>
>> Overall, it seems to me that keeping the IFF_DRV_RUNNING flag is a
>> lot more practical, not to mention that it requires fewer
>> modifications to the code.
>>
>
> Generally we want to optimize for the fast packet path, reduce any
> friction we can, and take a hit on reconfig being more "expensive" as
> it is much less frequent.
>
> Mind you, not all of this is worked out in detail yet and it may be
> subject to further changes and refinements as more benchmarking of
> different approaches is performed.
>
> --
> Andre
>
> [PS: I'm currently on summer vacation and am writing this while I have access]
>
>
>> [description of Andre's proposal below, for reference]
>>
>>         cheers
>>         luigi
>>
>>> Every queue has its own separate RX interrupt vector which is serviced
>>> by a single dedicated ithread (or taskqueue) which does while(1) for
>>> work on the RX ring, only going to sleep when it reaches the end of
>>> work on the ring.  It is then woken up by an interrupt to resume work.
>>> To prevent a live-lock on the CPU it would periodically yield to allow
>>> other kernel and user-space threads to run in between.  Each queue's
>>> ithread (taskqueue) can be pinned to a particular core or float with a
>>> certain stickiness.  To reconfigure the card (when the RX ring is
>>> affected) the ithread (taskqueue) is terminated and after the changes
>>> restarted again.  This is done by the control path and allows us to
>>> run RX completely lock free because there is only ever one ithread
>>> accessing the RX queue, doing away with the need for further locking
>>> synchronization.
>>>
>>
>> ok, I admit I did not think of killing and restarting the ithread,
>> but I wonder how
>>
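For reference, a rough sketch of the terminate-and-restart RX worker
described above; the per-queue structure and the rxeof/sleep helpers are
invented for illustration:

    #include <sys/param.h>
    #include <sys/kthread.h>
    #include <sys/proc.h>

    struct drv_rxq {
            volatile int    rxq_stopping;   /* set by the control path */
            /* RX ring state would live here */
    };

    /* Hypothetical helpers: process one batch, re-arm the IRQ, sleep. */
    int  drv_rxeof(struct drv_rxq *);
    void drv_rx_intr_rearm(struct drv_rxq *);
    void drv_rxq_sleep(struct drv_rxq *);

    /*
     * The only thread that ever touches this RX ring, so the ring itself
     * needs no lock.  The control path terminates and later recreates the
     * thread when the ring has to be reconfigured.
     */
    static void
    drv_rxq_thread(void *arg)
    {
            struct drv_rxq *rxq = arg;

            while (!rxq->rxq_stopping) {
                    while (drv_rxeof(rxq) != 0)     /* work until ring is empty */
                            maybe_yield();          /* don't live-lock the CPU */
                    drv_rx_intr_rearm(rxq);
                    drv_rxq_sleep(rxq);             /* woken by the next interrupt */
            }
            kthread_exit();
    }
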
>>> RX with MSI or legacy interrupts:
>>>
>>> Here multi-queue is impeded because of some synchronization issues.
>>> Depending on the hardware synchronization model the ithreads again do
>>> while(1) but have to be made runnable by the interrupt handler when
>>> new work has been indicated.
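
In that model the interrupt handler itself stays tiny; a sketch, with the
softc fields and the mask helper being hypothetical:

    #include <sys/param.h>
    #include <sys/bus.h>
    #include <sys/taskqueue.h>

    /*
     * Legacy/MSI interrupt handler: its only job is to make the RX worker
     * runnable; all actual work happens in the taskqueue thread.
     */
    static void
    drv_legacy_intr(void *arg)
    {
            struct drv_softc *sc = arg;

            drv_intr_disable(sc);           /* hypothetical: mask the IRQ */
            taskqueue_enqueue(sc->sc_rx_tq, &sc->sc_rx_task);
    }
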
>>>
>>> TX in general:
>>>
>>> This is a bit more tricky as we have the hard requirement of packet
>>> ordering for individual sessions (tcp and others).  That means we must
>>> use the same queue for all packets of the same session.  This makes
>>> running TX non-locked a bit tricky because when we pin a TX queue to a
>>> core we must switch to that core first before adding anything to the
>>> queue.  If the packet was generated on that core there is no overhead,
>>> OTOH if not there is the scheduler overhead and some cache losses.
>>> Ensuring that the entire TX path, possibly including user-space (write
>>> et al) happens on the same core is difficult and may have its own
>>> inefficiencies.  The number of TX queues vs. the number of cores is
>>> another point of contention.  To make the 1:1 scheme work well, one
>>> must have as many queues as cores.
>>>
>>> A more scalable and generic approach doesn't bind TX queues to any
>>> particular core and covers each with its own lock to protect the TX
>>> ring enqueue.  To prevent false cache line sharing of the locks, each
>>> TX structure should be on its own cache line.  As each session has an
>>> affinity hash, sessions become distributed across the TX queues,
>>> significantly reducing contention.
>>>
>>> The TX complete handling freeing the mbuf(s) is done the same way as
>>> for the RX ring, with a dedicated ithread (taskqueue) for each.  Also,
>>> bpf should hook in here (the packet has been sent) instead of at TX
>>> enqueue time.
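
A TX-completion taskqueue handler along those lines could look roughly
like this, assuming the hypothetical drv_softc above also keeps the usual
sc_ifp backpointer; drv_txeof_next() is an invented helper returning the
next completed mbuf:

    #include <sys/param.h>
    #include <sys/mbuf.h>
    #include <sys/socket.h>
    #include <net/if.h>
    #include <net/if_var.h>
    #include <net/bpf.h>

    struct mbuf *drv_txeof_next(struct drv_softc *);   /* hypothetical */

    static void
    drv_txeof_task(void *arg, int pending __unused)
    {
            struct drv_softc *sc = arg;
            struct mbuf *m;

            while ((m = drv_txeof_next(sc)) != NULL) {
                    BPF_MTAP(sc->sc_ifp, m);    /* tap only once it is on the wire */
                    m_freem(m);
            }
    }
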
>>>
>>> The whole concept of IFF_DRV_RUNNING will go away along with the IFQ
>>> macros.  Each driver then provides a direct TX function pointer which
>>> is put into a decoupled ifnet TX field for use by the stack.  Under
>>> normal operation all interface TX goes directly into the TX DMA ring.
>>> The particular multi-queue and locking model is decided by the driver.
>>> The kernel provides a couple of common optimized abstractions for use
>>> by all drivers that want/need it, to avoid code and functionality
>>> duplication.  When things like ALTQ are activated on an interface, the
>>> ifnet TX function pointer is replaced with the appropriate intermediate
>>> function pointer which eventually will call the driver's TX function.
>>> The intermediate TX packet handler (ALTQ) can do its own additional
>>> locking on top of the driver's TX locking.
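
Using the hypothetical ifnet_stack from the earlier sketch, splicing an
intermediate handler such as ALTQ in front of the driver could look
roughly like this:

    /*
     * Intermediate handler: classify/queue under ALTQ's own lock, then
     * hand the packet down to the saved driver function.
     */
    static int
    altq_tx_wrapper(struct ifnet_stack *ifs, struct mbuf *m)
    {
            /* ALTQ classification and queueing would happen here. */
            return (ifs->if_tx_driver(ifs, m));
    }

    /* Called when ALTQ is enabled on the interface. */
    static void
    altq_splice_tx(struct ifnet_stack *ifs)
    {
            ifs->if_tx_driver = ifs->if_tx;     /* remember the driver's function */
            ifs->if_tx = altq_tx_wrapper;       /* the stack now calls the wrapper */
    }

Deactivation would simply restore the saved pointer.
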
>>>
>>> Control path:
>>>
>>> The control path is separate from the high-speed TX and RX data paths.
>>> It has its own overall lock.  Multiple locks typically don't make sense
>>> here.  If it needs to modify, reconfigure, or newly set up the TX or RX
>>> rings, then it has to stop the RX ithreads (taskqueues) and take the TX
>>> queue locks before it can do that.  After the changes are made and
>>> stable, packet processing can continue.
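
Sketched with the hypothetical driver structures from above (assuming
drv_softc also carries sc_nrxq/sc_rxq for its RX queues, and with invented
stop/start and hardware helpers), such a reconfiguration would follow this
pattern:

    /* Hypothetical helpers that terminate / recreate the RX ithreads
     * and reprogram the rings in hardware. */
    void drv_rxq_stop(struct drv_rxq *);
    void drv_rxq_start(struct drv_rxq *);
    void drv_program_rings(struct drv_softc *);

    static void
    drv_reinit_rings(struct drv_softc *sc)
    {
            int i;

            /* Quiesce the data path first. */
            for (i = 0; i < sc->sc_nrxq; i++)
                    drv_rxq_stop(&sc->sc_rxq[i]);       /* terminate the RX ithread */
            for (i = 0; i < sc->sc_ntxq; i++)
                    mtx_lock(&sc->sc_txq[i].txq_mtx);   /* block all transmitters */

            drv_program_rings(sc);                      /* nobody touches the rings now */

            for (i = 0; i < sc->sc_ntxq; i++)
                    mtx_unlock(&sc->sc_txq[i].txq_mtx);
            for (i = 0; i < sc->sc_nrxq; i++)
                    drv_rxq_start(&sc->sc_rxq[i]);      /* recreate the RX ithread */
    }
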
>>>
>>> I've adjusted / heavily modified fxp, bge and igb drivers in my
>>> tcp_workqueue branch to test these concepts.  Not all of it is
>>> committed or fully up to date.
>>>
>>> --
>>> Andre
>>>
>>>
>>
>>
>


-- 
-----------------------------------------+-------------------------------
 Prof. Luigi RIZZO, rizzo@iet.unipi.it  . Dip. di Ing. dell'Informazione
 http://www.iet.unipi.it/~luigi/        . Universita` di Pisa
 TEL      +39-050-2211611               . via Diotisalvi 2
 Mobile   +39-338-6809875               . 56122 PISA (Italy)
-----------------------------------------+-------------------------------


