Date:      Sun, 18 Nov 2012 22:49:39 +0100
From:      Andre Oppermann <andre@freebsd.org>
To:        Barney Cordoba <barney_cordoba@yahoo.com>
Cc:        freebsd-net@freebsd.org, John Baldwin <jhb@freebsd.org>
Subject:   Re: Latency issues with buf_ring
Message-ID:  <50A957F3.7020003@freebsd.org>
In-Reply-To: <1353259441.19423.YahooMailClassic@web121605.mail.ne1.yahoo.com>
References:  <1353259441.19423.YahooMailClassic@web121605.mail.ne1.yahoo.com>

On 18.11.2012 18:24, Barney Cordoba wrote:
>
>
> --- On Thu, 1/19/12, John Baldwin <jhb@freebsd.org> wrote:
>
>> From: John Baldwin <jhb@freebsd.org>
>> Subject: Latency issues with buf_ring
>> To: net@freebsd.org
>> Cc: "Ed Maste" <emaste@freebsd.org>, "Navdeep Parhar" <np@freebsd.org>
>> Date: Thursday, January 19, 2012, 11:41 AM
>> The current buf_ring usage in various
>> NIC drivers has a race that can
>> result in really high packet latencies in some cases.
>> Specifically,
>> the common pattern in an if_transmit routine is to use a
>> try-lock on
>> the queue and if that fails enqueue the packet in the
>> buf_ring and
>> return.  The race, of course, is that the thread
>> holding the lock
>> might have just finished checking the buf_ring and found it
>> empty and
>> be in the process of releasing the lock when the original
>> thread fails
>> the try lock.  If this happens, then the packet queued
>> by the first
>> thread will be stalled until another thread tries to
>> transmit packets
>> for that queue.  Some drivers attempt to handle this
>> race (igb(4)
>> schedules a task to kick the transmit queue if the try lock
>> fails) and
>> others don't (cxgb(4) doesn't handle it at all).  At
>> work this race
>> was triggered very often after upgrading from 7 to 8 with
>> bursty
>> traffic and caused numerous problems, so it is not a rare
>> occurrence
>> and needs to be addressed.
>>
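
For concreteness, the if_transmit pattern being described looks roughly like the
sketch below.  It is not taken from any particular driver; the foo_* structure
and foo_txq_drain() are illustrative, and only the drbr_*/mtx_* calls are the
real 8.x KPI.

#include <sys/param.h>
#include <sys/lock.h>
#include <sys/mutex.h>
#include <sys/mbuf.h>
#include <sys/socket.h>
#include <sys/buf_ring.h>
#include <net/if.h>
#include <net/if_var.h>

struct foo_txq {
        struct mtx      lock;   /* per-queue transmit lock */
        struct buf_ring *br;    /* software ring feeding the hw ring */
};

struct foo_softc {
        struct foo_txq  txq;    /* illustrative: a single queue */
};

static void foo_txq_drain(struct ifnet *ifp, struct foo_txq *txq);

/* The racy try-lock if_transmit pattern (sketch). */
static int
foo_transmit(struct ifnet *ifp, struct mbuf *m)
{
        struct foo_softc *sc = ifp->if_softc;
        struct foo_txq *txq = &sc->txq;
        int error;

        if (mtx_trylock(&txq->lock)) {
                /* We own the queue: enqueue and drain it ourselves. */
                error = drbr_enqueue(ifp, txq->br, m);
                foo_txq_drain(ifp, txq);
                mtx_unlock(&txq->lock);
                return (error);
        }
        /*
         * Another thread holds the lock: just enqueue and return.  The
         * race: that thread may already have found the ring empty and
         * be about to drop the lock, so this packet can sit in the
         * software ring until the next transmit attempt (or until a
         * scheduled task kicks the queue, in drivers that bother).
         */
        return (drbr_enqueue(ifp, txq->br, m));
}
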
>> (Note, all patches mentioned are against 8)
>>
>> The first hack I tried to use was to simply always lock the
>> queue after
>> the drbr enqueue if the try lock failed and then drain the
>> queue if
>> needed (www.freebsd.org/~jhb/patches/try_fail.patch).
>> While this fixed
>> my latency problems, it would seem that this breaks other
>> workloads
>> that the drbr design is trying to optimize.
>>
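
For illustration, the try_fail-style change described above turns the
failed-try-lock path of the earlier foo_transmit() sketch into roughly the
fragment below (the actual patch is at the URL above and may differ):

        /*
         * try_fail-style fallback (sketch): after losing the try-lock,
         * enqueue and then take the lock anyway so the packet cannot be
         * stranded.  The cost is that every sender that loses the
         * try-lock now blocks on the queue lock.
         */
        error = drbr_enqueue(ifp, txq->br, m);
        if (error != 0)
                return (error);
        mtx_lock(&txq->lock);
        foo_txq_drain(ifp, txq);
        mtx_unlock(&txq->lock);
        return (0);
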
>> After further hacking what I came up with was a variant of
>> drbr_enqueue()
>> that would atomically set a 'pending' flag during the
>> enqueue operation.
>> The first thread to fail the try lock sets this flag (it is
>> told that it
>> set the flag by a new return value (EINPROGRESS) from the
>> enqueue call).
>> The pending thread then explicitly clears the flag once it
>> acquires the
>> queue lock.  This should prevent multiple threads from
>> stacking up on the
>> queue lock so that if multiple threads are dumping packets
>> into the ring
>> concurrently all but two (the one draining the queue
>> currently and the
>> one waiting for the lock) can continue to drain the
>> queue.  One downside
>> of this approach though is that each driver has to be
>> changed to make
>> an explicit call to clear the pending flag after grabbing
>> the queue lock
>> if the try lock fails.  This is what I am currently
>> running in production
>> (www.freebsd.org/~jhb/patches/try_fail3.patch).
>>
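
In driver terms the try_fail3 handshake reads roughly as below, reusing the
illustrative foo_* names from the earlier sketch; drbr_clear_pending() is a
guessed name for the patch's explicit "clear pending" call, not an existing
KPI:

        /* try_fail3-style handshake (sketch). */
        if (mtx_trylock(&txq->lock)) {
                error = drbr_enqueue(ifp, txq->br, m);
                foo_txq_drain(ifp, txq);
                mtx_unlock(&txq->lock);
                return (error);
        }
        error = drbr_enqueue(ifp, txq->br, m);
        if (error == EINPROGRESS) {
                /*
                 * This thread was the first to fail the try-lock and so
                 * set the 'pending' flag; it must take the lock, clear
                 * the flag, and drain.  Later concurrent senders get 0
                 * back and return immediately, so at most one thread
                 * waits on the queue lock.
                 */
                mtx_lock(&txq->lock);
                drbr_clear_pending(txq->br);    /* hypothetical helper */
                foo_txq_drain(ifp, txq);
                mtx_unlock(&txq->lock);
                error = 0;
        }
        return (error);
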
>> However, this still results in a lot of duplicated code in
>> each driver
>> that wants to support multiq.  Several folks have
>> expressed a desire
>> to move in a direction where the stack has explicit
>> knowledge of
>> transmit queues allowing us to hoist some of this duplicated
>> code out
>> of the drivers and up into the calling layer.  After
>> discussing this a
>> bit with Navdeep (np@), the approach I am looking at is to
>> alter the
>> buf_ring code flow a bit to more closely model the older
>> code-flow
>> with IFQ and if_start methods.  That is, have the
>> if_transmit methods
>> always enqueue each packet that arrives to the buf_ring and
>> then to
>> call an if_start-like method that drains a specific transmit
>> queue.
>> This approach simplifies a fair bit of driver code and means
>> we can
>> potentially move the enqueue, etc. bits up into the calling
>> layer and
>> instead have drivers provide the per-transmit queue start
>> routine as
>> the direct function pointer to the upper layers ala
>> if_start.
>>
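
A rough sketch of that hoisted flow, with an illustrative gen_txq structure
and a driver-supplied per-queue start pointer (neither is an existing
interface), might be:

/*
 * Sketch of an if_start-like handoff: the calling layer always enqueues
 * and then invokes the driver's per-queue drain routine.
 */
struct gen_txq {
        struct mtx      lock;
        struct buf_ring *br;
        void            (*txq_start)(struct gen_txq *); /* driver drain */
};

static int
generic_txq_transmit(struct ifnet *ifp, struct gen_txq *txq, struct mbuf *m)
{
        int error;

        error = drbr_enqueue(ifp, txq->br, m);
        if (error != 0)
                return (error);
        /*
         * Kick the per-queue start routine; how and when it takes the
         * queue lock is the subject of the 'busy' flag discussion that
         * follows in the mail.
         */
        txq->txq_start(txq);
        return (0);
}
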
>> However, we would still need a way to close the latency
>> race.  I've
>> attempted to do that by inverting my previous 'thread
>> pending' flag.
>> Instead, I make the buf_ring store a 'busy' flag.  This
>> flag is
>> managed by the single-consumer buf_ring dequeue method
>> (that
>> drbr_dequeue() uses).  It is set to true when a packet
>> is removed from
>> the queue while there are more packets pending.
>> Conversely, if there
>> are no other pending packets then it is set to false.
>> The assumption
>> is that once a thread starts draining the queue, it will not
>> stop
>> until the queue is empty (or if it has to stop for some
>> other reason
>> such as the transmit ring being full, the driver will
>> restart draining
>> of the queue until it is empty, e.g. after it receives a
>> transmit
>> completion interrupt).  Now when the if_transmit
>> routine enqueues the
>> packet, it will get either a real error, 0 if the packet was
>> enqueued
>> and the queue was not idle, or EINPROGRESS if the packet was
>> enqueued
>> and the queue was busy.  For the EINPROGRESS case the
>> if_transmit
>> routine just returns success.  For the 0 case it does a
>> blocking lock
>> on the queue lock and calls the queue's start routine (note
>> that this
>> means that the busy flag is similar to the old OACTIVE
>> interface
>> flag).  This does mean that in some cases you may have
>> one thread that
>> is sending what was the last packet in the buf_ring holding
>> the lock
>> when another thread blocks, and that the first thread will
>> see the new
>> packet when it loops back around so that the second thread
>> is wasting
>> its time spinning, but in the common case I believe it will
>> give the
>> same parallelism as the current code.  OTOH, there is
>> nothing to
>> prevent multiple threads from "stacking up" in the new
>> approach.  At
>> least the try_fail3 patch ensured only one thread at a time
>> would ever
>> potentially block on the queue lock.
>>
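
Put together, the 'busy'-flag scheme would make the common if_transmit path
look roughly like this (sketch; the return-value convention is as described
above, the foo_* names are again illustrative, and the real change is in
drbr.patch):

        /*
         * 'busy' flag flow (sketch): drbr_enqueue() is assumed to
         * return EINPROGRESS when the queue already has an active
         * drainer.
         */
        error = drbr_enqueue(ifp, txq->br, m);
        if (error == EINPROGRESS)
                return (0);     /* a drainer is running and will see our packet */
        if (error != 0)
                return (error); /* real error, e.g. ENOBUFS */
        /*
         * Queue was idle: this thread becomes the drainer.  Blocking on
         * the lock here is what can let threads stack up, as noted.
         */
        mtx_lock(&txq->lock);
        foo_txq_drain(ifp, txq);        /* drains until empty or hw ring full */
        mtx_unlock(&txq->lock);
        return (0);
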
>> Another approach might be to replace the 'busy' flag with
>> the 'thread
>> pending' flag from try_fail3.patch, but to clear the 'thread
>> pending'
>> flag anytime the dequeue method is called rather than using
>> an
>> explicit 'clear pending' method.  (Hadn't thought of
>> that until
>> writing this e-mail.)  That would prevent multiple
>> threads from
>> waiting on the queue lock perhaps.
>>
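
That variant could be as simple as having the single-consumer dequeue path
drop the flag itself, e.g. (sketch; the 'pending' field and its placement are
illustrative, and it is shown as a wrapper around drbr_dequeue() for brevity,
whereas the real change would clear the flag inside the dequeue method):

/*
 * Variant sketch: clear the 'thread pending' flag on every dequeue so
 * no explicit driver-side clear call is needed and at most one later
 * sender blocks on the queue lock.  Assumes an illustrative u_int
 * 'pending' stored next to the ring.
 */
static struct mbuf *
foo_txq_dequeue(struct ifnet *ifp, struct foo_txq *txq)
{
        atomic_store_rel_int(&txq->pending, 0);
        return (drbr_dequeue(ifp, txq->br));
}
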
>> Note that the 'busy' approach (or the modification I
>> mentioned above)
>> does rely on the assumption I stated above, i.e. once a
>> driver starts
>> draining a queue, it will drain it until empty unless it
>> hits an
>> "error" condition (link went down, transmit ring full,
>> etc.).  If it
>> hits an "error" condition, the driver is responsible for
>> restarting
>> transmit when the condition clears.  I believe our
>> drivers already
>> work this way now.
>>
>> The 'busy' patch is at http://www.freebsd.org/~jhb/patches/drbr.patch
>>
>> --
>> John Baldwin
>
> Q1: Has this been corrected?

I don't know.

> Q2: Are there any case studies or benchmarks for buf_ring, or is it just
> blindly being used because someone claimed it was better and offered it
> for free? One of the points of locking is to avoid race conditions, so
> the fact that you have races in a supposed lock-less scheme seems more
> than just ironic.

At the moment our entire ifnet/driver handoff is not optimal.  The double
buffering between the DMA rings and the IFQ (be it the old one or buf_ring)
is especially troubling.  See "Buffer Bloat" for one aspect of it.  While
hacking up a couple of drivers for automatic irq/polling experiments I've
come up with a couple of ideas in this area that I'm implementing right
now.  When it's ready, meaningful benchmarks will be run to assess the
impact.

-- 
Andre



