Date:      Tue, 4 Dec 2012 11:08:17 -0500
From:      John Baldwin <jhb@freebsd.org>
To:        Barney Cordoba <barney_cordoba@yahoo.com>
Cc:        freebsd-net@freebsd.org
Subject:   Re: Latency issues with buf_ring
Message-ID:  <201212041108.17645.jhb@freebsd.org>
In-Reply-To: <1353259441.19423.YahooMailClassic@web121605.mail.ne1.yahoo.com>
References:  <1353259441.19423.YahooMailClassic@web121605.mail.ne1.yahoo.com>

On Sunday, November 18, 2012 12:24:01 pm Barney Cordoba wrote:
> 
> --- On Thu, 1/19/12, John Baldwin <jhb@freebsd.org> wrote:
> 
> > From: John Baldwin <jhb@freebsd.org>
> > Subject: Latency issues with buf_ring
> > To: net@freebsd.org
> > Cc: "Ed Maste" <emaste@freebsd.org>, "Navdeep Parhar" <np@freebsd.org>
> > Date: Thursday, January 19, 2012, 11:41 AM
> > The current buf_ring usage in various NIC drivers has a race that can
> > result in really high packet latencies in some cases.  Specifically,
> > the common pattern in an if_transmit routine is to use a try-lock on
> > the queue and, if that fails, enqueue the packet in the buf_ring and
> > return.  The race, of course, is that the thread holding the lock
> > might have just finished checking the buf_ring and found it empty and
> > be in the process of releasing the lock when the original thread
> > fails the try lock.  If this happens, then the packet queued by the
> > first thread will be stalled until another thread tries to transmit
> > packets for that queue.  Some drivers attempt to handle this race
> > (igb(4) schedules a task to kick the transmit queue if the try lock
> > fails) and others don't (cxgb(4) doesn't handle it at all).  At work
> > this race was triggered very often after upgrading from 7 to 8 with
> > bursty traffic and caused numerous problems, so it is not a rare
> > occurrence and needs to be addressed.
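> >
> > To make the window concrete, here is a rough sketch of the usual
> > pattern (hypothetical foo_* names and per-queue struct, not code from
> > any particular driver; the includes are just the usual driver
> > preamble):
> >
> >   #include <sys/param.h>
> >   #include <sys/systm.h>
> >   #include <sys/kernel.h>
> >   #include <sys/lock.h>
> >   #include <sys/mutex.h>
> >   #include <sys/mbuf.h>
> >   #include <sys/socket.h>
> >   #include <net/if.h>
> >   #include <net/if_var.h>
> >
> >   /* Hypothetical per-queue transmit state. */
> >   struct foo_txq {
> >   	struct mtx	 txq_mtx;
> >   	struct buf_ring	*txq_br;
> >   };
> >
> >   /* Drain the ring into the hardware; details elided. */
> >   static void
> >   foo_txq_drain(struct ifnet *ifp, struct foo_txq *txq)
> >   {
> >   	struct mbuf *m;
> >
> >   	mtx_assert(&txq->txq_mtx, MA_OWNED);
> >   	while ((m = drbr_dequeue(ifp, txq->txq_br)) != NULL) {
> >   		/* ... hand m to the hardware descriptor ring ... */
> >   		m_freem(m);	/* placeholder for the real transmit */
> >   	}
> >   }
> >
> >   static int
> >   foo_transmit(struct ifnet *ifp, struct foo_txq *txq, struct mbuf *m)
> >   {
> >   	int error;
> >
> >   	if (mtx_trylock(&txq->txq_mtx)) {
> >   		/* We own the queue: enqueue and drain it ourselves. */
> >   		error = drbr_enqueue(ifp, txq->txq_br, m);
> >   		foo_txq_drain(ifp, txq);
> >   		mtx_unlock(&txq->txq_mtx);
> >   		return (error);
> >   	}
> >
> >   	/*
> >   	 * Try lock failed: enqueue and return.  The window is here:
> >   	 * the lock holder may already have seen an empty ring and be
> >   	 * releasing the lock, so this packet can sit in the ring until
> >   	 * some later transmit happens to kick the queue.
> >   	 */
> >   	return (drbr_enqueue(ifp, txq->txq_br, m));
> >   }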
> > 
> > (Note, all patches mentioned are against 8)
> > 
> > The first hack I tried to use was to simply always lock the queue
> > after the drbr enqueue if the try lock failed and then drain the
> > queue if needed (www.freebsd.org/~jhb/patches/try_fail.patch).  While
> > this fixed my latency problems, it would seem that this breaks other
> > workloads that the drbr design is trying to optimize.
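> >
> > In terms of the sketch above, the hack boils down to something like
> > this (again just the shape of the idea, not the actual
> > try_fail.patch):
> >
> >   	if (mtx_trylock(&txq->txq_mtx))
> >   		error = drbr_enqueue(ifp, txq->txq_br, m);
> >   	else {
> >   		/*
> >   		 * Enqueue, then take the blocking lock so the packet
> >   		 * is always drained by somebody, at the cost of
> >   		 * contending on the queue lock.
> >   		 */
> >   		error = drbr_enqueue(ifp, txq->txq_br, m);
> >   		mtx_lock(&txq->txq_mtx);
> >   	}
> >   	foo_txq_drain(ifp, txq);
> >   	mtx_unlock(&txq->txq_mtx);
> >   	return (error);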
> > 
> > After further hacking, what I came up with was a variant of
> > drbr_enqueue() that atomically sets a 'pending' flag during the
> > enqueue operation.  The first thread to fail the try lock sets this
> > flag (it is told that it set the flag by a new return value
> > (EINPROGRESS) from the enqueue call).  The pending thread then
> > explicitly clears the flag once it acquires the queue lock.  This
> > should prevent multiple threads from stacking up on the queue lock,
> > so that if multiple threads are dumping packets into the ring
> > concurrently, all but two (the one currently draining the queue and
> > the one waiting for the lock) can enqueue their packets and return
> > without blocking.  One downside of this approach, though, is that
> > each driver has to be changed to make an explicit call to clear the
> > pending flag after grabbing the queue lock if the try lock fails.
> > This is what I am currently running in production
> > (www.freebsd.org/~jhb/patches/try_fail3.patch).
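> >
> > Roughly, the driver side of that looks like the following sketch,
> > where drbr_enqueue_pending() and drbr_clear_pending() are stand-in
> > names for the new calls in try_fail3.patch:
> >
> >   	if (mtx_trylock(&txq->txq_mtx)) {
> >   		error = drbr_enqueue(ifp, txq->txq_br, m);
> >   		foo_txq_drain(ifp, txq);
> >   		mtx_unlock(&txq->txq_mtx);
> >   		return (error);
> >   	}
> >
> >   	/*
> >   	 * Try lock failed: the enqueue variant elects at most one
> >   	 * waiter by returning EINPROGRESS to it.
> >   	 */
> >   	error = drbr_enqueue_pending(ifp, txq->txq_br, m);
> >   	if (error != EINPROGRESS)
> >   		return (error);	/* another thread is already pending */
> >
> >   	mtx_lock(&txq->txq_mtx);
> >   	drbr_clear_pending(txq->txq_br);	/* stand-in name */
> >   	foo_txq_drain(ifp, txq);
> >   	mtx_unlock(&txq->txq_mtx);
> >   	return (0);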
> > 
> > However, this still results in a lot of duplicated code in each
> > driver that wants to support multiq.  Several folks have expressed a
> > desire to move in a direction where the stack has explicit knowledge
> > of transmit queues, allowing us to hoist some of this duplicated code
> > out of the drivers and up into the calling layer.  After discussing
> > this a bit with Navdeep (np@), the approach I am looking at is to
> > alter the buf_ring code flow a bit to more closely model the older
> > code flow with IFQ and if_start methods.  That is, have the
> > if_transmit methods always enqueue each arriving packet to the
> > buf_ring and then call an if_start-like method that drains a specific
> > transmit queue.  This approach simplifies a fair bit of driver code
> > and means we can potentially move the enqueue, etc. bits up into the
> > calling layer and instead have drivers provide the per-transmit-queue
> > start routine as the direct function pointer to the upper layers, ala
> > if_start.
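> >
> > The shape of what could then live in the stack rather than in each
> > driver is something like this (entirely hypothetical names, just to
> > show the structure, and without the 'busy' optimization described
> > below):
> >
> >   /* Per-queue state the stack would own. */
> >   struct txq {
> >   	struct mtx	 txq_mtx;
> >   	struct buf_ring	*txq_br;
> >   	/* Driver-supplied drain routine, ala if_start but per queue. */
> >   	void		(*txq_start)(struct ifnet *, struct txq *);
> >   };
> >
> >   int
> >   txq_transmit(struct ifnet *ifp, struct txq *txq, struct mbuf *m)
> >   {
> >   	int error;
> >
> >   	/* Always enqueue first, then kick the driver's start routine. */
> >   	error = drbr_enqueue(ifp, txq->txq_br, m);
> >   	if (error != 0)
> >   		return (error);
> >   	mtx_lock(&txq->txq_mtx);
> >   	txq->txq_start(ifp, txq);
> >   	mtx_unlock(&txq->txq_mtx);
> >   	return (0);
> >   }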
> > 
> > However, we would still need a way to close the latency race.  I've
> > attempted to do that by inverting my previous 'thread pending' flag.
> > Instead, I make the buf_ring store a 'busy' flag.  This flag is
> > managed by the single-consumer buf_ring dequeue method (the one
> > drbr_dequeue() uses).  It is set to true when a packet is removed
> > from the queue while there are more packets pending.  Conversely, if
> > there are no other pending packets then it is set to false.  The
> > assumption is that once a thread starts draining the queue, it will
> > not stop until the queue is empty (or, if it has to stop for some
> > other reason such as the transmit ring being full, the driver will
> > restart draining of the queue until it is empty, e.g. after it
> > receives a transmit completion interrupt).  Now when the if_transmit
> > routine enqueues the packet, it will get either a real error, 0 if
> > the packet was enqueued and the queue was not busy, or EINPROGRESS if
> > the packet was enqueued and the queue was busy.  For the EINPROGRESS
> > case the if_transmit routine just returns success.  For the 0 case it
> > does a blocking lock on the queue lock and calls the queue's start
> > routine (note that this means the busy flag is similar to the old
> > OACTIVE interface flag).  This does mean that in some cases you may
> > have one thread holding the lock while it sends what was the last
> > packet in the buf_ring when another thread blocks; the first thread
> > will see the new packet when it loops back around, so the second
> > thread is wasting its time spinning.  In the common case, though, I
> > believe it will give the same parallelism as the current code.  OTOH,
> > there is nothing to prevent multiple threads from "stacking up" in
> > the new approach.  At least the try_fail3 patch ensured only one
> > thread at a time would ever potentially block on the queue lock.
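> >
> > With the 'busy' flag, the caller side of the stack-level transmit
> > sketch above would become roughly this (a sketch; the real change is
> > in the drbr.patch mentioned below):
> >
> >   	error = drbr_enqueue(ifp, txq->txq_br, m);
> >   	if (error == EINPROGRESS) {
> >   		/*
> >   		 * Enqueued, and a drain is already in progress; the
> >   		 * busy thread will pick this packet up.
> >   		 */
> >   		return (0);
> >   	}
> >   	if (error != 0)
> >   		return (error);
> >
> >   	/* Enqueued and the queue was idle: drain it ourselves. */
> >   	mtx_lock(&txq->txq_mtx);
> >   	txq->txq_start(ifp, txq);
> >   	mtx_unlock(&txq->txq_mtx);
> >   	return (0);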
> > 
> > Another approach might be to replace the 'busy' flag with the 'thread
> > pending' flag from try_fail3.patch, but to clear the 'thread pending'
> > flag any time the dequeue method is called rather than using an
> > explicit 'clear pending' method.  (Hadn't thought of that until
> > writing this e-mail.)  That would perhaps prevent multiple threads
> > from waiting on the queue lock.
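> >
> > That is, the existing drbr_dequeue() would become roughly the
> > following (a sketch; br_pending is a placeholder for wherever the
> > flag ends up living, and the ALTQ case is omitted):
> >
> >   static __inline struct mbuf *
> >   drbr_dequeue(struct ifnet *ifp, struct buf_ring *br)
> >   {
> >
> >   	/*
> >   	 * Whoever was elected 'pending' is about to have its packet
> >   	 * serviced by the current drainer, so drop the flag here
> >   	 * instead of in each driver.
> >   	 */
> >   	atomic_store_rel_int(&br->br_pending, 0);
> >   	return (buf_ring_dequeue_sc(br));
> >   }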
> > 
> > Note that the 'busy' approach (or the modification I
> > mentioned above)
> > does rely on the assumption I stated above, i.e. once a
> > driver starts
> > draining a queue, it will drain it until empty unless it
> > hits an
> > "error" condition (link went down, transmit ring full,
> > etc.).  If it
> > hits an "error" condition, the driver is responsible for
> > restarting
> > transmit when the condition clears.  I believe our
> > drivers already
> > work this way now.
> > 
> > The 'busy' patch is at http://www.freebsd.org/~jhb/patches/drbr.patch
> > 
> > -- 
> > John Baldwin
> 
> Q1: Has this been corrected?

No.  I've yet to be able to raise a meaningful discussion about possible
solutions to this.

> Q2: Are there any case studies or benchmarks for buf_ring, or is it just
> blindly being used because someone claimed it was better and offered it
> for free? One of the points of locking is to avoid race conditions, so the
> fact that you have races in a supposed lock-less scheme seems more than
> just ironic.

The buf_ring author claims it has benefits in high pps workloads.  I am not 
aware of any benchmarks, etc.

-- 
John Baldwin


