From: Andre Oppermann <andre@freebsd.org>
Date: Sun, 18 Nov 2012 22:49:39 +0100
To: Barney Cordoba
Cc: freebsd-net@freebsd.org, John Baldwin
Subject: Re: Latency issues with buf_ring
Message-ID: <50A957F3.7020003@freebsd.org>
In-Reply-To: <1353259441.19423.YahooMailClassic@web121605.mail.ne1.yahoo.com>

On 18.11.2012 18:24, Barney Cordoba wrote:
>
> --- On Thu, 1/19/12, John Baldwin wrote:
>
>> From: John Baldwin
>> Subject: Latency issues with buf_ring
>> To: net@freebsd.org
>> Cc: "Ed Maste" , "Navdeep Parhar"
>> Date: Thursday, January 19, 2012, 11:41 AM
>>
>> The current buf_ring usage in various NIC drivers has a race that can result in really high packet latencies in some cases. Specifically, the common pattern in an if_transmit routine is to use a try-lock on the queue and if that fails enqueue the packet in the buf_ring and return. The race, of course, is that the thread holding the lock might have just finished checking the buf_ring and found it empty and be in the process of releasing the lock when the original thread fails the try lock. If this happens, then the packet queued by the first thread will be stalled until another thread tries to transmit packets for that queue. Some drivers attempt to handle this race (igb(4) schedules a task to kick the transmit queue if the try lock fails) and others don't (cxgb(4) doesn't handle it at all). At work this race was triggered very often after upgrading from 7 to 8 with bursty traffic and caused numerous problems, so it is not a rare occurrence and needs to be addressed.
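[Editor's illustration: a minimal sketch of the try-lock if_transmit pattern and the race window described above. The foo_* driver names, softc layout, and the foo_txq_drain() helper are hypothetical and not taken from any of the patches mentioned in this thread.]

/* Common (racy) multiqueue transmit pattern, simplified. */
static int
foo_transmit(struct ifnet *ifp, struct mbuf *m)
{
    struct foo_softc *sc = ifp->if_softc;
    struct foo_txq *txq = &sc->txq[0];  /* queue selection omitted */
    int error;

    if (mtx_trylock(&txq->mtx)) {
        /* We own the queue lock: enqueue and drain to the hardware. */
        error = drbr_enqueue(ifp, txq->br, m);
        if (error == 0)
            foo_txq_drain(txq);         /* hypothetical drain helper */
        mtx_unlock(&txq->mtx);
        return (error);
    }

    /*
     * Try-lock failed: enqueue and return.  The race window is here --
     * the lock holder may already have found the buf_ring empty and be
     * releasing the lock, so this mbuf can sit in the ring until some
     * later transmit (or a kick task, as igb(4) schedules) drains it.
     */
    return (drbr_enqueue(ifp, txq->br, m));
}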
>> (Note, all patches mentioned are against 8)
>>
>> The first hack I tried was to simply always lock the queue after the drbr enqueue if the try lock failed and then drain the queue if needed (www.freebsd.org/~jhb/patches/try_fail.patch). While this fixed my latency problems, it would seem that this breaks other workloads that the drbr design is trying to optimize.
>>
>> After further hacking, what I came up with was a variant of drbr_enqueue() that would atomically set a 'pending' flag during the enqueue operation. The first thread to fail the try lock sets this flag (it is told that it set the flag by a new return value (EINPROGRESS) from the enqueue call). The pending thread then explicitly clears the flag once it acquires the queue lock. This should prevent multiple threads from stacking up on the queue lock, so that if multiple threads are dumping packets into the ring concurrently, all but two (the one currently draining the queue and the one waiting for the lock) can continue without blocking. One downside of this approach, though, is that each driver has to be changed to make an explicit call to clear the pending flag after grabbing the queue lock if the try lock fails. This is what I am currently running in production (www.freebsd.org/~jhb/patches/try_fail3.patch).
>>
>> However, this still results in a lot of duplicated code in each driver that wants to support multiq. Several folks have expressed a desire to move in a direction where the stack has explicit knowledge of transmit queues, allowing us to hoist some of this duplicated code out of the drivers and up into the calling layer. After discussing this a bit with Navdeep (np@), the approach I am looking at is to alter the buf_ring code flow a bit to more closely model the older code-flow with IFQ and if_start methods. That is, have the if_transmit methods always enqueue each packet that arrives to the buf_ring and then call an if_start-like method that drains a specific transmit queue. This approach simplifies a fair bit of driver code and means we can potentially move the enqueue, etc. bits up into the calling layer and instead have drivers provide the per-transmit-queue start routine as the direct function pointer to the upper layers, ala if_start.
>>
>> However, we would still need a way to close the latency race. I've attempted to do that by inverting my previous 'thread pending' flag. Instead, I make the buf_ring store a 'busy' flag. This flag is managed by the single-consumer buf_ring dequeue method (that drbr_dequeue() uses). It is set to true when a packet is removed from the queue while there are more packets pending. Conversely, if there are no other pending packets then it is set to false. The assumption is that once a thread starts draining the queue, it will not stop until the queue is empty (or, if it has to stop for some other reason such as the transmit ring being full, the driver will restart draining of the queue until it is empty, e.g. after it receives a transmit completion interrupt). Now when the if_transmit routine enqueues the packet, it will get either a real error, 0 if the packet was enqueued and the queue was not busy, or EINPROGRESS if the packet was enqueued and the queue was busy. For the EINPROGRESS case the if_transmit routine just returns success. For the 0 case it does a blocking lock on the queue lock and calls the queue's start routine (note that this means that the busy flag is similar to the old OACTIVE interface flag).
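[Editor's illustration: a rough sketch of how the 'busy' scheme described above could look from a driver's if_transmit routine, together with the dequeue-side bookkeeping. The drbr_enqueue_busy() name, the txq->busy field, and the foo_* helpers are assumptions made for this sketch; the real interface is in the drbr.patch linked below.]

/*
 * Hypothetical enqueue variant: enqueues the mbuf and reports the state
 * of the ring's 'busy' flag.  Returns 0 if the queue was idle (caller
 * must start draining), EINPROGRESS if a drain is already in progress,
 * or a real error such as ENOBUFS.
 */
static int
foo_transmit(struct ifnet *ifp, struct mbuf *m)
{
    struct foo_softc *sc = ifp->if_softc;
    struct foo_txq *txq = &sc->txq[0];  /* queue selection omitted */
    int error;

    error = drbr_enqueue_busy(ifp, txq->br, m);  /* hypothetical name */
    if (error == EINPROGRESS)
        return (0);     /* someone else is draining; they will see our mbuf */
    if (error != 0)
        return (error); /* real error, e.g. ENOBUFS */

    /* Queue was idle: take the lock (blocking) and kick the drain. */
    mtx_lock(&txq->mtx);
    foo_txq_start(txq); /* drains until empty or the HW ring fills */
    mtx_unlock(&txq->mtx);
    return (0);
}

/*
 * Dequeue side: the single-consumer dequeue keeps the 'busy' flag in
 * sync -- busy while packets remain after a dequeue, idle once the ring
 * is empty.
 */
static struct mbuf *
foo_txq_dequeue(struct foo_txq *txq)
{
    struct mbuf *m;

    m = buf_ring_dequeue_sc(txq->br);
    atomic_store_rel_int(&txq->busy, !buf_ring_empty(txq->br));
    return (m);
}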
>> This does mean that in some cases you may have one thread that is sending what was the last packet in the buf_ring holding the lock when another thread blocks, and that the first thread will see the new packet when it loops back around so that the second thread is wasting its time spinning, but in the common case I believe it will give the same parallelism as the current code. OTOH, there is nothing to prevent multiple threads from "stacking up" in the new approach. At least the try_fail3 patch ensured only one thread at a time would ever potentially block on the queue lock.
>>
>> Another approach might be to replace the 'busy' flag with the 'thread pending' flag from try_fail3.patch, but to clear the 'thread pending' flag anytime the dequeue method is called rather than using an explicit 'clear pending' method. (Hadn't thought of that until writing this e-mail.) That would perhaps prevent multiple threads from waiting on the queue lock.
>>
>> Note that the 'busy' approach (or the modification I mentioned above) does rely on the assumption I stated above, i.e. once a driver starts draining a queue, it will drain it until empty unless it hits an "error" condition (link went down, transmit ring full, etc.). If it hits an "error" condition, the driver is responsible for restarting transmit when the condition clears. I believe our drivers already work this way now.
>>
>> The 'busy' patch is at http://www.freebsd.org/~jhb/patches/drbr.patch
>>
>> --
>> John Baldwin

> Q1: Has this been corrected?

I don't know.

> Q2: Are there any case studies or benchmarks for buf_ring, or is it just blindly being used because someone claimed it was better and offered it for free? One of the points of locking is to avoid race conditions, so the fact that you have races in a supposed lock-less scheme seems more than just ironic.

At the moment our entire ifnet/driver handoff is not optimal. In particular, the double buffering between the DMA rings and the IFQ (be it the old one or buf_ring) is troubling. See "Buffer Bloat" for one aspect of it.

While hacking up a couple of drivers to experiment with automatic irq/polling I've come up with a couple of ideas in this area that I'm implementing right now. When it's ready, meaningful benchmarks will be run to assess the impact.

-- 
Andre
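[Editor's illustration: a rough sketch of the 'thread pending' variant from John's quoted mail, with the flag cleared on dequeue as suggested there instead of via an explicit 'clear pending' call. The drbr_enqueue_pending() name and the foo_* driver bits are assumptions for this sketch, not the try_fail3.patch code.]

static int
foo_transmit(struct ifnet *ifp, struct mbuf *m)
{
    struct foo_softc *sc = ifp->if_softc;
    struct foo_txq *txq = &sc->txq[0];  /* queue selection omitted */
    int error;

    if (mtx_trylock(&txq->mtx)) {
        /* We own the queue lock: enqueue and drain as usual. */
        error = drbr_enqueue(ifp, txq->br, m);
        if (error == 0)
            foo_txq_drain(txq);
        mtx_unlock(&txq->mtx);
        return (error);
    }

    /*
     * Hypothetical variant of drbr_enqueue() that atomically sets a
     * 'pending' flag; EINPROGRESS tells us we were the thread that set
     * it, so only we go on to block on the queue lock.  Later threads
     * see the flag already set, get 0 back, and return immediately.
     */
    error = drbr_enqueue_pending(ifp, txq->br, m);
    if (error != EINPROGRESS)
        return (error);     /* enqueued (0) or a real error */

    mtx_lock(&txq->mtx);
    /* With clear-on-dequeue, no explicit 'clear pending' call is needed here. */
    foo_txq_drain(txq);
    mtx_unlock(&txq->mtx);
    return (0);
}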