From: John Baldwin <jhb@freebsd.org>
To: net@freebsd.org
Cc: Ed Maste, Navdeep Parhar
Date: Thu, 19 Jan 2012 11:41:25 -0500
Subject: Latency issues with buf_ring

The current buf_ring usage in various NIC drivers has a race that can
result in really high packet latencies in some cases.  Specifically, the
common pattern in an if_transmit routine is to use a try-lock on the
queue; if that fails, enqueue the packet in the buf_ring and return.
The race, of course, is that the thread holding the lock might have just
finished checking the buf_ring, found it empty, and be in the process of
releasing the lock when the original thread fails the try-lock.  If this
happens, the packet queued by the first thread is stalled until another
thread tries to transmit packets for that queue.  Some drivers attempt
to handle this race (igb(4) schedules a task to kick the transmit queue
if the try-lock fails) and others don't (cxgb(4) doesn't handle it at
all).  At work this race was triggered very often after upgrading from 7
to 8 with bursty traffic and caused numerous problems, so it is not a
rare occurrence and needs to be addressed.  (Note: all patches mentioned
are against 8.)

The first hack I tried was to simply always lock the queue after the
drbr enqueue if the try-lock failed and then drain the queue if needed
(www.freebsd.org/~jhb/patches/try_fail.patch).  While this fixed my
latency problems, it would seem that this breaks other workloads that
the drbr design is trying to optimize.

After further hacking, what I came up with was a variant of
drbr_enqueue() that atomically sets a 'pending' flag during the enqueue
operation.  The first thread to fail the try-lock sets this flag (it is
told that it set the flag by a new return value (EINPROGRESS) from the
enqueue call).  The pending thread then explicitly clears the flag once
it acquires the queue lock.  This should prevent multiple threads from
stacking up on the queue lock: if multiple threads are dumping packets
into the ring concurrently, all but two (the one currently draining the
queue and the one waiting for the lock) can enqueue their packets and
return without blocking.  A sketch of the resulting flow follows.
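To make the flow concrete, here is a minimal sketch of a
try_fail3-style if_transmit.  This is a sketch under stated
assumptions, not the patch itself: the example_* structures and
routines and drbr_clear_pending() are hypothetical stand-ins, and the
real interface is whatever try_fail3.patch provides; only the mtx(9)
calls and the stock drbr_enqueue() signature are the in-tree APIs.

/*
 * Kernel context: sys/param.h, sys/mbuf.h, sys/mutex.h, sys/buf_ring.h,
 * net/if.h, net/if_var.h.  All example_* names are hypothetical.
 */
struct example_txq {
	struct mtx	 txq_mtx;
	struct buf_ring	*txq_br;
};

struct example_softc {
	struct example_txq sc_txq[1];
};

static void example_txq_drain(struct example_softc *, struct example_txq *);

static int
example_if_transmit(struct ifnet *ifp, struct mbuf *m)
{
	struct example_softc *sc = ifp->if_softc;
	struct example_txq *txq = &sc->sc_txq[0];	/* queue selection elided */
	int error;

	if (mtx_trylock(&txq->txq_mtx)) {
		/* Uncontended path: enqueue and drain under the lock. */
		error = drbr_enqueue(ifp, txq->txq_br, m);
		if (error == 0)
			example_txq_drain(sc, txq);
		mtx_unlock(&txq->txq_mtx);
		return (error);
	}

	/*
	 * The try-lock failed.  The unpatched pattern would just enqueue
	 * and return here, racing with a lock holder that may already be
	 * past its final "ring empty?" check.  With the patched enqueue,
	 * the first thread to hit this path is told via EINPROGRESS that
	 * it set the 'pending' flag and must block on the lock and drain.
	 */
	error = drbr_enqueue(ifp, txq->txq_br, m);
	if (error != EINPROGRESS)
		return (error);	/* 0: another thread already has pending duty */

	mtx_lock(&txq->txq_mtx);
	drbr_clear_pending(txq->txq_br);	/* hypothetical clear call */
	example_txq_drain(sc, txq);
	mtx_unlock(&txq->txq_mtx);
	return (0);
}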
One downside of this approach, though, is that each driver has to be
changed to make an explicit call to clear the pending flag after
grabbing the queue lock if the try-lock fails.  This is what I am
currently running in production
(www.freebsd.org/~jhb/patches/try_fail3.patch).  However, this still
results in a lot of duplicated code in each driver that wants to
support multiq.

Several folks have expressed a desire to move in a direction where the
stack has explicit knowledge of transmit queues, allowing us to hoist
some of this duplicated code out of the drivers and up into the calling
layer.  After discussing this a bit with Navdeep (np@), the approach I
am looking at is to alter the buf_ring code flow a bit to more closely
model the older code flow with IFQ and if_start methods.  That is, have
the if_transmit methods always enqueue each packet that arrives to the
buf_ring and then call an if_start-like method that drains a specific
transmit queue.  This simplifies a fair bit of driver code and means we
can potentially move the enqueue, etc. bits up into the calling layer
and instead have drivers provide the per-transmit-queue start routine
as a direct function pointer to the upper layers, a la if_start.

However, we would still need a way to close the latency race.  I've
attempted to do that by inverting my previous 'thread pending' flag.
Instead, I make the buf_ring store a 'busy' flag.  This flag is managed
by the single-consumer buf_ring dequeue method (which drbr_dequeue()
uses).  It is set to true when a packet is removed from the queue while
there are more packets pending.  Conversely, if there are no other
pending packets, it is set to false.  The assumption is that once a
thread starts draining the queue, it will not stop until the queue is
empty (or, if it has to stop for some other reason such as the transmit
ring being full, the driver will restart draining of the queue until it
is empty, e.g. after it receives a transmit completion interrupt).

Now when the if_transmit routine enqueues the packet, it will get
either a real error, 0 if the packet was enqueued and the queue was
idle, or EINPROGRESS if the packet was enqueued and the queue was busy.
For the EINPROGRESS case the if_transmit routine just returns success.
For the 0 case it does a blocking lock on the queue lock and calls the
queue's start routine (note that this makes the busy flag similar to
the old OACTIVE interface flag).  This does mean that in some cases one
thread may be sending what was the last packet in the buf_ring while
holding the lock when another thread blocks; the first thread will see
the new packet when it loops back around, so the second thread is
wasting its time spinning.  But in the common case I believe it will
give the same parallelism as the current code.  A sketch of this flow
is below.

OTOH, there is nothing to prevent multiple threads from "stacking up"
in the new approach.  At least the try_fail3 patch ensured that only
one thread at a time would ever potentially block on the queue lock.
Another approach might be to replace the 'busy' flag with the 'thread
pending' flag from try_fail3.patch, but to clear the 'thread pending'
flag any time the dequeue method is called rather than using an
explicit 'clear pending' method.  (I hadn't thought of that until
writing this e-mail.)  That would perhaps prevent multiple threads from
waiting on the queue lock.
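For concreteness, here is a sketch of an if_transmit under the proposed
'busy' scheme, reusing the example structures from the earlier sketch.
Again the example_* names are hypothetical and this is not the contents
of drbr.patch; only the return-value convention just described
(0 = enqueued, queue was idle; EINPROGRESS = enqueued, queue busy;
anything else = real error) is taken from the text above.

static int
example_if_transmit(struct ifnet *ifp, struct mbuf *m)
{
	struct example_softc *sc = ifp->if_softc;
	struct example_txq *txq = &sc->sc_txq[0];	/* queue selection elided */
	int error;

	error = drbr_enqueue(ifp, txq->txq_br, m);
	if (error == EINPROGRESS)
		return (0);	/* queue busy: the current drainer will see our packet */
	if (error != 0)
		return (error);	/* real error, e.g. ENOBUFS when the ring is full */

	/* Queue was idle: block on the lock and start it ourselves. */
	mtx_lock(&txq->txq_mtx);
	example_txq_start(sc, txq);	/* the per-queue if_start-like routine */
	mtx_unlock(&txq->txq_mtx);
	return (0);
}

Here example_txq_start() would loop over drbr_dequeue() until the ring
is empty or the hardware ring fills; since the 'busy' flag is
maintained inside that single-consumer dequeue path, the driver never
touches it explicitly (unlike the 'clear pending' call in try_fail3).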
Note that the 'busy' approach (or the modification I mentioned above)
does rely on the assumption I stated earlier: once a driver starts
draining a queue, it will drain it until empty unless it hits an
"error" condition (link went down, transmit ring full, etc.).  If it
hits an "error" condition, the driver is responsible for restarting
transmit when the condition clears.  I believe our drivers already work
this way now.

The 'busy' patch is at http://www.freebsd.org/~jhb/patches/drbr.patch

-- 
John Baldwin