From owner-freebsd-net@FreeBSD.ORG Sun Nov 18 17:26:40 2012
From: Barney Cordoba <barney_cordoba@yahoo.com>
Date: Sun, 18 Nov 2012 09:24:01 -0800 (PST)
To: John Baldwin <jhb@freebsd.org>
Cc: freebsd-net@freebsd.org
Subject: Re: Latency issues with buf_ring
Message-ID: <1353259441.19423.YahooMailClassic@web121605.mail.ne1.yahoo.com>
In-Reply-To: <201201191141.25998.jhb@freebsd.org>
List-Id: Networking and TCP/IP with FreeBSD
--- On Thu, 1/19/12, John Baldwin <jhb@freebsd.org> wrote:

> From: John Baldwin <jhb@freebsd.org>
> Subject: Latency issues with buf_ring
> To: net@freebsd.org
> Cc: "Ed Maste" <emaste@freebsd.org>, "Navdeep Parhar" <np@freebsd.org>
> Date: Thursday, January 19, 2012, 11:41 AM
>
> The current buf_ring usage in various NIC drivers has a race that can
> result in really high packet latencies in some cases.  Specifically,
> the common pattern in an if_transmit routine is to use a try-lock on
> the queue and, if that fails, enqueue the packet in the buf_ring and
> return.  The race, of course, is that the thread holding the lock
> might have just finished checking the buf_ring and found it empty and
> be in the process of releasing the lock when the original thread fails
> the try-lock.  If this happens, then the packet queued by the first
> thread will be stalled until another thread tries to transmit packets
> for that queue.  Some drivers attempt to handle this race (igb(4)
> schedules a task to kick the transmit queue if the try-lock fails) and
> others don't (cxgb(4) doesn't handle it at all).  At work this race
> was triggered very often after upgrading from 7 to 8 with bursty
> traffic and caused numerous problems, so it is not a rare occurrence
> and needs to be addressed.
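[For readers following along: the try-lock pattern being described
looks roughly like the sketch below.  This is schematic only --
"struct txq" and drain_queue() are stand-ins for a driver's per-queue
state and start routine, not any driver's actual code.]

#include <sys/param.h>
#include <sys/lock.h>
#include <sys/mutex.h>
#include <sys/mbuf.h>
#include <sys/buf_ring.h>
#include <net/if.h>
#include <net/if_var.h>         /* drbr_enqueue() and friends */

struct txq {
        struct mtx       lock;  /* per-queue transmit lock */
        struct buf_ring *br;    /* software ring of pending mbufs */
};

static void drain_queue(struct ifnet *, struct txq *); /* driver's start routine */

static int
racy_if_transmit(struct ifnet *ifp, struct txq *txq, struct mbuf *m)
{
        int error;

        if (mtx_trylock(&txq->lock)) {
                /* We own the queue: enqueue and drain it ourselves. */
                error = drbr_enqueue(ifp, txq->br, m);
                if (error == 0)
                        drain_queue(ifp, txq);
                mtx_unlock(&txq->lock);
                return (error);
        }

        /*
         * Try-lock failed, so another thread holds the queue lock.
         * But that thread may already have seen an empty ring and be
         * on its way out; if so, this mbuf sits in the ring until the
         * next transmit attempt on this queue.  That is the race.
         */
        return (drbr_enqueue(ifp, txq->br, m));
}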
> (Note, all patches mentioned are against 8)
>
> The first hack I tried to use was to simply always lock the queue
> after the drbr enqueue if the try-lock failed and then drain the
> queue if needed (www.freebsd.org/~jhb/patches/try_fail.patch).  While
> this fixed my latency problems, it would seem that this breaks other
> workloads that the drbr design is trying to optimize.
>
> After further hacking, what I came up with was a variant of
> drbr_enqueue() that would atomically set a 'pending' flag during the
> enqueue operation.  The first thread to fail the try-lock sets this
> flag (it is told that it set the flag by a new return value
> (EINPROGRESS) from the enqueue call).  The pending thread then
> explicitly clears the flag once it acquires the queue lock.  This
> should prevent multiple threads from stacking up on the queue lock,
> so that if multiple threads are dumping packets into the ring
> concurrently, all but two (the one currently draining the queue and
> the one waiting for the lock) can continue without blocking.  One
> downside of this approach, though, is that each driver has to be
> changed to make an explicit call to clear the pending flag after
> grabbing the queue lock if the try-lock fails.  This is what I am
> currently running in production
> (www.freebsd.org/~jhb/patches/try_fail3.patch).
>
> However, this still results in a lot of duplicated code in each
> driver that wants to support multiq.  Several folks have expressed a
> desire to move in a direction where the stack has explicit knowledge
> of transmit queues, allowing us to hoist some of this duplicated code
> out of the drivers and up into the calling layer.  After discussing
> this a bit with Navdeep (np@), the approach I am looking at is to
> alter the buf_ring code flow a bit to more closely model the older
> code flow with IFQ and if_start methods.  That is, have the
> if_transmit methods always enqueue each packet that arrives to the
> buf_ring and then call an if_start-like method that drains a specific
> transmit queue.  This approach simplifies a fair bit of driver code
> and means we can potentially move the enqueue, etc. bits up into the
> calling layer and instead have drivers provide the per-transmit-queue
> start routine as the direct function pointer to the upper layers, a
> la if_start.
>
> However, we would still need a way to close the latency race.  I've
> attempted to do that by inverting my previous 'thread pending' flag.
> Instead, I make the buf_ring store a 'busy' flag.  This flag is
> managed by the single-consumer buf_ring dequeue method (the one that
> drbr_dequeue() uses).  It is set to true when a packet is removed
> from the queue while there are more packets pending.  Conversely, if
> there are no other pending packets then it is set to false.  The
> assumption is that once a thread starts draining the queue, it will
> not stop until the queue is empty (or, if it has to stop for some
> other reason such as the transmit ring being full, the driver will
> restart draining of the queue until it is empty, e.g. after it
> receives a transmit completion interrupt).  Now when the if_transmit
> routine enqueues the packet, it will get either a real error, 0 if
> the packet was enqueued and the queue was idle, or EINPROGRESS if the
> packet was enqueued and the queue was busy.  For the EINPROGRESS case
> the if_transmit routine just returns success.  For the 0 case it does
> a blocking lock on the queue lock and calls the queue's start routine
> (note that this means that the busy flag is similar to the old
> OACTIVE interface flag).  This does mean that in some cases you may
> have one thread that is sending what was the last packet in the
> buf_ring holding the lock when another thread blocks, and that the
> first thread will see the new packet when it loops back around, so
> that the second thread is wasting its time spinning; but in the
> common case I believe it will give the same parallelism as the
> current code.  OTOH, there is nothing to prevent multiple threads
> from "stacking up" in the new approach.  At least the try_fail3 patch
> ensured that only one thread at a time would ever potentially block
> on the queue lock.
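[If I'm reading the 'busy' proposal right, the driver side collapses
to something like the sketch below, reusing the txq/drain_queue
stand-ins from above.  The return-value semantics in the comment are
my reading of the description; the real code is in drbr.patch.]

static int
busy_if_transmit(struct ifnet *ifp, struct txq *txq, struct mbuf *m)
{
        int error;

        /*
         * Assumed semantics of the modified enqueue: 0 means the
         * packet was enqueued and the queue was idle, EINPROGRESS
         * means it was enqueued and a consumer is already draining,
         * and anything else is a real error (e.g. ENOBUFS).
         */
        error = drbr_enqueue(ifp, txq->br, m);
        if (error == EINPROGRESS)
                return (0);     /* the draining thread will send it */
        if (error != 0)
                return (error);

        /* Queue was idle: this thread must start it (blocking lock). */
        mtx_lock(&txq->lock);
        drain_queue(ifp, txq);  /* drains until the ring is empty */
        mtx_unlock(&txq->lock);
        return (0);
}

[Note the try-lock is gone entirely: the busy flag plays the role
OACTIVE used to, and the blocking lock is taken only by the thread
that found the queue idle.]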
> Another approach might be to replace the 'busy' flag with the 'thread
> pending' flag from try_fail3.patch, but to clear the 'thread pending'
> flag anytime the dequeue method is called rather than using an
> explicit 'clear pending' method.  (Hadn't thought of that until
> writing this e-mail.)  That would perhaps prevent multiple threads
> from waiting on the queue lock.
>
> Note that the 'busy' approach (or the modification I mentioned above)
> does rely on the assumption I stated above, i.e. once a driver starts
> draining a queue, it will drain it until empty unless it hits an
> "error" condition (link went down, transmit ring full, etc.).  If it
> hits an "error" condition, the driver is responsible for restarting
> transmit when the condition clears.  I believe our drivers already
> work this way now.
>
> The 'busy' patch is at http://www.freebsd.org/~jhb/patches/drbr.patch
>
> --
> John Baldwin

Q1: Has this been corrected?

Q2: Are there any case studies or benchmarks for buf_ring, or is it
just being used blindly because someone claimed it was better and
offered it for free? One of the points of locking is to avoid race
conditions, so the fact that you have races in a supposedly lock-less
scheme seems more than just ironic.

BC
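P.S. For anyone comparing the two schemes, my reading of the try_fail3
'pending' variant is roughly the sketch below, again reusing the
txq/drain_queue stand-ins from earlier in the thread.  The
drbr_enqueue_pending()/drbr_clear_pending() names are invented here
for illustration; see the actual patch for the real interface.

static int
pending_if_transmit(struct ifnet *ifp, struct txq *txq, struct mbuf *m)
{
        int error;

        if (mtx_trylock(&txq->lock)) {
                error = drbr_enqueue(ifp, txq->br, m);
                if (error == 0)
                        drain_queue(ifp, txq);
                mtx_unlock(&txq->lock);
                return (error);
        }

        /*
         * Hypothetical enqueue variant that atomically sets the
         * ring's 'pending' flag.  EINPROGRESS means *we* set it, so
         * we are the one thread that must block on the lock and
         * drain.  A later thread that fails the try-lock while the
         * flag is set just gets 0 back and returns immediately.
         */
        error = drbr_enqueue_pending(ifp, txq->br, m);
        if (error == EINPROGRESS) {
                mtx_lock(&txq->lock);
                drbr_clear_pending(txq->br);    /* the explicit clear step */
                drain_queue(ifp, txq);
                mtx_unlock(&txq->lock);
                error = 0;
        }
        return (error);
}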